Sunday 17 November 2013

Tabula to get your data out of a PDF

You all know the situation: you are working on a data journalism project and have finally found the data... but it is in PDF format. How do you get the PDF data into a spreadsheet? There are web services such as cometdocs or pdftoexcelonline. Or you could try to build a scraper yourself, but then you have to read Paul Bradshaw's "Scraping for Journalists" first.

Memeburn: http://memeburn.com/2013/11/the-5-minute-guide-to-scraping-data-from-pdfs/

Now your problems are solved, or at least almost, with Tabula. "Tabula is a tool for liberating data tables trapped inside PDF files". Awesome: import your PDF, push a button and there is your data in spreadsheet format! You save the scraped page as CSV and import it into any spreadsheet program.
A small problem is that Tabula only scrapes one PDF page at a time. So 10 PDF pages of data give you 10 spreadsheets.
Installing Tabula is a piece of cake: download, unzip and run. Tabula is written in Java and uses Ruby for scraping. Ruby is also one of the languages used on ScraperWiki to build tailor-made PDF scrapers.
(Thanks to Arlen Poort of NRC Handelsblad, who pointed me to Tabula in a training session for the VVOJ.)
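
Since every PDF page ends up as its own CSV file, you may want to glue those files back together into one table. Below is a minimal sketch in Python of how that could be done. It is not part of Tabula: the function name `merge_page_csvs` and the file names are made up for illustration, and it assumes that all pages have the same columns and that each file starts with the same header row.

```python
import csv


def merge_page_csvs(page_files, output_file, has_header=True):
    """Concatenate per-page CSV files into a single CSV file.

    Assumption: every file has the same columns. If has_header is True,
    the first row of each file is treated as a header, and only the
    header of the first file is kept.
    Returns the number of rows written (including the header).
    """
    rows_written = 0
    with open(output_file, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for file_index, page_file in enumerate(page_files):
            with open(page_file, newline="", encoding="utf-8") as src:
                for row_index, row in enumerate(csv.reader(src)):
                    # Skip the repeated header on every page after the first.
                    if has_header and file_index > 0 and row_index == 0:
                        continue
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

You would call it with the files in page order, for example `merge_page_csvs(["page_01.csv", "page_02.csv"], "merged.csv")`. If the pages scraped from your PDF do not repeat the header, pass `has_header=False` so that no data rows are dropped.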
