Saturday, 11 May 2013

Recipes For Data Collection

It was a work in progress, but after almost a year and some 40 'versions' it is finally published: Paul Bradshaw's Scraping for Journalists. Bradshaw, a respected data journalist, teaches at City University London and Birmingham City University. The e-book is available in PDF, Mobi or EPUB format. Leanpub, where you can order a copy, has an interesting concept: it offers all the tools for both producing and publishing a book. And, not unimportantly, the royalties are higher than in traditional publishing. Bradshaw: “I have become a huge fan... The format combines the best qualities of traditional book publishing with those of blogging and social media”.

Published on De Nieuwe Reporter.
Published on Memeburn.

Must Read
Scraping for Journalists is a must-read for data journalists. One of the recurring problems in this field is how to get data from online sources into a spreadsheet. Scraping is the answer. But how do you do that, given that most journalists are not coders? In 30 chapters and almost 500 pages, Bradshaw gives his recipes for scraping data. The book is not meant to be read from cover to cover; it is learning by doing. You follow the recipes step by step on your computer, add some variation to the example, and finally try to apply the recipes to your own data. This works wonderfully, because learning to program from scratch takes too long before you see results. Here you get ready-made code that works, which you can use and experiment with until you can successfully apply it to your own data.

Fast Start
Already in chapter one you get off to a fast start as a scraping journalist. Within five minutes you have scraped your first data. Bradshaw starts by explaining the functions ImportHTML() and ImportXML(), which are used in Google Docs spreadsheets to import data from a web page into a spreadsheet. The trick is to find the right table or list containing the data. You can dig deep into the HTML or XML, but you can also guess and experiment. Just try some numbers in the expression, Bradshaw advises.
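To make the "try some numbers" advice concrete, here is a minimal Python sketch of the same idea: pick the n-th table on a page and return its rows. It is not the Google Docs function itself and not code from the book. The names TableExtractor, import_table and SAMPLE_PAGE are made up for this illustration, the page is a hard-coded stand-in for a real web page, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # row currently being filled, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_table(html, index):
    """Return table number `index` (counting from 1) as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables[index - 1]


# Hypothetical page: the first table is layout clutter, the second holds the data.
SAMPLE_PAGE = """
<html><body>
<table><tr><td>Menu</td></tr></table>
<table>
<tr><th>Member</th><th>Party</th></tr>
<tr><td>A. Jansen</td><td>Green</td></tr>
<tr><td>B. de Vries</td><td>Labour</td></tr>
</table>
</body></html>
"""

# Guess and experiment: try table 1, then table 2, and see which one has the data.
for number in (1, 2):
    print(number, import_table(SAMPLE_PAGE, number))
```

Running it shows why guessing works: table 1 turns out to be a menu, table 2 is the one you want.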

Of course, extracting tables from a website can be done faster with a nice tool called OutWit Hub. You just load the web page with your data in OutWit and push the 'table' button, and there is your scraped data, ready to be exported in Excel format. The free version is nice, but Bradshaw advises buying the full version for about 60 euros, because it does not have the limit of scraping only a hundred lines. That matters especially when you are scraping multiple pages. Take, for example, the 150 members of a parliament. They all have their own web page, structured in the same way, with a heading under which the member lists his or her education and former jobs. Collecting this by hand, page after page, is boring and time-consuming. Instead, make a scraper based on the opening and closing tags around education and jobs, and then run it over the 150 individual member pages. Have a cup of coffee, and after a while your data will be ready for export to Excel. Bradshaw takes great care in explaining how to find the opening and closing tags in the HTML for the data you are looking for. That makes it likely you will get your scraper working after a few attempts.
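The recipe of marking an opening and a closing tag and running the scraper over many pages can also be sketched in a few lines of Python. This is an illustration of the idea, not OutWit Hub's internals or code from the book. The function names, the two member pages in PAGES and the div markers are all invented for the example; a real scraper would download each page instead of using hard-coded strings.

```python
import csv
import io


def between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Returns an empty string if either marker is missing.
    """
    start = text.find(start_marker)
    if start == -1:
        return ""
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return ""
    return text[start:end].strip()


# Hypothetical member pages, all structured in the same way.
PAGES = {
    "jansen": '<h1>A. Jansen</h1><div class="education">Law, Leiden</div>'
              '<div class="jobs">Lawyer</div>',
    "devries": '<h1>B. de Vries</h1><div class="education">Economics, Rotterdam</div>'
               '<div class="jobs">Teacher</div>',
}


def scrape_members(pages):
    """Apply the same pair of markers to every page and collect one row per member."""
    rows = []
    for member, html in pages.items():
        rows.append({
            "member": member,
            "education": between(html, '<div class="education">', "</div>"),
            "jobs": between(html, '<div class="jobs">', "</div>"),
        })
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["member", "education", "jobs"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(scrape_members(PAGES)))
```

The markers are the fragile part: if one page uses slightly different HTML, that member's field comes back empty, which is exactly why it pays to study the opening and closing tags carefully.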

You are not the only journalist who is scraping data. ScraperWiki is the playground where you can meet your peers and share your skills. At ScraperWiki you will find various scrapers that others have used to collect data; you can copy one, revise it for your own purposes and run it. This sounds simple, but scrapers are written in code, generally in one of three languages: PHP, Ruby or Python. Still, you don't have to be a programmer to use the scripts. After Bradshaw's explanation of how a scraper is structured, you can start experimenting yourself. And, as a good educator and trainer, Bradshaw gives you some assignments at the end of each chapter.

Reading List
There is much more to discover: do you know how to scrape a PDF, cells in a large spreadsheet, or data in a CSV file? In the book you will find the recipe. When I demonstrate such a trick in a training session, participants always ask: do you have this in writing? Now I do. That is why I will give the book a prominent place on the reading list for my data journalism training courses.

Scraping for Journalists
By Paul Bradshaw
