It was a work in progress, but after almost a year and 40 'versions'
it is finally published: Paul Bradshaw's Scraping for Journalists.
Bradshaw teaches at City University London and at Birmingham City University, and is a respected data journalist.
You can order the e-book version, available in PDF, Mobi or Epub,
from Leanpub. Leanpub has an interesting concept: it offers all the
tools for both producing and publishing a book. And, not
unimportantly, the royalties are higher than in traditional publishing.
Bradshaw: “I have become a huge fan... The
format combines the best qualities of traditional book publishing
with those of blogging and social media”.
Published on De Nieuwe Reporter: http://www.denieuwereporter.nl/2013/05/paul-bradshaws-recepten-voor-datasoep/
Published on Memeburn: http://memeburn.com/2013/05/aspiring-data-journalist-this-book-is-a-must-read/
Must Read
Scraping for Journalists is a must-read for data journalists. One of
the problems in this field is how to get your data from online
resources into a spreadsheet. Scraping is the answer. But how do you
do that, given that most journalists are not coders? In 30 chapters
and almost 500 pages, Bradshaw gives his recipes for scraping data.
The book is not meant to be read from cover to cover; it is learning
by doing. You follow the recipes step by step on your computer, add
some variation to the example, and finally try to apply the recipes
to your own data. This works wonderfully, because learning to program
from scratch takes too much time before you get results. Instead you
get ready-made code that works, which you can use and experiment with
until you can successfully apply it to your data.
Fast start
Already in chapter one you make a fast start as a scraping
journalist. Within five minutes you have scraped your first data.
Bradshaw starts by explaining the commands ImportHTML
(http://support.google.com/drive/bin/answer.py?hl=en&answer=155182)
and ImportXML
(http://support.google.com/drive/bin/answer.py?hl=en&answer=155184),
used in Google Docs spreadsheets to import data from a web page into
a spreadsheet. The trick is to find the right table or list
containing the data. You can dig deep into the HTML or XML, but you
can also guess and experiment. Just try some numbers in the
expression, advises Bradshaw.
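To make the idea of "trying some numbers" concrete, here is a minimal Python sketch of what picking a table by its number amounts to. It is not taken from the book, and it is not how Google implements ImportHTML; the class TableCollector, the function import_table and the sample page are all made up for illustration. It assumes simple, non-nested tables.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text.

    Hypothetical helper for illustration; nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # all tables found so far
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_table(html, index):
    """Return table number `index`, counting from 1 as ImportHTML does."""
    collector = TableCollector()
    collector.feed(html)
    return collector.tables[index - 1]


# Made-up page: the first table is a navigation menu, the second holds the data.
SAMPLE_PAGE = """
<table><tr><td>Menu</td></tr></table>
<table>
  <tr><th>Member</th><th>Party</th></tr>
  <tr><td>A. Jansen</td><td>Green</td></tr>
</table>
"""

print(import_table(SAMPLE_PAGE, 1))  # the menu, not what we want
print(import_table(SAMPLE_PAGE, 2))  # the data table
```

Table 1 turns out to be a menu, so you try 2, which is exactly the guess-and-experiment approach described above.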
Of course, extracting tables from a website can be done faster with
a nice tool called Outwit Hub (http://www.outwit.com/products/hub/).
You just load your data web page in Outwit and press the 'table'
button, and there are your scraped data, ready to be exported in
Excel format. The free version is nice, but Bradshaw advises buying
the paid version for about 60 euros, because it does not have the
limit of scraping only a hundred lines. That matters especially when
you are scraping multiple pages. Take, for example, the 150 members
of a parliament. They all have their own web pages, structured in the
same way: each has a heading under which the member lists education
and former jobs. Doing this by hand, page after page, is boring and
time-consuming. Instead, make a scraper based on the opening and
closing tags for education and jobs, then run the scraper over the
150 individual member pages. Have a cup of coffee, and after a while
your data will be ready for export to Excel. Bradshaw goes to great
lengths to explain how to find the opening and closing tags in the
HTML for the data you are looking for, so that you will get it
working in the end.
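The marker-based approach for the member pages can be sketched in a few lines of Python. This is only an analogy to what you configure in Outwit Hub, not the tool's own method or code from the book. The markers such as <div class="education">, the function names and the two sample pages are invented for illustration; a real scraper would first download each member page.

```python
import csv
import io


def text_between(page, start_marker, end_marker):
    """Return the text between the first start_marker and the next
    end_marker, or an empty string if either marker is missing."""
    start = page.find(start_marker)
    if start == -1:
        return ""
    start += len(start_marker)
    end = page.find(end_marker, start)
    if end == -1:
        return ""
    return page[start:end].strip()


def scrape_members(pages):
    """Apply the same pair of markers to every member page.

    `pages` maps a member name to that member's HTML (hypothetical input).
    """
    rows = []
    for name, html in pages.items():
        rows.append({
            "member": name,
            "education": text_between(html, '<div class="education">', "</div>"),
            "jobs": text_between(html, '<div class="jobs">', "</div>"),
        })
    return rows


def to_csv(rows):
    """Write the rows as CSV text, which Excel can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["member", "education", "jobs"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Two made-up member pages that share the same structure.
SAMPLE_PAGES = {
    "A. Jansen": '<h1>A. Jansen</h1><div class="education">MSc Economics</div>'
                 '<div class="jobs">Teacher</div>',
    "B. de Vries": '<h1>B. de Vries</h1><div class="education">Law</div>'
                   '<div class="jobs">Lawyer</div>',
}

rows = scrape_members(SAMPLE_PAGES)
print(to_csv(rows))
```

Because every page is structured the same way, one pair of markers per field is enough; the loop does the boring page-after-page work.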
Scraperwiki
You are not the only journalist who is scraping data. ScraperWiki
(https://scraperwiki.com/) is the playground where you meet your
friends and share your skills. At ScraperWiki you will find various
scrapers that others have used to collect data; you can copy them,
revise them for your own purposes and run them. This sounds simple,
but scrapers are written in code, generally in one of three
languages: PHP, Ruby or Python. You don't have to be a programmer to
use the scripts, though. After Bradshaw's explanation of the
structure of a scraper you can start experimenting yourself. And, as
a good educator and trainer, Bradshaw gives you some assignments at
the end of each chapter.
Reading List
There is much more to discover: do you know how to scrape a PDF,
cells in a large spreadsheet, or data in a CSV file? In the book you
will find the recipe. Whenever I show such a trick in a training,
participants ask: do you have this in writing? Now I do. I will
therefore give the book a prominent place on the reading list for my
data journalism trainings.
Scraping for Journalists
By Paul Bradshaw
$12.99, download: https://leanpub.com/scrapingforjournalists