Since the beginnings of data journalism in the 1990s, then called CARR or Computer Assisted Research and Reporting, techniques for analyzing and visualizing data have improved enormously. One of the central tools in the nineties was the spreadsheet, standardized by Microsoft Excel. Spreadsheets are still much used for analysis, but moving into advanced data journalism, for example using R for deeper statistical analysis or D3 for creating better interactive graphics, creates various new challenges. You will often have to engage in different types of coding: I got stuck between R (for statistics) and JavaScript (for D3). Does a data journalist need to learn all these programming languages, or is there an easier and faster solution?
Looking at journalism practice, the answer is: get on the steep learning curve and start learning how to code. Here is some help. Next year Paul Bradshaw starts an MA in Data Journalism at the Birmingham School of Media.
Studying “coding and computational thinking being applied journalistically (I cover using JavaScript, R, and Python, command line, SQL and Regex to pursue stories)” is one of the elements of this new MA, writes Bradshaw on his blog.
Looking at the market, there really is demand for data journalists with coding skills. Here is a job listing from the Economist. The preferred qualities include a good understanding of data analytics; coding skills (JavaScript and Python), or a background in data journalism, are a plus.
In the following I will argue that a basic understanding of coding is very helpful, but that new services on the web help data journalists avoid getting stuck in coding.
Static
The data used for creating the graphs come from a small survey about mayors in the Netherlands and can be found here.
Excel created the possibility to analyze data and visualize the results of an analysis. Here is a simple example, showing the distribution of gender for Dutch mayors in percentages.
Simple bar graphs in Excel
This is a straightforward bar graph, simply showing that 80% of the mayors are men and 20% are women. The picture helps to understand the numbers through visualization. For a simple document or a report this works fine. From a data journalism perspective there are some problems.
1. The visualization is just a picture, that is, a small bitmap in for example JPG format. The resolution of these pictures is far too low to make them ready for print or for showing on a TV screen.
2. Publishing the graph on a web page or blog is no problem; the resolution is OK for the web. However, it is a static picture: hovering over it with a mouse does not reveal extra information.
Let’s start with the first problem. What to do? Should the graphics editor import the data from the journalist and create the graphic from scratch, reworking the data with a program like Illustrator or Inkscape? That would be double work, wasting time and energy.
Using another spreadsheet program than Excel, Calc from LibreOffice, we can export the graph as .svg, scalable vector graphics. Now the graph is not a bitmap but a vector graphic, which can be edited and made ready for high-resolution print or for TV screens.
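To make the difference with a bitmap concrete: an SVG file is plain text that describes shapes, which is why it scales and stays editable. Here is a minimal sketch in JavaScript, using the 80/20 gender split from the bar graph above. The function name barChartSvg, the sizes and the colors are my own assumptions for illustration; this is not what LibreOffice actually writes.

```javascript
// Build a tiny horizontal bar chart as an SVG string.
// Hypothetical helper: name, sizes and colors are assumptions for illustration.
function barChartSvg(rows, width, barHeight) {
  var gap = 10;          // vertical space between bars
  var labelWidth = 120;  // room reserved for the category labels
  var max = Math.max.apply(null, rows.map(function (r) { return r.value; }));
  var height = rows.length * (barHeight + gap);

  var bars = rows.map(function (r, i) {
    var y = i * (barHeight + gap);
    // Scale the bar so that the largest value fills the available width
    var w = Math.round((r.value / max) * (width - labelWidth));
    return '  <text x="0" y="' + (y + barHeight / 2) +
           '" dominant-baseline="middle">' + r.label + '</text>\n' +
           '  <rect x="' + labelWidth + '" y="' + y + '" width="' + w +
           '" height="' + barHeight + '" fill="steelblue"/>';
  });

  // The viewBox lets the same drawing scale to any output size
  return '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 ' +
         width + ' ' + height + '">\n' + bars.join('\n') + '\n</svg>';
}

var genderSvg = barChartSvg(
  [{ label: 'Men (80%)', value: 80 }, { label: 'Women (20%)', value: 20 }],
  440, 30
);
console.log(genderSvg);
```

Saving the printed text as a file ending in .svg gives a graphic that opens in a browser or in Inkscape, where every bar is still an editable object.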
Making the graph interactive is possible using for example Google Charts. Then we have to import the data into Google Charts or work with Google Sheets.
Simple maps
When producing maps the situation is almost the same. Here is a map showing the political parties of mayors for Dutch municipalities. Again, this map is a bitmap. Mapping programs like QGIS offer the possibility to export the map as .svg.
However, the map is not interactive on the web. Google Fusion Tables (FT) offers a solution for this. Here is an example of the map in Google FT. The map is OK for the web, using an embedded link to publish, although one can discuss the quality of the map. For print the resolution of the map is too low: either a screen dump or an export of the map creates a bitmap with low resolution. Should we use two software programs for mapping? I skip the possibility of creating your own web/map server using QGIS.
Tableau
A nice solution for bringing analysis and visualization into one application is the free tool and service Tableau Public. It is easy to use for calculations and creates nice graphs and dashboards, which can be exported as .pdf or .jpg, or published as an embedded link. Another important possibility is that Tableau can import statistical data (for example the result of calculations with R) and read geographical data (for example a .shp file for maps, which can be joined with data files). Tableau is in my opinion one of the best services for data journalism. Here is an example showing a dashboard with the distribution of gender over political parties and a map showing the gender of the mayor per municipality. The link shows the interactivity of the charts, which is useful online.
The exported graph
is a simple bitmap with low resolution. Missing here is the possibility to
export to .svg.
Data Driven Documents
Another solution is to use D3.js. D3, or Data Driven Documents, is a library for creating visualizations on the web that makes full use of web standards. More about D3, examples and tutorials. D3 uses the data in documents for visualization, applying:
- html: web document;
- css: style sheets of the web document;
- svg: graphics as a text file;
- js and json: JavaScript for manipulating and importing data;
- d3: library with different documents for visualizations using the above tools.
This is an important step forward in building interesting and high-quality visualizations for the web. And because of the use of .svg the charts can be rebuilt for print. There is a small problem: creating D3 graphics presupposes some knowledge of JavaScript.
Here is the script, which I edited from an example by Mike Bostock, for a graph showing the distribution of political parties.
<!DOCTYPE html>
<meta charset="utf-8">
<style>
.bar {
  fill: steelblue;
}
.bar:hover {
  fill: brown;
}
.axis--x path {
  display: none;
}
</style>
<svg width="960" height="500"></svg>
<script src="https://d3js.org/d3.v4.min.js"></script>
<script>
// Drawing area inside the margins
var svg = d3.select("svg"),
    margin = {top: 20, right: 20, bottom: 30, left: 40},
    width = +svg.attr("width") - margin.left - margin.right,
    height = +svg.attr("height") - margin.top - margin.bottom;

// x: one band per party; y: linear scale for the counts
var x = d3.scaleBand().rangeRound([0, width]).padding(0.1),
    y = d3.scaleLinear().rangeRound([height, 0]);

var g = svg.append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

// Load the data; the first function converts aantal from text to a number
d3.tsv("partij.tsv", function(d) {
  d.aantal = +d.aantal;
  return d;
}, function(error, partij) {
  if (error) throw error;

  x.domain(partij.map(function(d) { return d.partij; }));
  y.domain([0, d3.max(partij, function(d) { return d.aantal; })]);

  // Horizontal axis with the party names
  g.append("g")
      .attr("class", "axis axis--x")
      .attr("transform", "translate(0," + height + ")")
      .call(d3.axisBottom(x));

  // Vertical axis with the counts and the label "aantal"
  g.append("g")
      .attr("class", "axis axis--y")
      .call(d3.axisLeft(y).ticks(10))
    .append("text")
      .attr("transform", "rotate(-90)")
      .attr("y", 6)
      .attr("dy", "0.71em")
      .attr("text-anchor", "end")
      .text("aantal");

  // One rectangle per party
  g.selectAll(".bar")
    .data(partij)
    .enter().append("rect")
      .attr("class", "bar")
      .attr("x", function(d) { return x(d.partij); })
      .attr("y", function(d) { return y(d.aantal); })
      .attr("width", x.bandwidth())
      .attr("height", function(d) { return height - y(d.aantal); });
});
</script>
And here are the data in tab-separated values format (partij.tsv):
partij	aantal
CDA	115
CU	11
D66	19
GEEN	2
GL	7
OVG	7
PVDA	73
SGP	7
VVD	94
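Serving the script together with the data file from a web server creates an interactive bar graph. (d3.tsv requests the file over HTTP, so opening the page directly from disk usually does not work.)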
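The first function passed to d3.tsv converts each row, turning the text "115" into the number 115. What that step does can be sketched without D3 in plain JavaScript. The helper parseTsv below is hypothetical and not D3's own code; it assumes a clean file with just the two tab-separated columns partij and aantal, and it ignores quoting and other edge cases that d3.tsv handles. The three sample rows are taken from the data above.

```javascript
// Hypothetical stand-in for the parsing step that d3.tsv performs.
function parseTsv(text, convertRow) {
  // Split into lines and drop empty ones (for example a trailing newline)
  var lines = text.split(/\r?\n/).filter(function (l) { return l.length > 0; });
  var header = lines[0].split('\t');
  return lines.slice(1).map(function (line) {
    var cells = line.split('\t');
    var row = {};
    header.forEach(function (name, i) { row[name] = cells[i]; });
    return convertRow(row);
  });
}

// First three rows of partij.tsv, assuming no leading row-number column
var sample = 'partij\taantal\nCDA\t115\nCU\t11\nD66\t19\n';

var partij = parseTsv(sample, function (d) {
  d.aantal = +d.aantal;  // same text-to-number coercion as in the script
  return d;
});

// The largest count is what the script uses as the top of the y scale
var largest = Math.max.apply(null, partij.map(function (d) { return d.aantal; }));
console.log(partij, largest);
```

Without the coercion, aantal would stay text and the maximum would be taken over strings, which is why the conversion function matters.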
R Project
From a statistical perspective spreadsheets are a bit limited for analysis. The R project, used with RStudio, is a better tool. I have given 5 compelling reasons for using R in data journalism. Here is an example taken from the same data set about municipalities and their mayors. The data contain two variables: average income and average house price per municipality. Having loaded the data set in R, we are going to make the same plot for political parties:
gemeente <- read.csv("burg.csv", header = TRUE, sep = ";")
str(gemeente)  # structure of the data
'data.frame': 335 obs. of 9 variables:
 $ Gemeente        : Factor w/ 335 levels "Aa en Hunze",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Provincie       : Factor w/ 12 levels "D","F","FL","G",..: 1 7 4 2 12 12 8 9 3 12 ...
 $ Burgemeester    : Factor w/ 335 levels "Aartsen J.J. van",..: 205 196 25 91 213 316 54 92 320 270 ...
 $ Vanaf           : Factor w/ 268 levels "10-1-2012","1-10-2001",..: 99 159 3 60 144 146 131 215 268 81 ...
 $ Geslacht        : Factor w/ 2 levels "M","V": 1 1 1 1 1 1 1 1 1 2 ...
 $ Geboortedatum   : Factor w/ 331 levels "10-10-1969","10-11-1955",..: 297 253 76 224 211 61 126 48 164 298 ...
 $ jaar            : int 1961 1956 1949 1952 1968 1963 1955 1970 1964 1966 ...
 $ leeftijd        : int 56 61 68 65 49 54 62 47 53 51 ...
 $ Politieke.Partij: Factor w/ 9 levels "CDA","CU","D66",..: 7 1 1 6 9 7 1 9 3 1 ...
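p <- gemeente$Politieke.Partij  # assumed step: take the party column
p <- table(p)                   # count the mayors per party
p
p
 CDA   CU  D66 GEEN   GL  OVG PVDA  SGP  VVD
 115   11   19    2    7    7   73    7   94
p <- as.data.frame(p)           # assumed step: the names() output below implies a data frame
names(p)
[1] "p"    "Freq"
colnames(p)[1] <- "partij"      # assumed step: str() below shows this column name
colnames(p)[2] <- "aantal"
str(p)
'data.frame': 9 obs. of 2 variables:
 $ partij: Factor w/ 9 levels "CDA","CU","D66",..: 1 2 3 4 5 6 7 8 9
 $ aantal: int 115 11 19 2 7 7 73 7 94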
By default the plot can be exported as an image or as .pdf. With the following code we save the plot as .svg:
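library(ggplot2)  # provides ggplot()
library(svglite)  # provides the svglite() graphics device
svglite("plot.svg", width = 10, height = 7)
# p is the data frame with the columns partij and aantal;
# print() makes sure the plot is drawn when the code runs as a script
print(ggplot(p, aes(x = partij, y = aantal)) + geom_bar(stat = "identity"))
dev.off()  # close the device, which writes plot.svg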
R is command-line based and is a programming language in its own right, next to Python. So there is more coding to learn in advanced data journalism. There are many libraries or packages in R for different statistical approaches. RStudio is the best environment to handle the packages, do the calculations from the command line, and print or export the results.
The output of R is generally a figure or a printed chart, or we can export the result as a bitmap or as scalable vector graphics. This is fine for hard copy media, but not for online. For the web I want interactive graphics. The output of R is better suited to scientific reports than to journalism. I skip the possibility of a Shiny server for interactive web applications in R.
Dilemma
Now we are stuck between R and D3. Can we have the best of both worlds? How do we get into D3 from R, that is, how do we move from R to JavaScript? There are various solutions.
- export the R data to Tableau and make the graph in Tableau;
- export from R directly into plot.ly and make a graph in this web service.
The first option is simply to store your calculations in .rdata format and import them in Tableau to produce the graph.
In the second option we use the service of plot.ly and export the visualization produced with R directly into plot.ly, using the plotly package/library in R. That means converting the graphics produced by ggplot2, the plotting package of R, into D3 visualizations. Now we don’t have to worry about JavaScript and D3 libraries; this is all done by plot.ly. The number of different graphs which can be used is a bit limited, though.
library(plotly)  # provides plot_ly()
# create an interactive graph, which can be saved as an image or a web page
plot_ly(gemeente, x = ~Politieke.Partij)
After creating a login and an API key at plot.ly we can create the graph at plot.ly. Here is the link: https://plot.ly/~peterverweij/48/
The chart is now available in D3 format at plotly and can be edited and exported.
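What ends up at plot.ly is in essence a description of the chart as data. A rough JavaScript sketch of that idea, using the party counts from the R table above: the trace object uses the x, y and type fields of a plotly.js bar trace. The helper name toBarTrace, the chart title and the element id "chart" are my own assumptions, and the Plotly.newPlot call only runs when plotly.js has been loaded in the page.

```javascript
// Party counts as produced by table() in the R example
var counts = { CDA: 115, CU: 11, D66: 19, GEEN: 2, GL: 7, OVG: 7, PVDA: 73, SGP: 7, VVD: 94 };

// Hypothetical helper: turn the counts into a plotly.js bar trace
function toBarTrace(counts) {
  var parties = Object.keys(counts);
  return {
    type: 'bar',
    x: parties,                                          // one bar per party
    y: parties.map(function (p) { return counts[p]; })   // height of each bar
  };
}

var trace = toBarTrace(counts);
var layout = {
  title: { text: 'Mayors per political party' },  // assumed title
  yaxis: { title: { text: 'aantal' } }
};

if (typeof Plotly !== 'undefined') {
  // In a web page with plotly.js loaded and an element with id "chart" (assumed)
  Plotly.newPlot('chart', [trace], layout);
} else {
  // Elsewhere, just show the chart description as JSON
  console.log(JSON.stringify({ data: [trace], layout: layout }, null, 2));
}
```

The point of the sketch is that the journalist only supplies the data and a few settings; the drawing and the interactivity are left to the library.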
Various services on the web help data journalists avoid the deeper problems of coding in different languages; the services take care of that. By exporting data to such a service, visualizations in different styles and formats can be produced. Detailed knowledge of Python or D3 is not needed; some basic insight will do to get the visualizations out.