Thursday 15 July 2021

DATA JOURNALISM 2.0

Version 1.0

Analyzing and visualizing a table with figures for a newspaper article is not exceptional anymore. Take for example local taxes per municipality. Once you have downloaded the figures into a spreadsheet, it is not difficult to notice which of the local taxes generates the highest income for municipalities. A simple bar graph will do; or plotting the tax income per municipality on a map will draw attention to the most expensive municipalities. Here are two examples made with Flourish about taxation for garbage collection in municipalities in the province of Gelderland. A bar graph: https://public.flourish.studio/visualisation/5152134/ and a map: https://public.flourish.studio/visualisation/5152371/ . Here is a dashboard made in Tableau: https://public.tableau.com/app/profile/verweijpjc/viz/reiniging/Dashboard1
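
To give an idea of how little code this takes nowadays, here is a minimal sketch in R (the language I use later in this post) that draws such a bar graph. The file name and column names are hypothetical; replace them with those of your own download.

library(ggplot2)

# hypothetical CSV with one row per municipality and a column with the
# garbage collection tax in euros
taxes <- read.csv("heffingen_gelderland.csv", stringsAsFactors = FALSE)

# bar graph: municipalities sorted by the level of the tax
ggplot(taxes, aes(x = reorder(gemeente, afvalstoffenheffing), y = afvalstoffenheffing)) +
  geom_col() +
  coord_flip() +
  labs(x = "Municipality", y = "Garbage collection tax (euro)")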

Sometimes the data provider makes downloading the data easy (http://d3-media.blogspot.com/2021/04/r-data-journalism-helpers.html ), but often also provides analysis and visualisation (https://www.cbs.nl/nl-nl/nieuws/2021/04/gemeenten-begroten-11-3-miljard-euro-aan-heffingen-in-2021 ).

Here is how a local/regional newspaper reported on these taxes: https://www.gelderlander.nl/arnhem-e-o/kaart-deze-gemeenten-zijn-het-duurst-om-in-te-wonen~a217736c/

This is data journalism 1.0, and it is no longer the rocket science it looked like 25 years ago. Data journalism 1.0 is almost 25 years old, and the original beta version is more than 50 years old (http://d3-media.blogspot.com/2018/07/new-steps-in-data-journalism.html ). A lot has changed; most importantly, the tools are easier to handle and require less skill and training. Data from a web page, for example, can be imported directly into Excel, and visualizing it based on a template from Flourish is almost standard procedure for journalists.

Version 2.0

On the other hand, data software is developing fast. Data science is creating more and more software for analyzing data. Interesting for application in journalism is software generally known as Artificial Intelligence (AI) or Machine Learning (ML). You don't need special machines to run this software, nor do you have to pay huge amounts to use it. A good laptop with a fast processor, enough memory and a decent video card will do. The software comes in two flavors: R or Python. Skipping the discussion about which is best: both are capable of analyzing data; Python is a full programming language, while R is more focused on statistics. I work mostly with R (http://d3-media.blogspot.com/2014/04/vijf-redenen-om-r-te-gebruiken-in-data.html ), running on a Linux operating system (OS) (http://d3-media.blogspot.com/2011/09/linux-voor-journalisten.html ), but Windows or Apple will also work.

R has a steep learning curve. Here is how to start: http://d3-media.blogspot.com/2019/03/learning-r-for-data-journalists.html There is no graphical user interface (GUI), so you work from a terminal, typing in commands or merging a set of commands into a small program. R has libraries for special jobs, and one set of libraries is dedicated to ML; AI and ML have lots of applications.
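
To give a feel for working from the console, here is what a minimal R session looks like. The commands below use a small data set that ships with R, so they run as-is.

install.packages("rpart")   # install a library once (rpart builds decision trees)
library(rpart)              # load it at the start of every session

data(iris)                  # a built-in example data set
head(iris)                  # show the first rows
summary(iris)               # basic statistics per column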

Here is an example analyzing a dataset of municipalities in the Netherlands. I will also use this data set for explaining machine learning. This example shows a standard analysis of the data in R:

https://www.kaggle.com/peterverweij/gemeente-test

This example is shown in Kaggle; more about this interface: http://d3-media.blogspot.com/2019/02/data-journalists-what-do-you-know-about.html
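
For readers who do not want to open the notebook, this is roughly what such a standard analysis looks like in R. I assume a CSV with one row per municipality; the column names (inwoners, inkomen, woz) are hypothetical stand-ins for population, average income and average house value.

gem <- read.csv("gemeenten.csv", stringsAsFactors = FALSE)

str(gem)                    # which variables are there and what type are they
summary(gem)                # minimum, maximum, mean and quartiles per column
cor(gem$inkomen, gem$woz)   # correlation between income and house value
hist(gem$inkomen)           # distribution of average income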

Machine Learning

For data journalism, ML has various areas of implementation or application:

Automated content production or robot journalism (https://memeburn.com/2014/03/what-a-californian-earthquake-can-teach-us-about-the-future-of-journalism/ ) is one of them and is drawing much attention at the moment. Another area is content optimization: optimizing the content for a specific user.

For data journalism, the area of data mining is the most interesting. The following chart gives an overview of the possibilities:




(Chart from: https://nl.mathworks.com/discovery/machine-learning.html )


I will not discuss all these methods in detail.

First I will discuss the basic idea of machine learning and then I will show some examples. I will use a data set about Dutch municipalities to show linear regression, decision trees and neural networks. Another data set, based on Twitter, I will use to show K-means.

The core of AI or ML is a black box: you give the box data as input, the box then performs complicated statistical operations (the algorithm), which you fine-tune with various options, and finally you get the output. Take for example the Titanic: there is a complete data set about the passengers. With ML it is possible to calculate your chances of surviving the shipwreck. Or to predict whether the mayor of a Dutch municipality will be male or female.
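
As a sketch of that input-algorithm-output idea, this is what the Titanic example could look like in R, with a logistic regression as the black box. I assume a passenger-level data frame called titanic with columns Survived (0/1), Sex, Age and Pclass, as in the well-known Kaggle Titanic data set.

# input: the passenger data; algorithm: logistic regression (glm)
model <- glm(Survived ~ Sex + Age + Pclass, data = titanic, family = binomial)

# output: the predicted chance of survival for a (new) passenger
new_passenger <- data.frame(Sex = "female", Age = 30, Pclass = 3)
predict(model, new_passenger, type = "response")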

To use ML you don't need exact knowledge of the black box, the algorithm itself. That is for the coders or data scientists who design this software. The basic question for your research is: what do you want to do?

From the chart we see that there are two entries: supervised and unsupervised learning. Supervised learning means that the model (algorithm) which predicts the gender of the mayor or the chances of survival must be trained on a known data set. When the model is trained, and you know the margins of error, it can be applied to the complete data set to make the predictions.
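
In R the training step usually means splitting the known data into a training part and a test part, so the margin of error can be measured on data the model has not seen. A minimal sketch, reusing the hypothetical municipality data frame gem and a column geslacht for the gender of the mayor:

gem$geslacht <- as.factor(gem$geslacht)                   # the outcome is a category

set.seed(42)                                              # make the random split repeatable
train_rows <- sample(nrow(gem), size = 0.7 * nrow(gem))   # 70% of the rows for training
train <- gem[train_rows, ]
test  <- gem[-train_rows, ]

library(rpart)
model <- rpart(geslacht ~ inwoners + inkomen + woz, data = train, method = "class")

# the margin of error: how often is the prediction correct on unseen data?
pred <- predict(model, test, type = "class")
mean(pred == test$geslacht)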

In unsupervised learning the data is immediately read into an algorithm which makes the best of it. For example, analyzing tweets from two populist members of the Dutch House of Representatives (Wilders and Baudet) shows that their tweets fall into different clusters. Meaning they are both right-wing populists but focus on different issues. Creating the clusters is a mathematical operation with no control data.
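
A sketch of how such a tweet clustering can be done in R with K-means: turn the tweets into a document-term matrix (how often does each word occur in each tweet) and let the algorithm group them. The data frame tweets with columns text and author is hypothetical.

library(tm)

corpus <- VCorpus(VectorSource(tweets$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("dutch"))

dtm <- DocumentTermMatrix(corpus)        # rows = tweets, columns = words

set.seed(42)
km <- kmeans(as.matrix(dtm), centers = 2)

table(km$cluster, tweets$author)         # do the clusters line up with the two politicians?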


Under supervised learning we have two areas: classification and regression.

Unsupervised learning focuses on clustering.


Examples of machine learning


0. Kaggle

I will use Google's Kaggle interface to show the code of the various algorithms. Here is a general intro to Kaggle.

- machine learning in kaggle: http://d3-media.blogspot.com/2019/03/kaggle-is-there-data-journalism-in.html


1. Regression means that, in the simple case with two variables, the variation of one variable is related to the variation of the other variable. Average income in a city, for example, will relate to the average price of houses; in relatively rich cities, houses will be more expensive. This relationship between the variables, based on the variation they have in common, can be expressed in a number: the correlation. Or by a line in a scatter diagram, that is a linear model, commonly called the trend (line).
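
A minimal sketch of this in R, again with the hypothetical columns inkomen (average income) and woz (average house value) in the municipality data frame gem:

cor(gem$inkomen, gem$woz)                # the relationship expressed in one number

model <- lm(woz ~ inkomen, data = gem)   # linear model: house value explained by income
summary(model)                           # slope, intercept and explained variance (R squared)

plot(gem$inkomen, gem$woz)               # scatter diagram
abline(model)                            # draw the trend line through it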


When the number of variables increases, the prediction of the outcome of one variable becomes more complicated. Then we have to use another algorithm or model for the prediction. With decision trees it is possible to predict the gender of a mayor based on population, income, house price and unemployment; or to predict the political party of the mayor based on the same variables. Here the predicted outcome is a category, the nominal level of measurement.
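
A sketch with the rpart library (the same library used in the Kaggle notebook linked below), with hypothetical column names:

library(rpart)

# the outcome (geslacht, gender of the mayor) is a category, so method = "class"
gem$geslacht <- as.factor(gem$geslacht)
tree <- rpart(geslacht ~ inwoners + inkomen + woz + werkloosheid,
              data = gem, method = "class")

plot(tree)   # the splits the algorithm found
text(tree)   # label the splits

# the same recipe predicts another category, for example the party of the mayor:
# rpart(partij ~ inwoners + inkomen + woz + werkloosheid, data = gem, method = "class")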


Neural networks make it possible to estimate the value, the quantity of a variable, at the ratio level of measurement. For example, an estimation of the income of a city based on the other variables.
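
The Kaggle example below uses TensorFlow; as a lighter sketch, a small network with the nnet package gives the same idea. Note that neural networks like their inputs on a comparable scale; the column names are again hypothetical.

library(nnet)

# put the numeric variables on a comparable scale
gem_scaled <- as.data.frame(scale(gem[, c("inkomen", "inwoners", "woz", "werkloosheid")]))

# one hidden layer with 3 neurons; linout = TRUE because the outcome is a quantity
net <- nnet(inkomen ~ inwoners + woz + werkloosheid,
            data = gem_scaled, size = 3, linout = TRUE)

predict(net, gem_scaled)   # the estimated (scaled) income per municipality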


- regression: decision trees - random forest: https://www.kaggle.com/peterverweij/prediction-simple-machine-learning


- regression: decision trees with rpart: https://www.kaggle.com/peterverweij/gender-rpart-predict-gem


- regression: neural networks with TensorFlow: https://www.kaggle.com/peterverweij/kernel-tensorflow-woz


2. Clustering

The goal of clustering is to group or cluster observations that have similar characteristics. This is an example of unsupervised learning, so we have no control or check: inspection of the output is the test. In the example below we regroup municipalities.

- clustering: https://www.kaggle.com/peterverweij/clustering-gemeentedata-using-kmeans
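
Roughly, such a K-means clustering looks like this in R, with the same hypothetical columns of the municipality data frame gem:

# put the variables on one scale, otherwise the largest numbers dominate the distance
vars <- scale(gem[, c("inwoners", "inkomen", "woz", "werkloosheid")])

set.seed(42)
km <- kmeans(vars, centers = 4)   # ask for four groups; inspect the result and adjust

km$centers                        # what characterizes each cluster
table(km$cluster)                 # how many municipalities fall in each cluster
gem$cluster <- km$cluster         # attach the cluster label to every municipality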


3. Classification

- classification with Voyant using nearest neighbor: http://d3-media.blogspot.com/2020/03/two-faces-of-twitter-populism-in.html
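
Voyant is a point-and-click tool, but nearest neighbour classification can also be done in a few lines of R with the class package. The sketch below reuses the hypothetical train/test split of the municipality data from the supervised-learning example above.

library(class)

# the k nearest neighbours are found by distance, so scale the inputs first
train_x <- scale(train[, c("inwoners", "inkomen", "woz")])
test_x  <- scale(test[,  c("inwoners", "inkomen", "woz")],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# each test case gets the label of the majority of its 5 nearest neighbours
pred <- knn(train_x, test_x, cl = as.factor(train$geslacht), k = 5)
mean(pred == test$geslacht)   # proportion of correct predictions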




Literature

More background and detail on machine learning for data journalists:

https://towardsdatascience.com/10-machine-learning-methods-that-every-data-scientist-should-know-3cc96e0eeee9

and

https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/

and

http://d3-media.blogspot.com/2019/07/journalism-as-algorithm.html review of Automating the News: How Algorithms Are Rewriting the Media, by Nicholas Diakopoulos (Harvard University Press, 2019; ISBN 9780674976986)
