Thursday 14 March 2019

KAGGLE: IS THERE DATA JOURNALISM IN MACHINE LEARNING? (4)

The answer is YES, there is data journalism in machine learning. For example, there are training sessions on machine learning at IRE 2019. Here are a few other links relating data journalism and machine learning:

Is there any use for machine learning in the newsroom?


What could be the output for a data journalist working with machine learning?
I have done some experiments in R at Kaggle using a data set about Dutch municipalities.

1. Clustering using kmeans: this makes it possible to generate meaningful clusters of municipalities based on income, population, house value, unemployment, etc. Although the result is not very impressive, it is possible to find various centers or clusters in the data: large municipalities, high-income municipalities, and high-unemployment municipalities. Here is the kernel with the code and the results: https://www.kaggle.com/peterverweij/clustering-gemeentedata-using-kmeans


2. Predicting using a simple linear model: using two variables in the data set, income and house value. A plot of the data shows that there is a strong relationship, which the regression line shows as well. Creating a linear model for these variables makes it possible to predict, for example, income from the value of the house.
3. Predicting using randomForest: the linear model is simple and works for interval variables. But predicting nominal values, for example the gender or political party of the mayor, requires a different approach. randomForest provided interesting outcomes.
Here is the link to both predicting models: https://www.kaggle.com/peterverweij/prediction-simple-machine-learning
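
The kernels themselves are written in R. As a rough illustration of what the clustering and linear-model steps in the list above do, here is a minimal Python sketch. The numbers are made up (they are not the Dutch municipality data), and the helper names kmeans and fit_line are my own, not functions from the kernels. The random forest step is left out because it needs an external package.

```python
# Sketch only: hypothetical numbers, not the real municipality data set.
from math import dist  # available in Python 3.8 and later


def kmeans(points, centers, rounds=10):
    """Plain k-means: assign every point to its nearest center, then move
    each center to the mean of its points; repeat a fixed number of times."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(rounds):
        labels = [min(range(len(centers)), key=lambda i: dist(p, centers[i]))
                  for p in points]
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical municipalities: (average income, average house value),
# both in thousands of euros.
towns = [(22, 150), (24, 170), (23, 160), (38, 320), (40, 350), (42, 340)]

# Start from two of the points as initial centers (an arbitrary choice).
labels, centers = kmeans(towns, centers=[towns[0], towns[-1]])
print(labels)  # cluster number per municipality

# Predict income from house value, as in the linear model of item 2.
house_values = [t[1] for t in towns]
incomes = [t[0] for t in towns]
intercept, slope = fit_line(house_values, incomes)
print(round(intercept + slope * 250, 1))  # predicted income, house worth 250
```

In R the same two steps are single calls (kmeans and lm); the sketch only spells out what happens inside them.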
Data journalists have to dig deep into statistics, but these examples show that there is added value for reporting. These examples are of course limited; there is a whole set of different machine learning algorithms available in R, and I have only tried two. Here is the list:
https://www.kdnuggets.com/2015/06/top-20-r-machine-learning-packages.html

https://www.r-bloggers.com/what-are-the-best-machine-learning-packages-in-r/ 

Saturday 2 March 2019

LESSONS FROM KAGGLE (3)

The good thing about Kaggle from the perspective of data journalism is that it holds an interesting collection of datasets, ranging from the Olympics to Economic Freedom and Sovereign debt.
The formats of the datasets vary from csv to sdl and json. The datasets can be downloaded and used for private analysis. However, once you have selected a dataset you can also search for the kernels; that is, the analyses of the data.

Let's take an example: search for data using the tag 'economics' and select the 'Economic Freedom Report' (2018). Scroll down and get an idea of the data. Next, go to the top left and select 'open in': either in Google Sheets or Google Data Studio; or use 'copy API command' to download the data to your own machine.

But before you start working, check what others have been doing with the data. Click on kernels. There are 9 kernels available, in R and in Python. Open the first one, which focuses on 'IS AN ECONOMICALLY FREE COUNTRY A BETTER PLACE TO LIVE?'. Perhaps it is useful for your writing; perhaps not, and you start analyzing yourself. Either way, looking at these examples gives you a nice introduction to how to do data analysis.

When you have chosen a dataset you can start a new kernel and your own analysis. This is interesting because all the software you are using is in the cloud: you are working on a virtual machine with R and Python installed, and it runs the code you enter, all for free. You can save your work, make it public, share it with others and ask for comments. I think this is a very strong point of Kaggle.

Here is my example. I uploaded some old data about the Dutch municipalities and used R for some analysis. Have a look at the results:
Dataset in csv: https://www.kaggle.com/peterverweij/data-about-dutch-municipalitiestest
Kernel in R: https://www.kaggle.com/peterverweij/gemeente-test

Friday 1 March 2019

LEARNING R FOR DATA JOURNALISTS

In the 1970s I was trained in using SPSS (on a mainframe) for analyzing social data. But when I lost the free licence to use SPSS, I had to find an alternative. Of course spreadsheets were available, but that is not real statistics. Then I found out that R could load a GUI for SPSS. Interesting, but I learned that R had more possibilities: packages for all kinds of analysis. Second, I found RStudio, a perfect GUI for R.

But how do you get started with R, still remembering that data journalism is the goal?


Why?
5 reasons for using R:
http://memeburn.com/2014/05/5-compelling-arguments-for-using-r-in-data-journalism/ 


Here is a start to learn the basics:

Computerworld published The Beginner’s Guide to R. This 30-page guide will show you how to install R, load data, run analyses, make graphs, and more.

Complete guide as PDF download: https://www.computerworld.com/article/2884322/learn-r-programming-basics-with-our-pdf.html#tk.ctw-eos

Other Resources published by Computerworld:
60+ R resources to improve your data skills: https://www.computerworld.com/article/2497464/top-r-language-resources-to-improve-your-data-skills.html

Overview: Quick R: http://www.statmethods.net/ 

And finally an intro to R coming from Kaggle: https://www.kaggle.com/hamelg/intro-to-r-index  or  http://hamelg.blogspot.com/2015/10/introduction-to-r-index.html?view=classic

Teach yourself basic R interactively in RStudio using the swirl package.
Swirl is an R package designed to teach you R straight from the command line. It provides exercises and feedback from within your R session to help you learn in a structured, interactive way.
More on swirl: https://www.rstudio.com/online-learning/

Here is a free book covering all the basics:
Import the pdf and read this intro to R on Kindle:
https://www.r-bloggers.com/yarrr-the-pirates-guide-to-r-2/ 

Keep informed about the latest developments and follow: https://www.r-bloggers.com/ 

WHAT KAGGLE CAN TEACH DATA JOURNALISTS (2)

Although Kaggle focuses on data science and machine learning, there is a lot of interesting knowledge and many ideas that data journalists can get out of this platform.

Data
The easiest way to get data nuggets out of this goldmine is by using Google Data Studio (see also: https://d3-media.blogspot.com/2019/02/and-even-more-viz-tools.html ) and/or Google Sheets (Google Drive). Log in to Kaggle and find a data set by keyword or tag. Click on the three dots in the top right corner and choose one of the two options: Sheets or Data Studio.

The second option for getting data is using the API (see: https://www.kaggle.com/docs/api ). Choose 'copy API command' in the menu and use that command in the terminal. A zip file, with the data in .csv format, will be downloaded. From here you can analyze and visualize, for example by importing the data set in RStudio. But a lot of work has already been done.
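
Once the zip file is on your machine, you do not even have to unpack it by hand. Here is a minimal Python sketch for reading the data straight from the zip. It assumes the archive contains at least one .csv file encoded as UTF-8; the function name read_first_csv and the file name in the usage comment are placeholders of my own, not anything Kaggle provides.

```python
import csv
import io
import zipfile


def read_first_csv(zip_path):
    """Return the rows of the first .csv file inside a zip archive,
    one dict per row with the column names as keys."""
    with zipfile.ZipFile(zip_path) as archive:
        csv_names = [name for name in archive.namelist()
                     if name.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no .csv file found in " + str(zip_path))
        with archive.open(csv_names[0]) as raw:
            # Assumption: the file is UTF-8; change the encoding if needed.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Usage, after running the copied API command in the terminal
# (the file name is a placeholder, not a real download):
# rows = read_first_csv("some-dataset.zip")
# print(len(rows), "rows; columns:", list(rows[0].keys()))
```

All values come back as text, so numeric columns still need converting before you calculate with them.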

Kernel
Let's look for a kernel. Again, search for 'olympics' and set the language to R. Click on the kernel you like, and explore the code and visualizations. In the end you can choose to download the code as a Jupyter notebook, or use the API (change the file type from .irnb to .ipynb). If you have the data ready, you can run the code step by step.
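
Changing the file type is just a rename. A small Python sketch of that step follows; it assumes the downloaded .irnb file is an ordinary Jupyter notebook stored as JSON, which the renaming trick suggests. The function name irnb_to_ipynb is my own.

```python
import json
from pathlib import Path


def irnb_to_ipynb(path):
    """Rename a downloaded .irnb kernel to .ipynb and return the new path."""
    source = Path(path)
    if source.suffix != ".irnb":
        raise ValueError("expected a .irnb file, got " + source.name)
    # Assumption: the notebook is JSON. Parsing it first means a broken
    # download fails here instead of later inside Jupyter.
    json.loads(source.read_text(encoding="utf-8"))
    target = source.with_suffix(".ipynb")
    source.rename(target)  # only the extension changes, not the content
    return target


# Usage (placeholder file name):
# new_path = irnb_to_ipynb("olympics-kernel.irnb")
```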

Advanced
Now let's go for the advanced stuff and dig into data science. Could I run a kernel on the Kaggle platform? Here we leave data journalism and fully use the possibilities of Kaggle.
Watch the following short videos: Getting Started on Kaggle: Python coding in Kernels   and
How to Make a Data Science Project with Kaggle