Thursday, 28 February 2019


I was looking for an instructive data set to use at a data journalism training. After World Bank data on GDP and life expectancy, sovereign debt and sovereign ratings from Trading Economics, and financial inclusion, I needed a fresh start: something new, interesting, and journalistic. It also had to be general, that is, appropriate for a group of African journalists. In December I had worked on the Economist Big Mac index for African countries.
Wandering around, I stumbled into Kaggle...

Musing about Economist data, I vaguely remembered something about Olympic medals, for example a piece about predicting winners based on GDP per capita. An interesting graphic, but no data.
Next I went looking for a database on the Olympics. After playing with and tuning Google searches, I ended up in unknown territory: Kaggle. This was data from the Guardian analyzing the summer and winter games, along with data about the countries. Very interesting data. But let's first check the source.

What is Kaggle?
A Wikipedia article turned up, saying that it is a Google-owned site where data scientists and machine learning practitioners exchange ideas, data, and prediction models. Oh, and they share data sets and the code for analyzing them, called kernels. The code comes in two flavors: Python or R. That means you can look up a data set together with its code and immediately have the analysis with visualizations. Read more about Kaggle:

Browsing a bit deeper in Kaggle, I ran into an Olympic data set with bio data of the athletes.
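To give an idea of the kind of first look you could take at such an athlete data set, here is a minimal Python sketch. It is not taken from any Kaggle kernel. The column names (`name`, `country`, `age`, `medal`) and the sample rows are invented for illustration, so a real file will need its own column names.

```python
import csv
import io
from collections import Counter

# Toy rows invented for illustration; not taken from any real data set.
# The column names are assumptions, a real athlete file may differ.
SAMPLE_CSV = """name,country,age,medal
Athlete 1,Country A,24,Gold
Athlete 2,Country B,31,
Athlete 3,Country A,27,Silver
Athlete 4,Country C,22,Bronze
Athlete 5,Country A,29,
"""


def medals_per_country(csv_file):
    """Count rows with a non-empty medal field, grouped by country."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if row["medal"]:  # an empty string means no medal for this row
            counts[row["country"]] += 1
    return counts


counts = medals_per_country(io.StringIO(SAMPLE_CSV))
for country, total in counts.most_common():
    print(country, total)
```

With a downloaded file you would pass `open("your_file.csv", newline="")` instead of the in-memory sample.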

Next, a dive into the kernels
I focused on R and the Olympics. The first kernel, ranked by 'hottest' or most popular, blew me away completely: a full analysis, with a table of contents and chapters, with text and visualizations. Push the “code” button and you have an R script ready to run in RStudio.

The most important part is the competition. Various groups of data scientists compete with each other to develop the model that produces the most reliable predictions. One example is an earthquake prediction model.

Kaggle looks like a data goldmine; it combines interesting data sets with code, and is useful for creating assignments and for simply showing how to do an analysis. Python is favored over R. R is more for statisticians crunching data, while Python, pandas, and Jupyter notebooks are more for developers aiming at machine learning. More about the difference between R and Python:

Want to learn Python or R?
Here are some links:

