Thursday 28 February 2019

DATA JOURNALISTS: WHAT DO YOU KNOW ABOUT KAGGLE? (1)


I was looking for an instructive data set to use in a data journalism training. After World Bank data on GDP and life expectancy, sovereign debt and sovereign ratings from Trading Economics, and financial inclusion, I needed a fresh start: something new, interesting, and journalistic. It also had to be general, that is, appropriate for a group of African journalists. In December I worked on the Economist Big Mac index for African countries.
Wandering around, I stumbled upon Kaggle...


Musing about Economist data, I vaguely remembered something about Olympic medals, for example this piece about predicting winners based on GDP per capita: https://www.economist.com/graphic-detail/2016/08/23/the-rich-are-different-at-the-olympics . An interesting graphic, but no data.
Next, I went looking for a database on the Olympics. After some playing with and tuning of Google searches I ended up in unknown territory: https://www.kaggle.com/the-guardian/olympic-games#dictionary.csv . This was data from the Guardian covering the summer and winter games, plus data about the countries. Very interesting data. But let's first check the source.

What is kaggle.com?
A Wikipedia article that turned up says that it is a site owned by Google for data scientists and machine learning practitioners, who exchange ideas, data, and prediction models (https://en.wikipedia.org/wiki/Kaggle ). They also share data sets and the code for analyzing these data sets, called kernels. The code comes in two flavors: Python or R. That means you can look up a data set together with its code and immediately have the analysis with visualizations. Read more about Kaggle: https://medium.com/implodinggradients/should-you-kaggle-5b8dbdef442f

Browsing a bit deeper in Kaggle, I ran into an Olympic data set with biographical data on the athletes: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results

Next, a dive into the kernels
Focusing on R and the Olympics: the first one, sorted by 'hottest' or most popular, blows me away completely. A full analysis, with a table of contents and chapters, with text and visualizations; push the “code” button and you have an R script ready to run in RStudio.
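
To give a flavor of the kind of step such a kernel starts with, here is a minimal sketch in Python that counts medals per country. It is not taken from any Kaggle kernel. The column names (NOC, Games, Event, Medal) are my assumption about the layout of the athletes data set, and the sample rows are invented for illustration, not real results.

```python
import csv
import io
from collections import Counter

# Invented sample rows, for illustration only; NOT real Olympic results.
# The column names are an assumption about the Kaggle file layout.
SAMPLE_CSV = """Name,NOC,Games,Event,Medal
Athlete A,KEN,2016 Summer,Athletics Men's Marathon,Gold
Athlete B,ETH,2016 Summer,Athletics Men's Marathon,Silver
Athlete C,KEN,2016 Summer,Athletics Men's 4 x 400 metres Relay,Bronze
Athlete D,KEN,2016 Summer,Athletics Men's 4 x 400 metres Relay,Bronze
Athlete E,NGR,2016 Summer,Football Men's Football,NA
"""


def medals_per_country(csv_file):
    """Count medals per country code, counting a team medal only once."""
    seen = set()        # (country, games, event, medal) combinations already counted
    counts = Counter()  # country code -> number of medals
    for row in csv.DictReader(csv_file):
        # Rows without a medal are skipped ("NA" is assumed to mean "no medal").
        if row["Medal"] in ("", "NA"):
            continue
        key = (row["NOC"], row["Games"], row["Event"], row["Medal"])
        # One row per athlete means a relay team shows up several times;
        # count each team medal only once.
        if key in seen:
            continue
        seen.add(key)
        counts[row["NOC"]] += 1
    return counts


if __name__ == "__main__":
    counts = medals_per_country(io.StringIO(SAMPLE_CSV))
    for noc, n in counts.most_common():
        print(noc, n)
```

To try this on a downloaded file, pass an opened file instead of the sample text, for example open("athlete_events.csv", newline=""), where the file name is again an assumption. The real kernels go much further, with charts and text around each step.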

The most important part of Kaggle is the competitions. Various groups of data scientists compete with each other to develop the model that produces the most reliable predictions. For example, this earthquake prediction competition: https://www.kaggle.com/c/LANL-Earthquake-Prediction .

Kaggle looks like a data goldmine: it combines interesting data sets with code, and is useful for creating assignments and for simply showing how to do an analysis. On Kaggle, Python seems to be favored over R. R is more for statisticians crunching data; Python, with pandas and Jupyter notebooks, is more for developers aiming at machine learning. More about the difference between R and Python: https://medium.com/@data_driven/python-vs-r-for-data-science-and-the-winner-is-3ebb1a968197

Want to learn Python or R?
Here are some links:


