I was looking for an
instructive data set to use at a data journalism training. After
World Bank data on GDP and life expectancy, sovereign debt and
sovereign ratings from Trading Economics, and financial inclusion,
I needed a fresh start: something new, interesting, and
journalistic. It also had to be general, that is, appropriate for a
group of African journalists. In December I worked on the Economist
Big Mac index for African countries.
Wandering around, I stumbled on Kaggle.
Musing about
Economist data, I vaguely remembered something about Olympic medals.
For example, this article about predicting winners based on GDP per
capita:
https://www.economist.com/graphic-detail/2016/08/23/the-rich-are-different-at-the-olympics
An interesting graphic, but no data.
Next I looked for a
database on the Olympics. After some playing with and tuning of my
Google searches I arrived in unknown territory:
https://www.kaggle.com/the-guardian/olympic-games#dictionary.csv
This was data from the Guardian covering the summer and winter games,
plus data about the countries. Very interesting data. But let's
first check the source.
What is kaggle.com?
A Wikipedia article says that it is a Google-owned site where
data scientists and machine learning practitioners exchange ideas,
data, and prediction models (https://en.wikipedia.org/wiki/Kaggle).
They also share data sets and the code for analyzing these
data sets, called kernels. The code comes in two flavors: Python or R.
That means you can find a data set together with its code and
immediately have an analysis with visualizations. Read more about Kaggle:
https://medium.com/implodinggradients/should-you-kaggle-5b8dbdef442f
Browsing a bit
deeper in Kaggle I ran into an Olympic dataset with biographical data
on the athletes:
https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results
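To give an idea of what a first look at such athlete data could involve, here is a minimal Python sketch that counts medal rows per country using only the standard library. The rows are made-up placeholders, not real results, and the column names (NOC for the country code, Year, Medal) are an assumption about the file layout; check the header of the downloaded CSV before relying on them.

```python
import csv
import io
from collections import Counter

# Made-up placeholder rows, NOT real Olympic results.
# The column names are assumed; verify them against the actual file.
SAMPLE = """Name,NOC,Year,Medal
Athlete A,KEN,2016,Gold
Athlete B,KEN,2016,NA
Athlete C,ETH,2016,Silver
Athlete D,KEN,2012,Bronze
Athlete E,ETH,2016,Gold
"""


def medals_per_country(csv_file):
    """Count the rows that carry a medal, grouped by country code."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Rows without a medal are assumed to hold "NA" or an empty string.
        if row["Medal"] not in ("", "NA"):
            counts[row["NOC"]] += 1
    return counts


counts = medals_per_country(io.StringIO(SAMPLE))
for noc, n in counts.most_common():
    print(noc, n)
```

To try this on a downloaded file, replace io.StringIO(SAMPLE) with an opened file, for example open("athletes.csv", newline="", encoding="utf-8"), where the file name is hypothetical. Note that this counts one row per athlete, so a team medal would be counted once for every team member.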
Next, dive into the
kernels
I focused on R and the Olympics. The first kernel, sorted by 'hottest'
or most popular, blew me away completely: a full analysis with a
table of contents and chapters, text and visualizations. Push
the “code” button and you have an R script ready to run in
RStudio.
Have a look at this
one:
https://www.kaggle.com/heesoo37/olympic-history-data-a-thorough-analysis
The most important
part of Kaggle is the competitions. Groups of data scientists
compete with each other to develop the model that produces the
most reliable predictions. For example, this earthquake prediction
competition:
https://www.kaggle.com/c/LANL-Earthquake-Prediction
Kaggle looks like a
data goldmine; it combines interesting data sets with code, and is
useful for creating assignments and for showing how to do an
analysis. On Kaggle, Python seems to be favored over R. R is more for
statisticians crunching data, while Python, with pandas and Jupyter
notebooks, is more for developers aiming at machine learning. More
about the difference between R and Python:
https://medium.com/@data_driven/python-vs-r-for-data-science-and-the-winner-is-3ebb1a968197
Do you want to learn
Python or R?
Here are some links:
For Python:
https://www.kaggle.com/learn/overview
and also:
http://hamelg.blogspot.com/2015/12/python-for-data-analysis-index.html?view=classic