Friday 14 June 2019

Habermas 90!

Die Zeit is a great newspaper: still a broadsheet, with wide pages that leave room for articles of more than 200 words, and it smells like a newspaper. Of all the sections, one is especially interesting: the FEUILLETON, dedicated to the 90th birthday of Habermas. A crowd of professors give their opinion on the greatest living philosopher. A must-read!

Tuesday 11 June 2019

WILL THE DATA JOURNALIST ALSO BECOME A DATA SCIENTIST?


Am I really going to retire now? Most journalists nowadays have mastered the familiar Excel spreadsheet and can make a chart or a map with, for example, Tableau or Plotly. In Tanzania, at Mwananchi (Citizen) or Habari Leo (Daily News), journalists work with these standard tools. The only problem is that the government bans the newspaper from publishing when the figures do not fit government policy. And if you are not careful, you end up behind bars.

R in fashion
Training financial and economic journalists, during the Highway Africa Barclays/ABSA training, taught me that we have to go beyond this standard toolkit. These more specialized journalists are interested in more statistical background to the data. R is a good start for that: https://www.denieuwereporter.nl/2014/04/vijf-redenen-om-r-te-gebruiken-in-datajournalistiek/
Within data journalism R is 'hot'. Training sessions of the IRE include a module on working with R as standard.
The Economist recently decided to publish its data analyses and visualizations, made in R, on GitHub. You can easily download them and rerun the analysis yourself. The Big Mac Index is a nice example (https://github.com/TheEconomist/big-mac-data). Economic data for Sub-Saharan Africa are missing at the Economist. A nice training assignment was to find those data and then calculate a Big Mac Index for Sub-Saharan Africa.
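For the training assignment, the calculation itself is small. Here is a minimal sketch in Python, assuming the usual raw index: the exchange rate implied by burger prices compared with the market rate. The function name and all figures are made up for illustration; the Economist's own R code may do more, so treat this as a starting point only.

```python
# Raw Big Mac index: compare the exchange rate implied by burger prices
# with the actual market exchange rate.

def big_mac_index(local_price, usd_price, exchange_rate):
    """Return (implied_ppp, valuation) for one country.

    local_price   -- price of a Big Mac in local currency
    usd_price     -- price of a Big Mac in the United States, in dollars
    exchange_rate -- market rate, local currency units per US dollar
    """
    implied_ppp = local_price / usd_price
    # Negative: currency undervalued against the dollar; positive: overvalued.
    valuation = implied_ppp / exchange_rate - 1
    return implied_ppp, valuation


# Hypothetical figures, for illustration only (not real prices or rates).
USD_PRICE = 5.00
countries = {
    "Country A": {"local_price": 50.0, "exchange_rate": 20.0},
    "Country B": {"local_price": 1200.0, "exchange_rate": 200.0},
}

for name, row in countries.items():
    ppp, valuation = big_mac_index(
        row["local_price"], USD_PRICE, row["exchange_rate"]
    )
    print(f"{name}: implied PPP {ppp:.2f}, valuation {valuation:+.1%}")
```

Replace the made-up dictionary with the prices and exchange rates you have collected for each country.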
The BBC also decided to pay more attention to R, in particular to visualization with R (https://medium.com/bbc-visual-and-data-journalism/how-the-bbc-visual-and-data-journalism-team-works-with-graphics-in-r-ed0b35693535).

Monday 10 June 2019

THE NETHERLANDS HAS A 5% CHANCE OF WINNING THE FIFA WOMEN'S WORLD CUP

AI, a random forest ensemble learner, predicts:
THE NETHERLANDS HAS A 5% CHANCE OF WINNING THE FIFA WOMEN'S WORLD CUP

AI has also broken through in predicting sports matches. R-Bloggers (https://www.r-bloggers.com/hybrid-machine-learning-forecasts-for-the-2019-fifa-womens-world-cup/) recently published the predictions of a study that uses random forest, a machine learning method that combines many decision trees.

The predictions are based on the following data:
- an estimate of team strength based on a series of historical matches;
- an estimate based on the predictions of 18 bookmakers;
- a number of team variables and a number of variables that are specific to each country.


With these data the random forest algorithm is trained and a model is constructed.
With this model:
- the probability of 'win, lose or draw' is calculated for each match;
- and finally the probability of winning the FIFA Women's World Cup 2019.
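One of the inputs listed above, the bookmakers' predictions, can be illustrated without any machine learning. Below is a minimal sketch in Python of a common way to turn decimal betting odds into probabilities that sum to one. The team names and odds are hypothetical, and the study itself may process the odds of its 18 bookmakers differently.

```python
# Turn decimal bookmaker odds into probabilities that sum to 1.
# Raw inverse odds add up to more than 1 because of the bookmaker's margin,
# so we rescale them (a simple normalisation; other corrections exist).

def implied_probabilities(decimal_odds):
    """Map each team to its normalised implied probability of winning."""
    raw = {team: 1.0 / odds for team, odds in decimal_odds.items()}
    total = sum(raw.values())
    return {team: p / total for team, p in raw.items()}


# Hypothetical odds for a three-team contest, for illustration only.
odds = {"Team A": 1.6, "Team B": 3.2, "Team C": 3.2}

for team, prob in implied_probabilities(odds).items():
    print(f"{team}: {prob:.1%}")
```

In a full model such probabilities would be only one feature among many, next to the historical strength estimates and the team and country variables.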

This is an interesting development, which also opens up new possibilities for sports journalism in general and data journalism in particular.

Tuesday 4 June 2019

AirDroid: connect your phone to your PC

I had a backup of all my apps. Hooray! My phone crashed again. It is an old Samsung S5 that works fine, but I could not root it using the standard method of flashing a root image with Odin or JOdin (for Linux). So I tried something new: Magisk, which seems an easy-to-use tool. No way... I had to flash a stock ROM for the S5.

The backup of the apps was made with AirDroid, and in no time I had restored my apps.
AirDroid is an app that must be installed on your phone. After installation, create an account. You can then log in to your phone using a local address in your browser.
First, you have access to all the files and apps on your phone; second, you can back up apps or install new ones. The most important options are under security and remote features: finding your phone when it is lost or stolen, remote control, or using your phone as a webcam.


Thursday 9 May 2019

Ad blocking over VPN

I have a Raspberry Pi in one of my drawers, and I found a new job for it: blocking web advertisements over a VPN. Just install Pi-hole and PiVPN (OpenVPN) on the Raspberry Pi. Now I run, for free, a VPN that blocks all these annoying ads and trackers. There are lots of how-tos available for the installation. Here is an example.
In the picture below you see, on the right, the mobile phone's connection to the VPN, then the log of the VPN server, and on the left the Pi-hole data.



Wednesday 17 April 2019

Google Colaboratory: the data journalist's little helper

Jupyter notebooks are very helpful for sharing and publishing your R code. See more: https://d3-media.blogspot.com/2017/09/jupyter-notebook.html . However, you have to install the software. Now Google has an interesting solution: Jupyter notebooks online. To access the notebooks, go to Google Drive and look for Colab or Colaboratory. If it is not present, install the app: Drive, New, More, Connect to apps. There is one limitation: you can only use Python code. A pity, because I like R better.

Start entering code, or add comments with #. Run the code immediately to see the result. The amazing thing is that you have access to all kinds of machine learning software, like 'random forest' or TensorFlow. And you run your code for free on a GPU. Try to build a model for your prediction.
Here is an example to build a classification.


Of course Google offers help to get started: https://colab.research.google.com/notebooks/welcome.ipynb or try this intro with lots of extra info: https://towardsdatascience.com/getting-started-with-google-colab-f2fff97f594c .
There is also a free book online to introduce python and the basics of data science: https://github.com/jakevdp/PythonDataScienceHandbook 

Thursday 14 March 2019

KAGGLE: IS THERE DATA JOURNALISM IN MACHINE LEARNING? (4)

The answer is YES, there is data journalism in machine learning. For example there are sessions at IRE 2019 training for machine learning. Here are a few other links relating data journalism and machine learning:

Is there any use for machine learning in the newsroom?


What could be the output for a data journalist working with machine learning?
I have done some experiments in R at Kaggle using a data set about Dutch municipalities.

1. Clustering using kmeans: this makes it possible to generate meaningful clusters of municipalities based on income, population, house value, unemployment etc. Although the result is not very impressive, it is possible to find various centers or clusters in the data: large municipalities, high-income municipalities, and municipalities with high unemployment. Here is the kernel with the code and the results: https://www.kaggle.com/peterverweij/clustering-gemeentedata-using-kmeans


2. Predicting using a simple linear model: using two variables in the data set, income and house value. A plot of the data shows that there is a strong relationship, which is also shown by the regression line. Creating a linear model for these variables makes it possible to predict, for example, income from the value of the house.
3. Predicting using randomForest: the linear model is simple and works for interval variables. But predicting nominal values, for example the gender or political party of the mayor, requires a different approach. randomForest provided interesting outcomes.
Here is the link to both prediction models: https://www.kaggle.com/peterverweij/prediction-simple-machine-learning
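To make the second experiment concrete, here is a minimal sketch in Python of a simple linear model fitted by ordinary least squares. The kernels linked above use R; this is not that code, and the house values and incomes below are invented for illustration.

```python
# Simple linear regression by ordinary least squares, standard library only.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict y for a new x using the fitted line."""
    return slope * x + intercept


# Invented figures (in thousands of euros), not the real municipality data.
house_value = [150, 180, 210, 260, 320]
income = [24, 26, 29, 33, 38]

slope, intercept = fit_line(house_value, income)
print(f"income = {slope:.3f} * house_value + {intercept:.2f}")
print(f"predicted income at house value 240: {predict(240, slope, intercept):.1f}")
```

With the real data set you would read the two columns from the csv file instead of typing them in.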
Data journalists have to dig deep into statistics, but these examples show that there is added value for reporting. These examples are of course limited; there is a whole set of different machine learning algorithms available in R, and I have only tried two. Here is the list:
https://www.kdnuggets.com/2015/06/top-20-r-machine-learning-packages.html

https://www.r-bloggers.com/what-are-the-best-machine-learning-packages-in-r/ 

Saturday 2 March 2019

LESSONS FROM KAGGLE(3)

The good thing about Kaggle from the perspective of data journalism is that it holds an interesting collection of datasets, ranging from the Olympics to Economic Freedom and Sovereign debt.
The formats of the datasets vary from csv to SQLite and json. The datasets can be downloaded and used for private analysis. However, once you have selected a dataset you can also look for its kernels, that is, the analyses of the data.

Let's take an example: search for data using the tag 'economics' and select the 'Economic Freedom Report' (2018). Scroll down to get an idea of the data. Next, go to the top left and select 'open in': either Google Sheets or Google Data Studio. Or use 'copy API command' to download the data to your own machine.

But before you start working, check what others have been doing with the data. Click on kernels. There are 9 kernels available, in R and in Python. Open the first one, which focuses on 'IS AN ECONOMICALLY FREE COUNTRY A BETTER PLACE TO LIVE?' Perhaps it is useful for your writing, perhaps not, and you start analyzing yourself; either way, looking at these examples gives you a nice intro to how to do data analysis.

When you have chosen a dataset you can start a new kernel and your own analysis. This is interesting because all the software you are using is in the cloud: you are working on a virtual machine with R and Python installed, and it runs the code you enter... all for free. You can save your work, make it public, share it with others and ask for comments. I think this is a very strong point of Kaggle.

Here is my example. I uploaded some old data about the Dutch municipalities and used R for some analysis. Have a look at the results:
Dataset in csv: https://www.kaggle.com/peterverweij/data-about-dutch-municipalitiestest
Kernel in R: https://www.kaggle.com/peterverweij/gemeente-test

Friday 1 March 2019

LEARNING R FOR DATA JOURNALISTS

In the 1970s I was trained in using SPSS (on a mainframe) for analyzing social data. But when I lost the free licence to use SPSS, I had to find an alternative. Of course spreadsheets were available, but that is not real statistics. Then I found out that R could load a GUI for SPSS. Interesting, but I learned that R had more possibilities: packages for all kinds of analysis. Second, I found RStudio, a perfect GUI for R.

But how do you get started with R, keeping in mind that data journalism is the goal?


Why?
5 reasons for using R:
http://memeburn.com/2014/05/5-compelling-arguments-for-using-r-in-data-journalism/ 


Here is a start to learn the basics

Computerworld published The Beginner’s Guide to R. This 30-page guide will show you how to install R, load data, run analyses, make graphs, and more.

Complete guide as PDF download: https://www.computerworld.com/article/2884322/learn-r-programming-basics-with-our-pdf.html#tk.ctw-eos

Other Resources published by Computerworld:
60+ R resources to improve your data skills: https://www.computerworld.com/article/2497464/top-r-language-resources-to-improve-your-data-skills.html

Overview: Quick R: http://www.statmethods.net/ 

And finally an intro to R coming from Kaggle: https://www.kaggle.com/hamelg/intro-to-r-index  or  http://hamelg.blogspot.com/2015/10/introduction-to-r-index.html?view=classic

Teach yourself basic R interactively in RStudio, using the swirl package.
Swirl is an R package designed to teach you R straight from the command line. Swirl provides exercises and feedback from within your R session to help you learn in a structured, interactive way.
More on swirl: https://www.rstudio.com/online-learning/

Here is a free book covering all the basics:
Import the pdf and read this intro to R on Kindle:
https://www.r-bloggers.com/yarrr-the-pirates-guide-to-r-2/ 

Keep informed about the latest developments and follow: https://www.r-bloggers.com/ 

WHAT KAGGLE CAN TEACH DATA JOURNALISTS (2)

Although Kaggle focuses on data science and machine learning there is a lot of interesting knowledge and ideas data journalists can get out of this platform.

Data
The easiest way to get data nuggets out of this goldmine is by using Google Data Studio (see also: https://d3-media.blogspot.com/2019/02/and-even-more-viz-tools.html ) and/or Google Sheets (Google Drive). Log in to Kaggle and find a data set by keyword or tag. Click on the three dots in the top right corner and choose one of the two options: Sheets or Data Studio.

The second option for getting data is using the API (see: https://www.kaggle.com/docs/api ). Choose 'copy API command' in the menu and use that command in the terminal. A zip file with the data in .csv format will be downloaded. From here you can analyze and visualize, for example by importing the data set in RStudio. But a lot of work has already been done.
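The copied command typically has the form `kaggle datasets download -d <owner>/<dataset>`. Once the zip file is on your machine you do not even have to unpack it. Here is a minimal sketch in Python that reads the first .csv file straight from the archive; the function name and the file name in the usage comment are hypothetical.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_source, member=None):
    """Return the rows of a csv file inside a zip archive as a list of dicts.

    zip_source -- path to the zip file, or an open binary file object
    member     -- name of the csv inside the archive; defaults to the first one
    """
    with zipfile.ZipFile(zip_source) as archive:
        if member is None:
            csv_names = [
                name for name in archive.namelist()
                if name.lower().endswith(".csv")
            ]
            if not csv_names:
                raise ValueError("no csv file found in the archive")
            member = csv_names[0]
        with archive.open(member) as raw:
            # Assumes the file is UTF-8 encoded; adjust if your data is not.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Usage (hypothetical file name):
# rows = read_csv_from_zip("olympic-games.zip")
# print(rows[0])
```

Each row comes back as a dictionary keyed by the column headers, which is enough for a first look at the data.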

Kernel
Let's look for a kernel. Again search for olympics and set the language to R. Click on the kernel you like, and explore the code and the visualizations. In the end you can choose to download the code into a Jupyter notebook, or use the API (change the file type from .irnb to .ipynb). If you have the data ready you can run the code step by step.
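Changing the file type is just a rename. If you download several kernels, a small Python helper saves some clicking. This is a sketch; the function name and the file name in the usage comment are hypothetical.

```python
from pathlib import Path


def to_ipynb(path):
    """Rename a downloaded .irnb kernel to .ipynb and return the new path."""
    source = Path(path)
    if source.suffix != ".irnb":
        raise ValueError(f"expected an .irnb file, got {source.name}")
    target = source.with_suffix(".ipynb")
    source.rename(target)
    return target


# Usage (hypothetical file name):
# to_ipynb("olympics-analysis.irnb")
```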

Advanced
Now let's go for the advanced stuff and dig into data science. Could I run a kernel on the Kaggle platform? Here we leave data journalism and use the full possibilities of Kaggle.
Watch the following short videos: Getting Started on Kaggle: Python coding in Kernels   and
How to Make a Data Science Project with Kaggle 



Thursday 28 February 2019

DATA JOURNALISTS: WHAT DO YOU KNOW ABOUT KAGGLE? (1)


I was looking for an instructive data set to use at a data journalism training. After World Bank data focusing on GDP and life expectancy, sovereign debt and sovereign ratings from Trading Economics, and financial inclusion, I needed a fresh start: something new, interesting, and journalistic. It also had to be general, that is, appropriate for a group of African journalists. In December I worked on the Economist Big Mac index for African countries.
Wandering around, I stumbled upon Kaggle...


Musing about Economist data, I vaguely remembered something about Olympic medals. For example this one: https://www.economist.com/graphic-detail/2016/08/23/the-rich-are-different-at-the-olympics , about predicting winners based on GDP per capita. An interesting graphic, but no data.
Next I looked for a database on the Olympics. After some playing with and tuning of Google searches I came to unknown territory: https://www.kaggle.com/the-guardian/olympic-games#dictionary.csv . This was data from the Guardian analyzing the summer and winter games, plus data about the countries. Very interesting data. But let's first check the source.

What is kaggle.com ? 
A Wikipedia article turned up saying that it is a Google site for data scientists and machine learning, for exchanging ideas, data, and prediction models (https://en.wikipedia.org/wiki/Kaggle ). Oh, and they share data sets and the code for analyzing these data sets, called kernels. The code comes in two flavors: Python or R. So that means you can look for a data set and its code and immediately have the analysis with visualizations. Read more about Kaggle: https://medium.com/implodinggradients/should-you-kaggle-5b8dbdef442f

Browsing a bit deeper in Kaggle I ran into an Olympic dataset with bio data of the athletes: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results

Next dive in the kernels
Focusing on R and the Olympics. The first one, sorted by 'hottest' or most popular, blows me away completely. A full analysis, with a table of contents and chapters, with text and visualizations; push the button "code" and you have an R script ready to run in RStudio.

The most important part is the competition. Various groups of data scientists compete with each other to develop the model that produces the most reliable predictions. For example this earthquake prediction model: https://www.kaggle.com/c/LANL-Earthquake-Prediction .

Kaggle looks like a data goldmine; it combines interesting data sets with code, and is useful for creating assignments and for simply showing how to do an analysis. Python is favored above R. R is more for statisticians cracking data; Python, pandas and Jupyter notebooks are more for developers aiming at machine learning. More about the difference between R and Python: https://medium.com/@data_driven/python-vs-r-for-data-science-and-the-winner-is-3ebb1a968197

You want to learn Python or R?
Here are some links:



Thursday 21 February 2019

And even more viz tools

In December at the Highway Africa conference I gave a workshop about 'data journalism made easy': from Flourish to Workbenchdata. More and more tools for analysis and visualization are available for data diggers.
Oh serendipity, when playing with Google Analytics... I found a new Google tool: Data Studio. This is the welcome page, which provides some background about this viz tool. Click on the link in the top right corner or go to datastudio.google.com and log in with your Google account. Create a new report and select the data source. I simply chose to upload some .csv data. Then insert the type of graphic you want. All the standard options, from bars to scatter plots, are available. Even maps work fine when you select a continent or a part of that continent.
BUT: you export your graphs as a report, that is, as a pdf! I haven't seen an embed option.



Sunday 17 February 2019

GUI for R ggplot

In data journalism training I often use Tableau to make visualizations. Fast and easy. However, sometimes we analyze data not in Excel but in R. Plotting data with ggplot is not so easy, because of the variables you have to define.

Now there is a new package, 'esquisse' (sketch), based on Shiny, that works as a GUI for ggplot2. Install the package and start the GUI for ggplot from the addin menu or from the console: esquisse::esquisser(). Don't tag the package. Select a data file. In the example below I use GDP and life expectancy for Sub-Saharan African countries. On the bottom line you can fine-tune labels, colors and theme, select data and export the plot.
See for example my static blog using R blogdown:
https://d3media.netlify.com/post/export-code-from-esquisse-for-ggplot/

Monday 11 February 2019

Write your blogs in RStudio with Markdown

You are working in R and want to show or export your code? Simple: use R Markdown. Create a markdown (Rmd) document with your code and comments. Then export the Rmd with Knit to Word, PDF or HTML.
Publishing your R code directly is not complicated with the new package blogdown. Blogdown is an effort to integrate R Markdown with open source static site generators like Hugo. To deploy your R Markdown blog, use Netlify, a platform for managing and deploying modern web projects.
Below is the workflow.


Monday 4 February 2019

The decolonisation of the internet

Habermas might have listened in with a smile. Last month Meetup Arnhem organised a discussion evening in Villa Sonsbeek under the title "The internet is broken". Geert-Jan Bogaerts, head of digital at the VPRO, presented the PublicSpaces plan here. PublicSpaces resists the colonisation of the internet by the digital superpowers: Facebook, Google etc. A kind of "Strukturwandel der Öffentlichkeit", but digital. After all, in the beginning the internet was an open and decentralised space that evoked vistas of freedom and discussion; an "electronic town hall" comparable to Athenian democracy, wrote Al Gore. Now the internet is broken.

Sunday 3 February 2019

TWO TRENDS IN DATA JOURNALISM

In this blog post I am trying to synthesize my thinking about tools and developments in data journalism.

Data journalism is getting easier. The days of writing difficult formulas in Excel or doing a database join in order to make a map in Google Fusion Tables are over. In recent years new software for data journalism has been developed and made available. Every reporter and editor should be able to work with these new web-based services for reporting data stories. Datawrapper is already a classic, but Workbenchdata and Flourish are new, easy-to-use tools.

R Project
Digging deeper into data and visualizing complicated relationships needs another approach. The R project is an interesting candidate, and there are good reasons to use R in journalism. Two British media, the Economist and the BBC, have published data journalism stories based on R. At Birmingham City University, Paul Bradshaw trains journalism students in an MA programme for these jobs, in R and coding.
So there is also a movement in the opposite direction. Data journalism is getting more complicated by integrating skills and tools from computer science and statistical analysis.

Economist
The Big Mac index by the Economist is a nice example. This index tries to establish whether currencies are over- or undervalued. The data and the calculations in R for the Big Mac index are published on GitHub as a Jupyter Notebook.

BBC
The BBC also got into R, especially for making graphs with ggplot, driven by the need to standardize that production. Data journalists at the BBC developed their own R package for this work, called bbplot. Every R user can install this package and start producing graphs the BBC way. Working with R is not self-evident and beginners need some help. That is what the BBC's cookbook is for, with recipes for various graphs, from line to bar or scatter diagrams.
It is remarkable that the BBC writes: “We don’t use it for interactive graphics, for which the Javascript library D3 is better suited… For static charts we’ve found R and ggplot2 to be very useful”. D3 is a good choice, but you need JavaScript and the D3 library to do that work, which is something completely different. I think it would be easier to use plotly for exporting ggplots, or to use R's Shiny server for interactive graphs.

I am quite enthusiastic about this development in data journalism. In times of diseases like fake news and Facebook manipulation, fact-based reporting is the only medicine.

Wednesday 16 January 2019

Mirror Android Phone on Linux

Getting your Android phone connected to a Linux box is not so difficult: install and run adb (Android Debug Bridge) to swap files. Mirroring and controlling the device is a bit more difficult. scrcpy makes it possible.
How does that work? scrcpy installs a server on your phone and communicates with the Linux box over adb. This server sends a video stream to the Linux box; all input events (mouse and keyboard) are captured and sent back to the phone, which makes interaction with the phone easy.
For installing scrcpy there are two possibilities:
- build the app manually;
- run a docker image

Thursday 10 January 2019

THE INTERNET IS BROKEN AND FACEBOOK HOLDS THE SMOKING GUN


The central conclusion from the documentary The Facebook Dilemma, which was recently broadcast on TV, is that Facebook was too late and too slow in handling fake news, hate speech and fake accounts, manipulation and social conflict. It is interesting to see how Mark Zuckerberg defends his company by saying that Facebook is a tech company aiming at bringing people together. As a tech company, Facebook/Zuckerberg pretends that tech tools will deliver the solutions for all the social and political problems created by Facebook. The parallel with the movie about the creation of Facebook, The Social Network, is obvious. In that movie Zuckerberg, then a clumsy geek at Harvard, created a computer program for rating female students by hacking the database of the university. Perhaps he was compensating for his own lack of the social skills needed to get in touch with female students.

Media Company
The problem is that Facebook is not a tech company but a media company. Controlling the flow of news and information on Facebook is only possible by understanding the role of media in a democracy. At traditional media this role is in the hands of editors, who check and filter the news before it is published. Facebook, still not admitting that it is a media company, hired thousands of censors who check the flow of information based on keywords. Probably the aim in the end is complete automation by algorithm. I don't believe this will work for more than 2.5 billion users on a world scale. The platform is too large, and the social and political problems in all corners of the world are too complicated to be handled by keyword-searching censors. The best solution would be to break up Facebook into different sub-media, for example by region or by topic, with editors playing a central role in these sub-media.
Secondly, if Facebook acknowledged that it is a media company, it would be governed by media law and could be held responsible for what it publishes. Now it escapes this type of responsibility and control by invoking freedom of speech.
The problems related to the content of Facebook originate in the business model of Facebook and the design of the internet.

Tuesday 8 January 2019

RETRO-PLAYTIME 2

It is still holiday, so I play around a bit... I found some retro games from the nineties deep in a desk drawer: Day of the Tentacle and Myst, which were among the first interactive video games published in the nineties. My children loved playing these games. But can you revive them? Installing Win95 in a virtual box is a bit complicated. I discovered ScummVM. Install it on a Linux box or a Windows machine, and just play:

Myst caused some trouble because copying the data failed, probably some protection.....
Oh, even more convenient: play on a phone or tablet and get ScummVM from the Google app store.