Tuesday, 24 July 2018

Text Mining Made Easy

When I do a text analysis I generally use R. R has various libraries for text analysis, and there are also how-tos. Here is one for basic text mining in R by Philip Murphy: https://rstudio-pubs-static.s3.amazonaws.com/265713_cbef910aee7642dc8b62996e38d2825d.html. When you know the basics of R and RStudio, this works like a cooking recipe. However, for a training for data journalists this is a bit over the top, because first you have to talk through some theory about text mining, next introduce R and RStudio, and then take them step by step through an example. This learning curve is a bit steep.
I found some light at the end of the Google tunnel: Voyant Tools, https://voyant-tools.org/ . Voyant Tools is simple to handle, web based, and free, and it offers lots of possibilities for analysis, ranging from simple word frequencies and word clouds to correlations between words and links between words in a network. On top of all this, visualizations such as a word cloud can be downloaded separately as .png or .svg, and data such as word frequencies can be downloaded as .csv. Finally, there is a link to the page with the whole analysis. Upload your data and start mining.
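For readers who want to see what a word-frequency count amounts to under the hood, here is a minimal sketch in Python. It is not how Voyant Tools or the R tutorial work internally; the function name, the tiny stopword list and the sample sentence are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, very small stopword list; real tools ship much longer ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on"}


def word_frequencies(text, stopwords=STOPWORDS):
    """Count how often each word occurs, ignoring case and stopwords."""
    # Lowercase the text and keep runs of letters (and apostrophes) as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)


# Made-up sample text, only to show the call.
sample = "Data journalism is social science done on deadline. Data tells the story."
freqs = word_frequencies(sample)

# Print the three most frequent words with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

A word cloud is essentially a drawing of this same table, with the font size following the count.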

Sunday, 8 July 2018


Data journalism is already more than fifty years old. It started in the sixties as precision journalism with Phil Meyer, then became CARR (computer-assisted research and reporting), and is now called data journalism. The shortest definition of data journalism is 'social science done on deadline' (Steve Doig). We incorporate the tools of the social sciences to analyze data and include them in our storytelling.
In the beginning, some 10-15 years ago, practicing data journalism needed extra skills and training. Scraping data, cleaning it up and analyzing it in Excel, making graphs and maps, getting data into the story: all of this needed some extra journalism training. Therefore data journalism became a specialization of journalism.

The field is changing fast, and data journalism is becoming a do-it-yourself toolkit that everybody can use with a minimum of skills and understanding. Take a tool like Flourish, https://app.flourish.studio/ , for example: put the data in, push a button, and get a graph or a map. Or the latest: Workbench. Clean, scrape, analyze and visualize data without coding. It is a project from the Columbia J-school in New York. Sign up and get started: http://workbenchdata.com/. All the data journalism tools are integrated in one package.

Reflecting on data journalism on his Online Journalism Blog, Paul Bradshaw distinguishes two categories of data journalism training: teaching slow or fast. Teaching data journalism fast works as follows: “For many years I began my introductory data journalism classes with basic spreadsheet techniques, followed by visualization sessions to show them how to bring some of the results to life. In 2016, however, I decided to try something different: what if, instead of taking students through the process chronologically, we started at the end — and worked backwards from there? The class worked like this: students were given a spreadsheet of several tables already ready to be turned into a chart”. The new tools just mentioned not only make data journalism easy, but also clear the way for thinking about the story to be produced, rather than about the technology and number crunching behind it.


When I switched on the Internet at the School of Journalism at the end of the eighties of the past century, I was impressed by the idea of electronic communication, ranging from e-mail to IRC chat.
This would enhance communication and understanding, and contribute to democracy. Now the opposite is the case. “At the heart of their disenchantment, is that the internet has become much more ‘centralised’ (in the tech crowd’s terminology) than it was even ten years ago” … “the system was ‘biased in favour of decentralisation of power and freedom to act’”, writes The Economist.

From de-centralized to centralized
Instead of having direct one-on-one communication, decentralized and uncontrolled, we are working on controlled, centralized systems. “These days the main way of getting online is via smartphones and tablets that confine users to carefully circumscribed spaces, or “walled gardens”, which are hardly more exciting than television channels”. It almost looks as if the times before the Internet have returned. Is Facebook so different from what was once Compuserve?

The decentralized infrastructure of the Internet is still there. At the basic level the net still runs on TCP/IP. “The connections to transfer information still exist, as do the protocols, but the extensions the internet has spawned now greatly outweigh the original network”. It is not the basic level but the levels higher up that are centralized and controlled: consumer websites and all these apps. Take the social networks, for example: we work on the machines of Facebook (comparable with the Compuserve mainframe). “The best way to picture all this is as a vast collection of data silos with big pipes between them, connected to all kinds of devices which both deliver services and collect more data”.

Data business
How could that happen? Answer: data! “The Google search engine attracts users, which attracts suppliers of content (in Google’s case, websites that want to be listed in its index), which in turn improves the user experience, and so on. Similarly, the more people use Google’s search service, the more data it will collect, which helps to make the results more relevant.” The same goes for Facebook or Instagram. Data and targeted advertising are the basis of the business model which turned the Internet into a totally different beast. “Having tried to sell its technology to companies, it went for advertising, later followed by Facebook and other big internet firms. That choice meant they had to collect ever more data about their users. The more information they have, the better they can target their ads and the more they can charge for them.”

Take back control
What can we do to take back our original control over our communication on the internet? Below I give a summary of four possible solutions, based on the literature referred to in the links.