Why “Python” is the best coding language for data journalism
By Dolly Setton Data Journalists
From the Economist Data Journalism newsletter
Most data journalists would agree that coding has become a core component of good data journalism. But few can agree which programming language is best for it: “R” or “Python”? Last week, one of my colleagues explained why they think R is the superior language for data journalism. This week, I will explain why I prefer Python.
Python is, like R, an open-source language, so it is free for anyone to use. Unlike R, it is a full-fledged, general-purpose language and an integral part of digital life. It powers popular tools and platforms such as Google Search, Spotify, Instagram and YouTube. Guido van Rossum, a Dutch developer, released the first version of Python in 1991.
Python is both one of the fastest-growing major programming languages and, according to the Tiobe Index, the most popular, as of October 2021. Python’s popularity owes much to the simplicity and efficiency of its syntax, which makes it relatively easy to write and read. As with R, many developers make and maintain packages that bundle up code, data and documentation that are useful for data journalism as well as other purposes. The Python Package Index currently lists 332,600 pieces of software available to download (almost 20 times as many as are available for R), which can do everything from searching online news articles to handling big datasets.
Data journalists often run Python in a “Jupyter notebook” (see image below), an open-source web application that creates a document in which you can import data, run code and create visualisations. It is a handy tool to keep a record of data explorations, create charts, style text and share the results of that work.
For data analysis, the cornerstone package in Python is “Pandas”. It allows you to manipulate data in the same table format as R and makes it easy to tackle missing data, form new columns and much more. Two other essential tools are “scikit-learn” and “NumPy”, which work like a charm for predictive modelling and machine learning. The “Statsmodels” module focuses on traditional statistical methods. Finally, “Matplotlib” and “seaborn” make it easy to chart the results, and “Plotly” can make them interactive (see image below).
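To give a flavour of what tackling missing data and forming new columns looks like in Pandas, here is a minimal sketch. The table, its column names and its figures are invented for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical data: countries, columns and values are made up for illustration.
df = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "population_m": [10.0, 20.0, None, 40.0],
        "cases": [500, 1500, 900, None],
    }
)

# Tackle missing data: keep only rows where both figures are present.
complete = df.dropna(subset=["population_m", "cases"]).copy()

# Form a new column: cases per million people.
complete["cases_per_m"] = complete["cases"] / complete["population_m"]

print(complete)
```

Dropping incomplete rows is only one option. Depending on the story, filling gaps with `fillna` or flagging them for follow-up reporting may be more appropriate.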
Python is also excellent for anyone who wants to take their data analysis further. Deep-learning research, which can be useful for some predictive modelling work, is made possible in Python through libraries such as “Keras”, “PyTorch”, and “TensorFlow”. And Python is not just good for big numbers. Textual analysis often offers rich potential for stories too, and Python excels in this area thanks to libraries such as “NLTK” and “Gensim”.
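The simplest form of textual analysis is counting which words appear most often. The sketch below does this with Python's standard library alone, not with NLTK or Gensim, which offer far richer tools such as proper tokenisers and topic models. The function name `top_words`, the tiny stop-word list and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop-word list; real projects use longer ones.
STOPWORDS = frozenset({"the", "a", "and", "of", "to", "is"})


def top_words(text, n=3):
    """Return the n most frequent words in text, ignoring case and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return counts.most_common(n)


# Made-up sample text.
sample = (
    "The minister said the budget is fair. "
    "The opposition said the budget is not fair."
)
print(top_words(sample))
```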
And finally, a crucial component of many data-driven stories is scraping. Though we can sometimes download data from trusted sources, scraping them from websites with some nifty code can enable us to get unique stories by creating new datasets, such as for this article about swearing on Mumsnet. Python makes scraping easy through libraries such as “Requests” and “Beautiful Soup”.
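A scraper built with these two libraries typically has two steps: Requests downloads the page, and Beautiful Soup picks out the parts you want. The following is a rough sketch, not the code behind any published article. The function names, the `headline` class and the sample HTML are hypothetical, and a real site will need selectors matched to its own markup (and a check of its terms of use).

```python
import requests
from bs4 import BeautifulSoup


def extract_headlines(html):
    """Return the text of every <h2 class="headline"> element in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        tag.get_text(strip=True)
        for tag in soup.find_all("h2", class_="headline")
    ]


def scrape_headlines(url):
    """Download a page and pull out its headlines."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on a 404 or server error
    return extract_headlines(response.text)


# Offline demonstration on a made-up snippet of HTML.
sample_html = """
<html><body>
  <h2 class="headline">First story</h2>
  <h2 class="headline"> Second story </h2>
  <h2 class="other">Not a headline</h2>
</body></html>
"""
headlines = extract_headlines(sample_html)
print(headlines)
```

Keeping the download and the parsing in separate functions means the parsing logic can be tried on a saved page without hitting the website repeatedly.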
Python is so widely used that there is an enormous amount of community support available, at sources such as Stack Overflow, Codecademy, PySlackers, Python Discord and PyLadies. And if you ever want to go beyond the number crunching and decide to develop an app, Python can help with that too. But my favourite argument for choosing Python is that you get to write code in a language named not for a reptile, nor for the letter that comes before “S” in the alphabet, but after the great British comedy troupe Monty Python.