The Nobel Prize for Literature has been awarded to Bob Dylan. How do you report on that? From a data-journalism perspective there are interesting possibilities. NRC Handelsblad published an infographic. Interesting, but there are other possibilities using R and Tableau. Here are a few examples. If you are interested in the how-to, follow the more tag.
Here is the recipe.
1. Find the lyrics of Dylan's songs. I discovered the following interesting collection. The songs are in PDF, so convert them to a .txt file, using for example http://pdftotext.com/nl/. Have a look at your text document, lyrics.txt: it is a document of around 270 pages with lyrics up to 2009, so the most important recordings are present.
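If you prefer to stay inside R, the conversion can also be scripted. A minimal sketch using the pdftools package (the file names lyrics.pdf and lyrics.txt are assumptions; adjust them to your own files):

```r
# Sketch: convert the PDF song collection to a plain text file with pdftools.
# "lyrics.pdf" and "lyrics.txt" are example names, not from the original post.
library(pdftools)

pages <- pdf_text("lyrics.pdf")   # one character string per page
writeLines(pages, "lyrics.txt")   # write all pages to a plain text file
```

This keeps the whole pipeline reproducible in one R script instead of relying on an external web converter.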
2. Next we have to clean up the document for text analysis in R. Here is an overview of the different steps.
We need the following libraries in R-studio:
"tm", "SnowballC", "RColorBrewer", "ggplot2", "wordcloud", "biclust", "cluster", "igraph", "fpc", "Rcampdf"
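Installing any missing packages and loading them can be done in one go. A small helper sketch (Rcampdf is not on CRAN, so it is left out of the install loop; note the package is spelled SnowballC on CRAN):

```r
# Install missing CRAN packages, then attach them all.
pkgs <- c("tm", "SnowballC", "RColorBrewer", "ggplot2", "wordcloud",
          "biclust", "cluster", "igraph", "fpc")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))
```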
We load lyrics.txt into R as docs and start cleaning up the text to make it ready for analysis. Here cname holds the path to the folder containing lyrics.txt (the exact path is up to your own setup):
cname <- file.path(".", "texts")   # folder that contains lyrics.txt; adjust as needed
docs <- Corpus(DirSource(cname))
summary(docs) then shows:
           Length Class             Mode
lyrics.txt 2      PlainTextDocument list
Clean up the txt with the following:
docs <- tm_map(docs, removePunctuation)                  # strip punctuation
docs <- tm_map(docs, removeNumbers)                      # strip digits
docs <- tm_map(docs, content_transformer(tolower))       # lower-case everything
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop common stop words
docs <- tm_map(docs, stemDocument)                       # reduce words to their stems
docs <- tm_map(docs, stripWhitespace)                    # collapse multiple spaces
docs <- tm_map(docs, PlainTextDocument)                  # back to plain text documents
Now we end up with a plain text document that can be staged for analysis.
We turn the text doc into a document term matrix.
dtm <- DocumentTermMatrix(docs)
and a term document matrix:
tdm <- TermDocumentMatrix(docs)
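A quick sanity check on the matrices helps before going further, for example listing the terms that occur at least 50 times (the threshold of 50 is an arbitrary choice):

```r
# Inspect the document-term matrix built above.
dim(dtm)                          # number of documents x number of terms
findFreqTerms(dtm, lowfreq = 50)  # terms appearing 50 times or more
```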
With the following we create a CSV file to use in Excel and Tableau.
First calculate the word frequencies, then remove the sparse terms and finally save the result as a .csv file:
freq <- colSums(as.matrix(dtm))                   # raw word frequencies
dtms <- removeSparseTerms(dtm, 0.1)               # drop sparse terms
freq <- colSums(as.matrix(dtms))                  # frequencies after trimming
wf <- data.frame(word = names(freq), freq = freq)
write.csv(wf, "word_freq.csv", row.names = FALSE) # file name is an example
The CSV in Excel shows a frequency table of the words in the lyrics, about 7,000 words. Making a word cloud of the first 100 could be done but does not show much. Instead, select the "cities" and "blues" words in the CSV imported into Excel and make a word cloud from them, for example in Tableau. Tableau is interesting for production because you can save your result as a .pdf (and turn it into .svg for hardcopy) and also embed it as a link online.
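A word cloud can also be drawn directly in R with the wordcloud package loaded earlier, which is handy as a quick check before moving to Tableau:

```r
# Quick word cloud of the 100 most frequent terms, using the freq vector built above.
library(wordcloud)
library(RColorBrewer)

set.seed(42)  # fix the random layout so the cloud is reproducible
wordcloud(names(freq), freq, max.words = 100,
          colors = brewer.pal(8, "Dark2"))
```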
Link to the Tableau dashboard: keywords songs.
Long words get more attention in a cloud, so you can easily turn the cloud into a bar graph instead.
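Such a bar graph can be made in R as well, using ggplot2 and the wf data frame built above (the cutoff of 15 words is an arbitrary choice):

```r
# Bar graph of the most frequent words, as an alternative to the word cloud.
library(ggplot2)

top <- head(wf[order(-wf$freq), ], 15)  # 15 most frequent words
ggplot(top, aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +                        # horizontal bars are easier to read
  xlab("Word") + ylab("Frequency")
```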
Can we say something about the sentiment in the lyrics? For a sentiment analysis scrutinizing positive and negative words we use the following recipe.
We change docs (from the start, which is still in memory) into a character vector docs2 and calculate the sentiment with the syuzhet package:
library(syuzhet)
docs2 <- unlist(lapply(docs, as.character))  # turn the corpus into a character vector
dSentiment <- get_nrc_sentiment(docs2)
anger anticipation disgust fear joy sadness surprise trust negative positive
  275          246     209  358 225     314      157   326      662      541
Finally, let's plot this output of sentiments:
sentimentTotals <- data.frame(colSums(dSentiment))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind(sentiment = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total Count") +
  ggtitle("Total Sentiment Score for Lyrics Bob Dylan")