The Nobel Prize for Literature has been awarded to Bob Dylan. How to report on that? From a data journalism perspective there are interesting possibilities. NRC Handelsblad published an infographic. Interesting, but there are other possibilities using R and Tableau. Here are a few examples; if you are interested in the how-to, read on.
Here is the recipe.
1. Find the lyrics of Dylan's songs. I discovered an interesting collection. The songs are in PDF, so convert them to a .txt file, for example with http://pdftotext.com/nl/. Have a look at your text document, lyrics.txt: it runs to around 270 pages and covers lyrics up to 2009, so the most important recordings are present.
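If you prefer to stay in R for this step, the pdftools package offers an alternative route (a sketch; the file name lyrics.pdf and the target folder texts/ are assumptions):

library(pdftools)
# Extract the text of the PDF collection, one character string per page
pages <- pdf_text("lyrics.pdf")
# Save as plain text in the folder we will read the corpus from
dir.create("texts", showWarnings = FALSE)
writeLines(pages, file.path("texts", "lyrics.txt"))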
2. Next we have to clean up the document for text analysis with R. Here is an overview of the different steps.
We need the following libraries in RStudio: "tm", "SnowballC", "RColorBrewer", "ggplot2", "wordcloud", "biclust", "cluster", "igraph", "fpc" and "Rcampdf".
We load lyrics.txt into R as docs and start cleaning up the text to make it ready for analysis:

cname <- file.path(".", "texts")  # the folder that contains lyrics.txt (adjust to your setup)
docs <- Corpus(DirSource(cname))
summary(docs)

Output:

           Length Class             Mode
lyrics.txt 2      PlainTextDocument list
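A quick check that the document loaded as expected (a sketch):

# Peek at the corpus and the first lines of the text
inspect(docs)
writeLines(head(as.character(docs[[1]])))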
Clean up the text with the following:

docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, content_transformer(tolower))  # wrap tolower so the corpus structure is preserved
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
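It is worth verifying the cleaning before building the matrices (a sketch):

# The text should now be lower case, without punctuation, numbers or stop words
writeLines(head(as.character(docs[[1]]), 5))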
Now we end up with a plain text document that can be staged for analysis. We turn the text into a document term matrix:
dtm <- DocumentTermMatrix(docs)
and a term document matrix:
tdm <- TermDocumentMatrix(docs)
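The document term matrix can already be queried directly, for instance for terms that occur often (a sketch; the threshold of 50 is just an example):

# Terms that appear at least 50 times in the lyrics
findFreqTerms(dtm, lowfreq = 50)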
With the following we create a csv file to use in Excel and Tableau. First calculate the word frequencies, then remove sparse terms and finally save the result as .csv:

freq <- colSums(as.matrix(dtm))
dtms <- removeSparseTerms(dtm, 0.1)
freq <- colSums(as.matrix(dtms))
wf <- data.frame(word = names(freq), freq = freq)
write.csv(wf, file = "lyrics.csv")
The csv in Excel shows a frequency table of the words in the lyrics, about 7,000 words. Making a word cloud of the first 100 could be done, but it does not show much.
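For completeness, this is how such a cloud could be drawn with the wordcloud package we loaded earlier (a sketch; seed and palette are just choices):

set.seed(42)  # reproducible cloud layout
wordcloud(names(freq), freq, max.words = 100,
          colors = brewer.pal(8, "Dark2"))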
In the .csv imported into Excel, select the “cities” and “blues-words” and make a word cloud from them, for example in Tableau. Tableau is interesting for production because you can save your result as .pdf (and convert it to .svg for hardcopy) and embed it as a link online.
Link to the Tableau dashboard: keywords songs.
Long words get more attention in a cloud, so you can easily turn the cloud into a bar graph in Tableau.
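If you want to prototype that bar graph in R before moving to Tableau, ggplot2 can do it (a sketch; the cutoff of 100 occurrences is just an example):

# Bar graph of words occurring more than 100 times
ggplot(subset(wf, freq > 100), aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Word") + ylab("Frequency")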
Can we say something about the sentiment in the lyrics? For a sentiment analysis scrutinizing positive and negative words we use the following recipe:
library("syuzhet")
We change the docs (from the start, which is still in memory) into a character vector docs2 and calculate the sentiment:

docs2 <- as.character(docs)
dSentiment <- get_nrc_sentiment(docs2)
dSentiment
Output:

anger anticipation disgust fear joy sadness surprise trust negative positive
  275          246     209  358 225     314      157   326      662      541
Finally, let's plot this output of sentiments:

sentimentTotals <- data.frame(colSums(dSentiment))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total Count") +
  ggtitle("Total Sentiment Score for Lyrics Bob Dylan")
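The plot can be saved for print or the web, much like the Tableau exports mentioned above (a sketch; the .svg variant needs the svglite package):

ggsave("sentiment_dylan.pdf", width = 8, height = 5)
ggsave("sentiment_dylan.svg", width = 8, height = 5)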