d3-media: DYLAN'S DATA with R

The Nobel Prize in Literature has been awarded to Bob Dylan. How do you report on that? From a data journalism perspective there are interesting possibilities. NRC Handelsblad published an infographic. Interesting, but there are other options using R and Tableau. Here are a few examples, followed by the how-to.

Here is the recipe.

1. Find the lyrics of Dylan's songs. I found an interesting collection. The songs are in PDF, so convert them to a .txt file, for example with http://pdftotext.com/nl/. Have a look at the resulting text document, lyrics.txt. It runs to around 270 pages and covers lyrics up to 2009, so the most important recordings are included.

2. Next we have to clean up the document for text analysis with R. Here is an overview of the different steps.

We need the following libraries in RStudio:

"tm", "SnowballC", "RColorBrewer", "ggplot2", "wordcloud", "biclust", "cluster", "igraph", "fpc", "Rcampdf"
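Before loading anything, it can help to check which of these packages are already installed. The following is a minimal sketch in base R, not part of the original recipe. Rcampdf is left out on the assumption that it is not available from CRAN, and syuzhet is added because the sentiment step later in this post needs it.

```r
# Packages named in this post (Rcampdf left out: assumed not to be on CRAN)
needed <- c("tm", "SnowballC", "RColorBrewer", "ggplot2", "wordcloud",
            "biclust", "cluster", "igraph", "fpc", "syuzhet")

# Compare against what is installed on this machine
missing_pkgs <- setdiff(needed, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All packages are installed.")
}
```

Once everything is installed, load each package with library(), for example library(tm).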

We load lyrics.txt into R as docs (cname is the path to the folder that holds the file) and start cleaning up the text to make it ready for analysis:

docs <- Corpus(DirSource(cname))

summary(docs)

Output:

           Length Class             Mode
lyrics.txt 2      PlainTextDocument list
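DirSource() reads every file in the folder it is given, so it is worth checking that the folder holds only lyrics.txt. A minimal sketch in base R follows. The folder location and the placeholder file are assumptions made so the example runs on its own; point cname at your own folder instead.

```r
# Hypothetical folder; replace with the folder where you saved lyrics.txt
cname <- file.path(tempdir(), "texts")
dir.create(cname, showWarnings = FALSE)

# Stand-in file so the sketch is self-contained; use your converted lyrics.txt
writeLines(c("placeholder line one", "placeholder line two"),
           file.path(cname, "lyrics.txt"))

# Every file listed here would become one document in the corpus
files_found <- list.files(cname)
print(files_found)
```

If other files show up in the listing, move them out before building the corpus.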

Clean up the txt with the following:

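docs <- tm_map(docs, removePunctuation)  # strip punctuation

docs <- tm_map(docs, removeNumbers)  # strip digits

docs <- tm_map(docs, content_transformer(tolower))  # lower-case, keeping the tm document structure

docs <- tm_map(docs, removeWords, stopwords("english"))  # drop common English stop words

docs <- tm_map(docs, stemDocument)  # reduce words to their stems (uses SnowballC)

docs <- tm_map(docs, stripWhitespace)  # collapse runs of white space

docs <- tm_map(docs, PlainTextDocument)  # workaround for older tm versions; can be dropped when tolower is wrapped as above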

Now we end up with a plain text document that can be staged for analysis.
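To see what the cleaning steps do to a single line, here is a rough base R approximation. It is a sketch for illustration, not the tm implementation: the sample sentence and the three-word stop list are made up, and stemming is skipped because it needs SnowballC.

```r
# Made-up sample line, not taken from lyrics.txt
x <- "The 2 trains ARE rolling, down   to Memphis!"

x <- gsub("[[:punct:]]+", "", x)            # like removePunctuation
x <- gsub("[[:digit:]]+", "", x)            # like removeNumbers
x <- tolower(x)                             # lower-case
stop_demo <- c("the", "are", "to")          # tiny stand-in for stopwords("english")
pattern <- paste0("\\b(", paste(stop_demo, collapse = "|"), ")\\b")
x <- gsub(pattern, "", x, perl = TRUE)      # like removeWords
x <- trimws(gsub("\\s+", " ", x))           # like stripWhitespace, plus trimming the ends
print(x)
```

The order matters: lower-casing comes before the stop-word removal, otherwise "The" and "ARE" would not match the lower-case stop list.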

We turn the text doc into a document term matrix.

dtm <- DocumentTermMatrix(docs)

and a term document matrix:

tdm <- TermDocumentMatrix(docs)
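The two matrices hold the same counts in a different orientation: a document term matrix has one row per document and one column per term, and a term document matrix is its transpose. A minimal base R sketch with two made-up mini documents shows the layout. The names docs_demo, dtm_demo and tdm_demo are illustrative only.

```r
# Two made-up mini documents
docs_demo <- c(doc1 = "blues train blues", doc2 = "train river")

tokens <- strsplit(docs_demo, " ", fixed = TRUE)   # split each document into words
terms  <- sort(unique(unlist(tokens)))             # the vocabulary

# Count every term in every document: rows = terms, columns = documents
counts <- vapply(tokens,
                 function(tok) tabulate(match(tok, terms), nbins = length(terms)),
                 integer(length(terms)))
rownames(counts) <- terms

tdm_demo <- counts       # term document layout
dtm_demo <- t(counts)    # document term layout
print(dtm_demo)

# Word frequencies over all documents, as colSums() does on the real dtm
print(colSums(dtm_demo))
```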

With the following we create a csv file to use in Excel and Tableau.

First calculate the word frequencies, then remove sparse terms (with a single document such as lyrics.txt this step drops nothing) and finally save the result as .csv:

freq <- colSums(as.matrix(dtm))

dtms <- removeSparseTerms(dtm, 0.1)

freq <- colSums(as.matrix(dtms))

wf <- data.frame(word=names(freq), freq=freq)

write.csv(wf, file="lyrics.csv")
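The frequency table is easier to read when the most frequent words come first. A small base R sketch follows, assuming a frequency vector shaped like freq above. The three counts are made up so the example runs on its own, and the output file name is hypothetical.

```r
# Made-up counts standing in for the real freq vector
freq_demo <- c(road = 7, blue = 12, train = 9)
wf_demo <- data.frame(word = names(freq_demo), freq = freq_demo)

# Most frequent words first
wf_sorted <- wf_demo[order(-wf_demo$freq), ]
top_words <- head(wf_sorted, 100)   # the first 100, or fewer if the table is shorter
print(top_words)

# row.names = FALSE avoids writing each word a second time as a row label
out_file <- file.path(tempdir(), "lyrics_demo.csv")
write.csv(wf_sorted, file = out_file, row.names = FALSE)
```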

The csv opened in Excel shows a frequency table of the words in the lyrics, about 7000 words. A word cloud of the 100 most frequent words could be made, but it does not show much. Instead, select the “cities” and “blues-words” from the csv imported in Excel and make a word cloud from them, for example in Tableau. Tableau is interesting for production because you can save your result as .pdf (and convert it to .svg for print) and also as an embedded link for online use.
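The same selection can also be made in R before the csv goes to Tableau. This is a sketch under assumptions: the frequency table, the city list and the blues-word list below are all made up for illustration. In the real table the words are stems, so the lists would have to use the stemmed spelling.

```r
# Made-up frequency table standing in for lyrics.csv
wf_demo <- data.frame(word = c("memphis", "blue", "road", "mobile", "train", "said"),
                      freq = c(4, 12, 7, 3, 9, 20),
                      stringsAsFactors = FALSE)

cities      <- c("memphis", "mobile")        # hypothetical selection
blues_words <- c("blue", "train", "road")    # hypothetical selection

# Label each word with its group; words in neither list get NA
wf_demo$group <- ifelse(wf_demo$word %in% cities, "city",
                 ifelse(wf_demo$word %in% blues_words, "blues", NA))

# Keep only the labelled words for the word cloud
selection <- wf_demo[!is.na(wf_demo$group), ]
print(selection)
```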

Link to the Tableau dashboard with keywords from the songs:

https://public.tableau.com/views/dylan_blues2/Dashboard1?:embed=y&:display_count=yes

Long words get more attention in a cloud. To avoid that bias, you can easily turn the cloud into a bar graph in Tableau.
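A comparable bar graph can be drawn in R itself. The sketch below uses base graphics and three made-up counts; it is an alternative for a quick look, not the Tableau dashboard from the link.

```r
# Made-up counts standing in for the selected words
wf_demo <- data.frame(word = c("blue", "train", "road"),
                      freq = c(12, 9, 7),
                      stringsAsFactors = FALSE)

# Horizontal bars keep every word the same visual weight, whatever its length
mids <- barplot(wf_demo$freq,
                names.arg = wf_demo$word,
                horiz = TRUE, las = 1,
                xlab = "Frequency",
                main = "Word frequencies (demo data)")
```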

Can we say something about the sentiment in the lyrics? For a sentiment analysis that scores positive and negative words we use the following recipe:

library("syuzhet")

We convert docs (from the start, still in memory) into a character vector docs2 and calculate the sentiment:

docs2<-as.character(docs)

dSentiment <- get_nrc_sentiment(docs2)

dSentiment

  anger anticipation disgust fear joy sadness surprise trust negative positive

  275          246     209    358 225     314      157   326      662      541
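The idea behind such a lexicon-based count can be sketched in a few lines of base R: look up every word in a list of tagged words and count the tags. The lexicon below is invented for illustration and is not the NRC lexicon that syuzhet uses; the word vector is made up as well.

```r
# Invented toy lexicon, NOT the NRC lexicon
toy_lexicon <- data.frame(word      = c("love", "lonesome", "hope", "kill"),
                          sentiment = c("positive", "negative", "positive", "negative"),
                          stringsAsFactors = FALSE)

# Made-up words standing in for the cleaned lyrics
words <- c("love", "hope", "lonesome", "love", "road")

# Look up each word; words missing from the lexicon give NA and are not counted
matched <- toy_lexicon$sentiment[match(words, toy_lexicon$word)]
toy_counts <- table(factor(matched, levels = c("negative", "positive")))
print(toy_counts)
```

Words that are not in the lexicon, like "road" here, simply do not count towards any category.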

Finally, let's plot these sentiment totals:

sentimentTotals <- data.frame(colSums(dSentiment))

names(sentimentTotals) <- "count"

sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)

rownames(sentimentTotals) <- NULL

ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total Count") +
  ggtitle("Total Sentiment Score for Lyrics Bob Dylan")
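The totals reported above can also be typed in by hand, which is handy for checking the chart input without rerunning syuzhet. A minimal base R sketch: the numbers are the ones printed in this post, while the sorting and the negative share are additions for illustration.

```r
# Totals as printed in the sentiment output above
totals <- c(anger = 275, anticipation = 246, disgust = 209, fear = 358,
            joy = 225, sadness = 314, surprise = 157, trust = 326,
            negative = 662, positive = 541)

sentiment_demo <- data.frame(sentiment = names(totals),
                             count = unname(totals),
                             stringsAsFactors = FALSE)

# Largest count first, so the bars can be drawn in descending order
sentiment_demo <- sentiment_demo[order(-sentiment_demo$count), ]
print(sentiment_demo)

# Share of negative words among all words tagged positive or negative
negative_share <- totals[["negative"]] / (totals[["negative"]] + totals[["positive"]])
print(round(negative_share, 2))
```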


Friday, 21 October 2016
