Workout Wednesday: Feeling Sentimental with Sentiment

I was clearing out old portable hard drives when I came across a book from some of my undergraduate work with Prof. Matthew Jockers. Most of the work we did dealt with sentiment in novels or with authorship attribution, and it ended up as a paper at some point. This brought great memories flooding back. I visited Dr. Jockers’s website to see if there was something there I could use for today’s Workout Wednesday, and stumbled upon his article, That Sentimental Feeling.

In this article Jockers goes into detail about his R package, syuzhet, and makes some striking comparisons of sentiment analysis methods, including his own. To demonstrate the comparisons he uses six different graphs depicting the sentiment arc across narrative time. Just eyeballing the graphs, they do seem to mirror each other. My recent bouts with sentiment analysis in famous speeches made this a perfect candidate for today’s workout. I was also excited to mess with syuzhet, since I had heard about it as an undergrad while it was in development.

Let’s start with some text



library(rvest)
library(dplyr)
library(ggplot2)
library(syuzhet)
library(tidyr)
#data
links<-data.frame(link=c('https://www.gutenberg.org/files/1661/1661-h/1661-h.htm',
                     'https://www.gutenberg.org/files/1342/1342-h/1342-h.htm',
                     'https://www.gutenberg.org/files/98/98-h/98-h.htm',
                     'https://www.gutenberg.org/files/76/76-h/76-h.htm',
                     'https://www.gutenberg.org/files/4300/4300-h/4300-h.htm',
                     'https://www.gutenberg.org/files/345/345-h/345-h.htm',
                     'https://www.gutenberg.org/files/74/74-h/74-h.htm',
                     'https://www.gutenberg.org/files/174/174-h/174-h.htm'),
                  name=c('The Adventures of Sherlock Holmes',
                      'Pride and Prejudice',
                      'A Tale of Two Cities',
                      'Adventures of Huckleberry Finn',
                      'Ulysses',
                      'Dracula',
                      'The Adventures of Tom Sawyer',
                      'The Picture of Dorian Gray'
                     ),stringsAsFactors = FALSE)

I am absolutely sure there is a more elegant way of doing this, but I only had an hour or so. So please forgive me.

Now that we have the links in a data frame, I wrote this loop to go through the data frame, read each link I provided, and use the functions from the syuzhet package to grab the sentiment scores and normalize them. Normalizing is important because if you just graphed the raw sentiment data you would get something that looks a little like this.



# speeches collects the per-paragraph scores; final collects the binned, rescaled scores
speeches<-NULL
final<-NULL
for(i in seq_len(nrow(links))){
  speech <- links$link[i] %>% 
    read_html() %>% 
    html_nodes('body p')%>%
    html_text()
  name<-links$name[i]
  line_number<-1:length(speech)
  indiv.speeches <-data.frame(name,speech,line_number,stringsAsFactors = FALSE)
  indiv.speeches$syuzhet<-get_sentiment(indiv.speeches$speech,method='syuzhet')
  norm.data<-data.frame(syuzhet=rescale(get_percentage_values(indiv.speeches$syuzhet,bins=100)))
  indiv.speeches$nrc<-get_sentiment(indiv.speeches$speech,method='nrc')
  norm.data$nrc<-rescale(get_percentage_values(indiv.speeches$nrc,bins=100))
  indiv.speeches$bing<-get_sentiment(indiv.speeches$speech,method='bing')
  norm.data$bing<-rescale(get_percentage_values(indiv.speeches$bing,bins=100))
  norm.data$book<-name
  norm.data$index<-1:nrow(norm.data)
  speeches<-rbind(speeches,indiv.speeches)
  final<-rbind(norm.data,final)
}

scaled.data<-final%>%gather(method,sent.scaled,1:3)
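For the curious, here is a rough base R sketch of what the normalization step in the loop is doing, as I understand it: cut the paragraph-level scores into equal slices of narrative time, average each slice, then squeeze the averages into the range -1 to 1. The helpers `bin_means` and `rescale_unit` are names I made up for illustration. They approximate `get_percentage_values` and `rescale` but are not the syuzhet implementations, and the input scores are invented.

```r
# Hypothetical base R helpers that approximate the binning and rescaling
# done in the loop. They are NOT the syuzhet functions themselves.

bin_means <- function(values, bins = 100) {
  # Require at least a couple of values per bin
  stopifnot(is.numeric(values), length(values) >= 2 * bins)
  # Assign each position to one of `bins` equal-width slices of narrative time
  segment <- cut(seq_along(values), breaks = bins, labels = FALSE)
  # Average the raw scores inside each slice
  as.numeric(tapply(values, segment, mean))
}

rescale_unit <- function(x) {
  # Map the smallest value to -1 and the largest to 1
  2 * (x - min(x)) / (max(x) - min(x)) - 1
}

# Made-up scores for a "book" of 10 paragraphs, reduced to 5 bins
raw <- c(1, 3, -2, 0, 4, 4, -1, -3, 2, 2)
binned <- bin_means(raw, bins = 5)
scaled <- rescale_unit(binned)
print(binned)
print(scaled)
```

Because every book ends up with the same number of bins on the same scale, a short novel and a long one can share an x axis, which is what makes the side-by-side arcs readable.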

Now we finally have some data we can work with. In the loop above, the handy syuzhet functions scored each book with three different methods of sentiment analysis, giving us three sets of values per book. Then we grabbed chunked (still using this terminology since Dr. J's class in undergrad) aggregate data for each book, so the data was easier to work with. That means that after the gather, we can plot!



#plot
ggplot(scaled.data,aes(index,sent.scaled, color = method,group=method)) +
  geom_line(stat = "identity", show.legend = FALSE) +
  facet_wrap(~book, ncol = 1, scales = "free_x") +
  labs(title = "Sentiment Arc in Novels",
       subtitle='Scaled comparison of sentiment analysis methods for various novels.',
       caption='Matthew Jockers, That Sentimental Feeling, \nProject Gutenberg ', 
       y = "Sentiment",x='') +
  theme_minimal()

The result is a graph similar to the one you see above. What is interesting is that the correlation between the methods is fairly high in most cases. But for some reason, in Ulysses the bing method registers a generally positive sentiment, whereas the other two are more negative. This calls for further inspection, but maybe at a later date.
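If you want to put a number on that eyeball comparison, one option is to compute the pairwise correlation between methods for each book. This is only a sketch: `method_correlations` is a helper name I made up, and it assumes a data frame shaped like `final` from the loop, with one row per bin and columns `book`, `syuzhet`, `nrc`, and `bing`.

```r
# Hypothetical helper (not part of syuzhet): Pearson correlation between
# each pair of methods, computed separately for each book.
method_correlations <- function(df, methods = c("syuzhet", "nrc", "bing")) {
  # All unordered pairs of method columns
  pairs <- combn(methods, 2, simplify = FALSE)
  rows <- lapply(split(df, df$book), function(book_df) {
    data.frame(
      book = book_df$book[1],
      pair = vapply(pairs, paste, character(1), collapse = " vs "),
      r = vapply(pairs,
                 function(p) cor(book_df[[p[1]]], book_df[[p[2]]]),
                 numeric(1)),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

# Usage with the data frame built in the loop (skipped if the loop has not been run)
if (exists("final")) print(method_correlations(final))
```

A low or negative value for a pair that includes bing in Ulysses would back up what the graph suggests.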

P.S. Your graph may not look exactly like mine because I used a personal ggplot2 theme. Also, if you are a student of the humanities looking for a great resource to build your skills in R, with specific exercises tailored to coursework, check out the book by Prof. Jockers, Text Analysis with R for Students of Literature.