Comparing Twitter Sentiment Analysis Scores

During my Digital Humanities research in school I came across a number of modes and methods for linguistic study, but one that kept popping up was sentiment analysis. Every field and subject has its “hot” topic, and I believe sentiment analysis is it in Digital Humanities. Really exciting work has come out of people like Matthew Jockers (whose syuzhet package we will be using), comparing human sentiment coding against scores from the NRC lexicon and others.

It is a “hot” topic for many reasons: it is one of the newer and more exciting areas of text mining, yet it is also one of the most contentious. This is only meant as a little background, and I would like to get right into the R walk-through, so if you want to learn more about digital humanities, visit Jockers’s page or the CDRH (Center for Digital Research in the Humanities at UNL) website.

To cut down on the lines of code on this page (and since I have gone over it quite a few times before), visit Julianhi’s Twitter connection tutorial before we get started.
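For reference, the connection step from that tutorial boils down to a single setup_twitter_oauth() call; the credential strings below are placeholders you get from your own Twitter app page.

#connect to the Twitter API (placeholder credentials -- use your own keys)
library(twitteR)
setup_twitter_oauth("YOUR_CONSUMER_KEY", "YOUR_CONSUMER_SECRET",
                    "YOUR_ACCESS_TOKEN", "YOUR_ACCESS_SECRET")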

Here are the packages we will need:

library(twitteR) #you will need this for the Twitter connection and search
library(ggplot2) #plotting
library(syuzhet) #sentiment scoring (NRC, AFINN, Bing)
library(dplyr)   #data manipulation

Then we will search the Twitter-verse for the tweets we want:

#Search for recent tweets containing "I feel"
tweets <- searchTwitter('I feel', n = 200)
ai <- twListToDF(tweets)

#quick clean: force UTF-8, strip punctuation, lowercase
sentence <- iconv(enc2utf8(ai$text), sub = "byte")
sentence <- gsub('[[:punct:]]', '', sentence)
sentence <- tolower(sentence)

#add the cleaned text back to the df
ai$text <- sentence
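Tweets also tend to contain URLs, @mentions, and retweet markers; if you want a slightly cleaner input you can add a few extra gsub() passes before the punctuation is stripped (optional, and not part of the original clean-up above).

#optional extra cleaning -- run these before the punctuation gsub above
sentence <- gsub('http\\S+', '', sentence)  #drop URLs
sentence <- gsub('@\\w+', '', sentence)     #drop @mentions
sentence <- gsub('\\bRT\\b', '', sentence)  #drop retweet markers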

Finally we are at the point where we can start adding the sentiment scores. The syuzhet package includes a few different sentiment scoring methods, so let’s make a plot that compares them; you will notice that each one looks a little different. The NRC score has to be assembled by hand: make the negative score column negative and add it to the positive score column. The last line of each code block assigns a positive, negative, or neutral label to the score so we can color the charts.
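To see why the NRC score has to be assembled by hand, it helps to look at what get_nrc_sentiment() returns: a data frame with eight emotion columns plus separate negative and positive counts, rather than one signed score. A toy example (the sentence is just an illustration):

library(syuzhet)
#get_nrc_sentiment() returns emotion counts plus 'negative' and 'positive' columns
example <- get_nrc_sentiment("I feel great about this wonderful day")
example[, c("positive", "negative")]
#a single signed score is positive minus negative
example$positive - example$negative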

Each sentiment scorer’s block of code is grouped under its own # title:

#NRC Sentiment: combine the positive and negative counts into one signed score
ai$text <- as.character(ai$text)
hms <- get_nrc_sentiment(ai$text)
hms$negative <- hms$negative * -1
hms$score <- hms$negative + hms$positive
ai$score <- hms$score
ai$tweet <- ifelse(ai$score > 0, 'positive', ifelse(ai$score < 0, 'negative', 'neutral'))

#Afinn Sentiment: get_sentiment() returns a numeric vector we can assign directly
hms <- get_sentiment(ai$text, method = 'afinn')
ai$afinn <- hms
ai$tweeta <- ifelse(ai$afinn > 0, 'positive', ifelse(ai$afinn < 0, 'negative', 'neutral'))

#Bing Sentiment
hms <- get_sentiment(ai$text, method = 'bing')
ai$bing <- hms
ai$tweetb <- ifelse(ai$bing > 0, 'positive', ifelse(ai$bing < 0, 'negative', 'neutral'))

One thing to note in the Afinn and Bing blocks: get_sentiment() returns a plain numeric vector, so its output can be assigned straight into a data frame column without any reshaping or transposing.
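Before plotting, a quick way to check how much the scorers agree is to cross-tabulate their positive/negative/neutral labels (just a sanity check, not required for the plots below):

#how often do the labels from the three scorers agree?
table(NRC = ai$tweet, AFINN = ai$tweeta)
table(NRC = ai$tweet, Bing = ai$tweetb)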

Here is a quick multiplot function that I harvested from somewhere a long time ago; it is described in the wonderful “R Graphics Cookbook” here.

multiplot <- function(..., plotlist = NULL, file, cols = 1, layout = NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots <- length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots / cols)),
                     ncol = cols, nrow = ceiling(numPlots / cols))
  }

  if (numPlots == 1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

Let’s plot!

p1 <- ggplot(ai, aes(x = text, y = score, fill = tweet)) +
  geom_bar(stat = 'identity', position = 'identity') +
  theme(axis.ticks = element_blank(), axis.text.y = element_blank(),
        legend.position = 'none', axis.title.y = element_blank()) +
  labs(x = "Tweet", y = "NRC Sentiment") +
  coord_flip()

p2 <- ggplot(ai, aes(x = text, y = afinn, fill = tweeta)) +
  geom_bar(stat = 'identity', position = 'identity') +
  theme(axis.ticks = element_blank(), axis.text.y = element_blank(),
        legend.position = 'none', axis.title.y = element_blank()) +
  labs(x = "Tweet", y = "Afinn Sentiment") +
  coord_flip()

p3 <- ggplot(ai, aes(x = text, y = bing, fill = tweetb)) +
  geom_bar(stat = 'identity', position = 'identity') +
  theme(axis.ticks = element_blank(), axis.text.y = element_blank(),
        legend.position = 'none', axis.title.y = element_blank()) +
  labs(x = "Tweet", y = "Bing Sentiment") +
  coord_flip()

multiplot(p1, p2, p3, cols = 3)
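If you want to keep the combined figure, note that ggsave() only captures a single ggplot object, so the grid-based multiplot has to be wrapped in a graphics device instead (the filename and dimensions here are just examples):

#save the three-panel comparison to a file
png("sentiment_comparison.png", width = 1200, height = 800)
multiplot(p1, p2, p3, cols = 3)
dev.off()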

Because of the coord_flip, the y-axis has one bar for every single tweet, and the x-axis shows the score values.

This graph shows how differently each sentiment scorer behaves on a small scale of about 20 tweets; the featured image is what it looks like for about 200.
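To reproduce the small-scale version yourself, one option (not part of the original code) is to rebuild the plots from just the first 20 rows of the data frame, for example:

#plot only the first 20 tweets for a readable small-scale comparison
ai_small <- head(ai, 20)
p1_small <- ggplot(ai_small, aes(x = text, y = score, fill = tweet)) +
  geom_bar(stat = 'identity') +
  labs(x = "Tweet", y = "NRC Sentiment") +
  coord_flip()
p1_small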