Web Scraping and Word Clouds

This tutorial walks through how to scrape a website for the information you want, in our case the blog text, and then shape that data into a nice word cloud.

The first step is to make sure you have these packages: RCurl, XML, plyr, wordcloud, and RColorBrewer.

If you find you do not have these packages, copy and paste the next few lines to install them.

install.packages('RCurl')
install.packages('XML')
install.packages('plyr')
install.packages('wordcloud')
install.packages('RColorBrewer')

Then load them:

library(RCurl)
library(XML)
library(plyr)
library(wordcloud)
library(RColorBrewer)

Next we grab the URL we want to scrape information from.

html <- getURL("http://austinwehrwein.com", followlocation = TRUE)  # download the raw HTML, following redirects

Then we clean up the text we get from it. It is worth checking how much text came back so you can be confident the scrape grabbed what you wanted (a quick check follows the next code block).

doc <- htmlParse(html, asText = TRUE)           # parse the HTML into a document tree
webtext <- xpathSApply(doc, "//p", xmlValue)    # pull the text out of every <p> tag
words.l <- strsplit(webtext, "\\W")             # split into words on non-word characters
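
Before going further, a couple of quick console checks (not part of the pipeline itself, just sanity checks) will tell you whether enough text came back:

nchar(html)       # how many characters of raw HTML were downloaded
length(webtext)   # how many <p> blocks were extracted
head(webtext)     # peek at the first few paragraphs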

Then we take the words and create a data frame that we can add a frequency count to.

text <- unlist(words.l)     # flatten the list of word vectors into one character vector
d <- as.data.frame(text)
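
One optional extra: splitting on non-word characters tends to leave empty strings behind. If you want to drop those before counting, a quick filter like this works (entirely optional; the rest of the walk-through does not depend on it):

text <- text[text != ""]    # drop empty strings left over from the split
d <- as.data.frame(text)    # rebuild the data frame without them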

Now we clean up the data by counting how many times each word occurs and removing the duplicate rows, leaving one row per word with its count.

density <- ddply(d, .(text), "nrow")    # count how many times each word appears
names(density)[2] <- "count"
AW <- merge(d, density)                 # attach the count to every word
duplicated(AW$text)                     # these three lines just inspect the duplicates
AW[duplicated(AW$text), ]
unique(AW[duplicated(AW$text), ])
AW <- AW[!duplicated(AW$text), ]        # keep one row per word
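
As a side note, if you would rather not use plyr for this step, base R's table() gives an equivalent word/count data frame (an alternative sketch, not the method used above):

freq <- table(text)                                               # tabulate word frequencies
AW2 <- data.frame(text = names(freq), count = as.integer(freq))   # same shape as AW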

Finally we plot the information in a word cloud. The scale and colors are things you can play around with to customize the look.

require('wordcloud')
require('RColorBrewer')
pal2 <- brewer.pal(8, "Dark2")    # pick a color palette
png("wordcloud_packages.png", width = 250, height = 200)
wordcloud(AW$text, AW$count, scale = c(4, .5), min.freq = 1, max.words = Inf,
          random.order = FALSE, rot.per = .15, colors = pal2)
dev.off()

A couple of notes to wrap up. If you run this code all the way through, it will write the word cloud PNG to wherever your working directory is currently set. If you want the plot to show up in your RStudio Plots pane instead, rerun the wordcloud() line after calling dev.off().
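
For example, after dev.off() you can redraw the cloud straight to the Plots pane and tweak the scale and palette while you are at it (the values and palette here are just illustrative):

pal3 <- brewer.pal(9, "Set1")    # any RColorBrewer palette works here
wordcloud(AW$text, AW$count, scale = c(3, .3), min.freq = 2, max.words = 100,
          random.order = FALSE, rot.per = .15, colors = pal3)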

This leans a bit more toward the Digital Humanities side of the subject, but if you point this at blogs and pages that are mostly prose, you will notice the cloud tends to be dominated by high-frequency function words (the, and, of, etc.). That can actually be useful in many areas; see my undergraduate work on authorship attribution.
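
If you would rather the cloud not be all function words, one option is to filter them out before plotting. The short list below is just an example; you could swap in a fuller stopword list such as the one in the tm package:

stopwords <- c("the", "and", "of", "a", "to", "in", "is", "it", "that", "for", "on", "with")
AW.clean <- AW[!tolower(AW$text) %in% stopwords, ]    # drop common function words
wordcloud(AW.clean$text, AW.clean$count, scale = c(4, .5), min.freq = 1,
          random.order = FALSE, rot.per = .15, colors = pal2)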

Try doing this with other websites. This scraper grabs text within <p> tags, which, given how the web is structured today, may or may not be exactly what you want. Feel free to try it out, drop me a line, ask questions, and suggest other methods.
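
The "//p" expression passed to xpathSApply() is the piece to change for other structures. For example, to also pick up headings and list items you could do something like this (the selectors are just an illustration; adjust them for the site you are scraping):

webtext <- xpathSApply(doc, "//p | //h1 | //h2 | //li", xmlValue)    # paragraphs, headings, and list items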

Here is what the word cloud of this very website looks like:

[word cloud image: austindata]

Hope you enjoy,

Austin