Quick R: Web Scraping Loop

One common challenge research shops face, especially in the development world, is keeping track of their constituency.

To stay on top of this, I have developed an easy-to-run R loop that lets you scrape various news websites for mentions of your constituency. The code runs off a two-column dataframe, either built in the environment or read in from a csv (see the sketch after this list). The first column holds your website URLs; the second holds the XPath of the element you want from each page. Ideas for uses are:

News websites, to compare against constituent names

Obituary websites, to track deceased members

Sentiment tracking on topics of interest
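
For concreteness, here is a minimal sketch of that input structure; the URLs, XPaths, and file name below are placeholders, not the actual sites being monitored:

```r
# Two-column input: one row per site, with the page URL and the
# XPath of the element to harvest. All values are placeholders.
sites <- data.frame(
  url   = c("https://example.com/news", "https://example.com/obituaries"),
  xpath = c("//h2[@class='headline']", "//p[@class='obit-name']"),
  stringsAsFactors = FALSE
)

# Or read the same two columns in from a csv:
# sites <- read.csv("sites.csv", stringsAsFactors = FALSE)
```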

The underlying code is fairly straightforward thanks to the great rvest package, which lets us scrape each element row by row. The line-by-line logic behind the code is as follows (a full sketch appears after the list):

  1. Follow and “read” each URL
    1. Harvest the elements we want
    2. Output as text
  2. Remove the awkward symbols
  3. Unlist for sentiment analysis
  4. Create a matrix
  5. Get the Bing sentiment value
  6. Create dataframe of sentiment and article
  7. Clean up
  8. Put it all together
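
Here is a minimal sketch of that loop, assuming the two-column `sites` dataframe shown earlier. The rvest, tidytext, and dplyr calls are standard; the object names and the cleaning regex are illustrative rather than the exact production code:

```r
library(rvest)
library(tidytext)
library(dplyr)

bing <- get_sentiments("bing")   # Bing lexicon: word + positive/negative
results <- list()

for (i in seq_len(nrow(sites))) {
  # 1. Follow and "read" each URL
  page <- read_html(sites$url[i])

  # 1.1 Harvest the elements we want, 1.2 output as text
  raw <- page %>%
    html_nodes(xpath = sites$xpath[i]) %>%
    html_text()

  # 2. Remove the awkward symbols (keep letters, digits, spaces)
  clean <- gsub("[^[:alnum:][:space:]]", " ", raw)

  # 3. Unlist the scraped text into a single vector of words
  words <- unlist(strsplit(tolower(clean), "\\s+"))
  words <- words[words != ""]

  # 4. Create a matrix (one column of words) to join against
  word_mat <- matrix(words, ncol = 1, dimnames = list(NULL, "word"))

  # 5. Get the Bing sentiment value for each matching word
  scored <- as.data.frame(word_mat, stringsAsFactors = FALSE) %>%
    inner_join(bing, by = "word")

  # 6. Create a dataframe of sentiment counts and the article source
  results[[i]] <- data.frame(
    url      = sites$url[i],
    positive = sum(scored$sentiment == "positive"),
    negative = sum(scored$sentiment == "negative")
  )

  # 7. Clean up the per-iteration objects
  rm(page, raw, clean, words, word_mat, scored)
}

# 8. Put it all together into one dataframe
all_sentiment <- do.call(rbind, results)
```

In practice it is worth wrapping the `read_html()` call in `tryCatch()` so that one unreachable site does not stop the whole loop.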

We are currently using this loop to monitor websites and match the scraped text against our constituent member database (a sketch of that matching step follows).
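
As a rough illustration, that matching step can be as simple as checking each constituent name against the harvested text. Here `scraped_text` (a character vector of article text) and `constituents` (a dataframe with a `full_name` column) are hypothetical names standing in for our actual objects:

```r
# A sketch of the matching step. `scraped_text` and `constituents`
# are hypothetical stand-ins for the harvested article text and
# the member database.
library(stringr)

name_found <- sapply(constituents$full_name, function(nm) {
  any(str_detect(scraped_text, fixed(nm, ignore_case = TRUE)))
})
hits <- constituents[name_found, ]
```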