Side by Side: Web Scraping in R vs. Python

Recently a number of people have asked me if I could provide a small walk-through of how to do a similar web scrape in Python. So, for the bilinguals out there, here it is.

I do want to preface this by saying that I suggest reading any website's terms of service before using information you get from scraping. I was recently working on a project where we harvested data from a lot of different websites, and we contacted the site owners and read their terms of service to make sure there was no infringement. That being said, I usually just use my own website in these examples.

There are a couple of things you need to know depending on which language you are using, R or Python. For R I prefer the rvest package, which provides a number of useful functions and, I believe, results in a cleaner product. For Python you will need to know the XPath of the title/headline/paragraph you want to scrape. Since I am a Chrome user, I will point out a nifty add-on called SelectorGadget that you can use to find the XPath of any element on a page (or you can just right-click and inspect the element).
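For example, the XPath //p matches every <p> element on a page, and //p/text() returns just the text inside each one; that second expression is exactly what the Python example below uses.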

Now we can start.

R

On my website I want to grab the paragraphs, so that is what I will be pulling in. Unlike the Python version below, this approach selects by HTML tag rather than by XPath (hence the use of "p").

library(rvest)

# Read the page, select every <p> node, and extract the text
website <- read_html("http://austinwehrwein.com/")
website <- html_nodes(website, "p")
website <- html_text(website)
website
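The output is a character vector with one entry per paragraph, which you can feed into whatever text processing you like.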

Python

Python is a little bit different. (I use IPython Notebook)

from lxml import html
import requests

# Fetch the page and parse it into an element tree
page = requests.get('http://austinwehrwein.com')
tree = html.fromstring(page.content)

# XPath: pull the text out of every <p> element
headlines = tree.xpath('//p/text()')

# Join the pieces into one string for later text processing
text = ' '.join(headlines)

print(headlines)

The result is a list of the text strings from whichever tag you chose. From here you can do a standard word cloud, as seen here, or whatever you want.
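If you want to go from the scraped text straight to a word cloud in Python, here is a minimal sketch. It assumes the third-party wordcloud and matplotlib packages are installed (pip install wordcloud matplotlib) and reuses the joined text string from the snippet above.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Build the word cloud from the joined paragraph text
cloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Render it with matplotlib
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()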