Recently a number of people have asked me if I could provide a small walk-through of how to do a similar web scrape in Python. So for those bilinguals out there, here it is.
I do want to preface this by saying that I suggest reading any website’s T.O.S. before using information you get from scraping. I was recently working on a project where we harvested data from a lot of different websites, and we contacted them and read the T.O.S. to make sure there was no infringement. That being said, I usually just use my own website in the examples.
There are a couple of things you need to know depending on which language you are using, R or Python. For R I prefer the rvest package, which provides a number of useful functions and, I believe, results in a cleaner product. For Python you will need to know the XPath of the title/headline/paragraph you want to scrape. Since I am a Chrome user, I will say there is a nifty extension called SelectorGadget that you can use to find a selector or XPath for any element on a page (or you can just right-click and inspect the element).
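To make the XPath idea concrete, here is a minimal sketch using lxml on a small made-up HTML snippet (the snippet and its contents are hypothetical, just there to show what an XPath expression actually selects):

```python
from lxml import html

# A small hypothetical page to illustrate how an XPath targets elements.
snippet = """
<html><body>
  <h1>My Blog</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

tree = html.fromstring(snippet)

# '//p/text()' means: anywhere in the document, take the text
# inside every <p> element.
print(tree.xpath('//p/text()'))
```

An XPath found via SelectorGadget or the inspector drops into `tree.xpath()` the same way.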
Now we can start.
On my website I want to grab the paragraphs, so that is what I will be pulling in. Unlike the Python approach, rvest works with the HTML tag rather than the XPath (hence the use of “p”).
library(rvest)

website <- read_html("http://austinwehrwein.com/")
paragraphs <- html_nodes(website, "p")
text <- html_text(paragraphs)
text
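For comparison, the same tag-based selection is possible on the Python side too; here is a minimal sketch with lxml on a hypothetical inline snippet (no network call), where `findall('.//p')` plays the role of rvest’s `html_nodes(website, "p")`:

```python
from lxml import html

# Hypothetical HTML standing in for a real page fetch.
snippet = "<html><body><p>Hello</p><p>World</p></body></html>"
tree = html.fromstring(snippet)

# Select by tag name, much like html_nodes(website, "p") in rvest,
# then pull the text out of each node, like html_text().
paragraphs = [p.text_content() for p in tree.findall('.//p')]
print(paragraphs)
```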
Python is a little bit different. (I use IPython Notebook.)
from lxml import html
import requests

page = requests.get('http://austinwehrwein.com')
tree = html.fromstring(page.text)
headlines = tree.xpath('//p/text()')
print(headlines)
The result is just a list of the text from the specific tag you chose. From here you can make a standard wordcloud, as seen here, or do whatever you want.
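As a first step toward that wordcloud, a minimal sketch of turning the scraped list into word frequencies (the `headlines` list here is a hypothetical stand-in for the real scrape result):

```python
from collections import Counter
import re

# Hypothetical scrape result standing in for the real list of paragraphs.
headlines = ['Web scraping with Python', 'Python makes scraping easy']

# Join the paragraphs, lowercase them, and split into words.
words = re.findall(r'\w+', ' '.join(headlines).lower())
counts = Counter(words)
print(counts.most_common(3))
```

Most wordcloud tools accept exactly this kind of word-to-frequency mapping.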