Revisiting Trip Advisor: Extracting Attraction Data

This post shows how to scrape attraction-specific data from Trip Advisor using a CSV of URLs.

This is less a tutorial than an addendum. Hadley Wickham has already published fantastic code for harvesting hotel data from Trip Advisor, but a project came across my desk that required harvesting data for our attraction instead. The differences in the HTML elements are small, but they matter. A couple of years ago Trip Advisor changed the way dates appear on the site, so the step in Hadley’s code that converts strings into dates runs into a problem. The loop in this code simply harvests the dates as strings so they can be cleaned up afterward. It also runs off a CSV of URLs, which you can build with a gsub call based on your specific attraction.
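
As a rough sketch of the URL step, the snippet below builds a small CSV of page URLs with gsub. The base URL is a made-up placeholder, and the "-orN-" offset pattern (advancing by 10 reviews per page) is an assumption about how Trip Advisor paginates reviews, so check it against the URLs you actually see in your browser.

```r
# Hypothetical first-page URL; substitute the address of your own attraction.
base_url <- "https://www.tripadvisor.com/Attraction_Review-g00000-d000000-Reviews-Example_Attraction-Example_City.html"

# Assumption: later pages differ only by an "-orN-" offset after "-Reviews",
# with N advancing by 10 reviews per page.
offsets <- seq(10, 40, by = 10)

# Apply gsub() once per offset to build the URL of each later page.
paged <- vapply(
  offsets,
  function(n) gsub("-Reviews-", paste0("-Reviews-or", n, "-"), base_url, fixed = TRUE),
  character(1)
)

urls <- data.frame(url = c(base_url, paged), stringsAsFactors = FALSE)

# Written to a temporary file here; point this at a real path for actual use.
csv_path <- file.path(tempdir(), "attraction_urls.csv")
write.csv(urls, csv_path, row.names = FALSE)
```

Reading the file back with read.csv(csv_path, stringsAsFactors = FALSE) gives a one-column data frame of URLs to loop over.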

There is a commented-out section that addresses the date issue mentioned above. Use it if you notice that dates earlier than a certain point are being dropped.
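
For cleaning the string dates afterward, something along these lines may be enough. It is only a sketch: the example strings are made up, and both the "Reviewed" prefix and the "Month day, year" format are assumptions about what the scraped text looks like, so adjust them to match your own output.

```r
# Made-up example strings; the prefix and date format are assumptions.
raw_dates <- c("Reviewed March 12, 2015",
               "Reviewed January 3, 2014",
               "Reviewed 2 weeks ago")

# Drop the leading label and any surrounding whitespace.
stripped <- trimws(sub("^Reviewed[[:space:]]+", "", raw_dates))

# Parse absolute dates. Strings that do not match the format become NA
# rather than stopping the script. Month names follow the current locale.
parsed <- as.Date(stripped, format = "%B %d, %Y")

# Keep the raw string next to the parsed date so nothing is silently lost.
dates <- data.frame(raw = raw_dates, date = parsed, stringsAsFactors = FALSE)
```

Keeping the raw column makes it easy to spot which rows failed to parse (for example, relative dates) and handle them separately.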

To be clear, the backbone of this code is all from Hadley’s post. I just wanted to show how to get attraction-specific data from a CSV of URLs.

Uses for this could be:

  • Comparing multiple attractions in a geographic area
  • Tracking sentiment of your attraction over time
  • Visitor engagement scoring
  • Viewing visitors by member location
  • Dan

    Hi James,
    thank you, everything worked like a charm. I was wondering how to handle a missing node (as happens frequently when scraping data from Trip Advisor). For example, sometimes the location of a member is missing or nonexistent, and in that case the program halts instead of continuing and inserting an NA value. Do you have experience with this kind of issue?
    Thank you,
    Dan

  • Austin W.

    Dan,

    I haven’t run into that problem but when I used this for work we binned the user data and threw it into a separate DF for analysis. I am not surprised that the website handles this information differently. To deal with this you could inject a line after the row-bind that handles NULLs as NA. Something similar to:

    df[, rowthing] <- NA

    Let me know how this works for you since I am having trouble replicating and I will add it to the code in case someone else has this problem.
    -Austin

  • Dan

    Hi,
    I had some success with this code:
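    # Assumption: `reviews` is the set of review nodes being looped over;
    # the start of this line was lost, so the name is a guess.
    member <- lapply(reviews, function(x) {
      x %>% html_node(".username") %>% html_text(trim=TRUE) %>%
        ifelse(identical(., character(0)), NA, .)
    })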
    member <- unlist(member)
    Maybe it will be useful for you too.
    Dan

  • Sean

    Thank you Dan for your guidance.

    All the information that I am gathering looks lovely, however I am only able to retrieve a specific amount of text for longer reviews. If the review exceeds so many characters Trip Advisor has a “… More” function that condenses the text to limit space. Any thoughts on overriding this issue to gather the whole review instead of partials with trailing “…\n\n\nMore \n\n\n”? Please let me know if you need more clarification.

    Best!

  • Dan

    You have to use RSelenium together with rvest in order to give more “power” to the script.
    http://johndharrison.github.io/RSOCRUG/#1
    Dan

  • Dan

    Hi,
    I have uploaded a script that downloads the full reviews for a specific list of hotels (or attractions) from Trip Advisor.
    It’s just a first version, not very optimized but it does work.
    https://github.com/myliserta/tripadvisor-scrape-full-reviews/
    Dan