Heatmaps with Divvy Data

Even at zero degree temps, Divvy has a group of hardcore users.

Fresh off a large project at work exploring some attendance data, I wanted to share some of the data exploration and visualization techniques we used. I can’t share the exact project, but luckily Divvy, the Chicago bike-sharing company, posts their stellar data quarterly. This gives us a really exciting dataset to explore. I have used their data before, attempting to visualize their most used routes. This time we will be doing less mapping and a little wrangling to get rider numbers. What do we think the data will show? Lots of rides during the summer months, probably.

Begin

First off, you will need to download all the data from the Divvy data page and put the trip CSVs in a single folder. This will be your working directory within your R project. Now that those are downloaded, we need to import them all into R. To do this we will write a mini function that searches the working directory for any file with a .csv extension and reads it in. (Note: it is important that the only CSVs in your working directory are the trips from Divvy.)
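Here is a minimal sketch of what such a mini function could look like. The name read_trips is made up for illustration, and the sketch assumes every CSV in the folder has the same columns; if they differ, rbind will fail and the columns need to be harmonized first.

```r
# Read every .csv file in a folder and stack the rows into one data frame.
# Assumption: all files share the same column names.
read_trips <- function(dir = getwd()) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  if (length(files) == 0) stop("no .csv files found in ", dir)
  # Read each file, keeping text columns as plain character strings
  pieces <- lapply(files, read.csv, stringsAsFactors = FALSE)
  # Bind all the per-file data frames together row-wise
  do.call(rbind, pieces)
}

# Usage, with the working directory set to the folder of trip CSVs:
# trips <- read_trips()
```

With several years of trips this is slow, since read.csv has to parse every file before anything is combined.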

Well, that took a while to load. Now that all the data is in, we can start tidying it up and adding some variables to help us make the heat map. Looking at the data, each row represents a single ride, but we want the count of rides per day. There are many ways to do this, but I like to use the aggregate function.
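As a rough sketch of the counting step, here it is on a tiny made-up data frame. The column names trip_id and starttime, and the month/day/year timestamp format, are assumptions; check them against the files you downloaded.

```r
# Toy stand-in for the combined trips data (made-up rows, assumed columns)
trips <- data.frame(
  trip_id   = 1:5,
  starttime = c("1/27/2014 8:05", "1/27/2014 17:30",
                "1/28/2014 9:00", "1/28/2014 9:10", "1/28/2014 18:45"),
  stringsAsFactors = FALSE
)

# Keep only the calendar day of each ride, stored as a character string
trips$day <- format(as.Date(trips$starttime, format = "%m/%d/%Y"), "%Y-%m-%d")

# Count rides per day; with by = list(...) the grouping column is named Group.1
daily <- aggregate(trips$trip_id, by = list(trips$day), FUN = length)
names(daily)[2] <- "rides"
print(daily)
```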

Much better. To make the type of heat map I want, we will need to populate our data frame with some specific day/week/month information. One problem we have to fix before we get to that is that the date, Group.1, is actually just a character string. Once we convert that back to a date with the aptly named as.Date, we can make the variables we need.
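A sketch of that conversion and the helper variables, again on made-up counts. Which variables you need depends on the heat map layout you are after; the ones below (year, month, day of month, weekday, week of year) are just one plausible set, not necessarily the ones used for the plot in this post.

```r
# Toy daily counts; Group.1 holds dates as character strings (made-up values)
daily <- data.frame(
  Group.1 = c("2014-01-27", "2014-01-28"),
  rides   = c(2, 3),
  stringsAsFactors = FALSE
)

# ISO-style strings (YYYY-MM-DD) parse without an explicit format
daily$date <- as.Date(daily$Group.1)

# POSIXlt exposes the date parts as numbers, independent of locale
lt <- as.POSIXlt(daily$date)
daily$year    <- lt$year + 1900
daily$month   <- factor(month.abb[lt$mon + 1], levels = month.abb)
daily$mday    <- lt$mday
day_names     <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
daily$weekday <- factor(day_names[lt$wday + 1], levels = day_names)

# Week of the year, counting weeks from Sunday
daily$week <- as.integer(format(daily$date, "%U"))
```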

Finally we can make the first heat map. It relies on a couple of packages loaded at the start of the script. One of them, ggthemes, provides a nice minimalist theme I like to use to help format the plot.
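For readers who want something to start from, here is one possible way to draw a calendar-style heat map. It assumes the ggplot2 and ggthemes packages are installed, uses randomly generated ride counts instead of the real data, and picks theme_tufte as the minimalist theme; the layout and theme behind the plot in this post may well be different.

```r
library(ggplot2)
library(ggthemes)

# Made-up ride counts, one per day of 2014 (not real Divvy numbers)
set.seed(1)
daily <- data.frame(date = seq(as.Date("2014-01-01"), as.Date("2014-12-31"), by = "day"))
daily$rides <- rpois(nrow(daily), lambda = 100)

# Month and day of month decide where each tile goes
lt <- as.POSIXlt(daily$date)
daily$month <- factor(month.abb[lt$mon + 1], levels = month.abb)
daily$mday  <- lt$mday

# One tile per day, colored by ride count, with January at the top
p <- ggplot(daily, aes(x = mday, y = month, fill = rides)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "lightyellow", high = "red", name = "Rides") +
  scale_y_discrete(limits = rev(month.abb)) +
  labs(x = "Day of month", y = NULL, title = "Rides per day") +
  theme_tufte()

print(p)
```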

 

[Figure: heat map of Divvy rides per day]

Looks cool! With a little napkin logic we could probably have reasoned out that the “warm” months in Chicago are going to be the prime months of usage, thanks to tourists and weekend riders. Well, at least this confirms our thoughts. There are a lot of pieces of this data we could filter or expand by. I decided to plot the variance, meaning the difference, between the mean age of each day’s riders and the overall mean age of Divvy riders for the data available: 36.55688. As an aside, I think that number is very interesting. The mean age is a bit older than I would have guessed, but back to comparing the variance. What I see is sort of awesome…

[Figure: heat map of each day’s mean rider age relative to the overall mean age]

The variance here is a simple difference, not the statistical variance:

The mean age of all the riders that day, subtracted from the mean age of all the riders in the dataset.
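That definition can be sketched in a few lines. The ages below are made up, and the day and age columns are assumptions about how the trips have been prepared. The sign follows the wording above (overall mean minus that day’s mean), so days with older-than-average riders come out negative; flip the subtraction if you prefer it the other way around.

```r
# Toy trips with a day and a rider age (made-up values, assumed columns)
trips <- data.frame(
  day = c("2014-01-27", "2014-01-27", "2014-07-04", "2014-07-04", "2014-07-04"),
  age = c(40, 60, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Mean age across every ride in the dataset
overall_mean <- mean(trips$age, na.rm = TRUE)

# Mean age of the riders on each day
age_by_day <- aggregate(trips$age, by = list(trips$day), FUN = mean, na.rm = TRUE)
names(age_by_day) <- c("day", "mean_age")

# The day's mean age subtracted from the overall mean age
age_by_day$age_diff <- overall_mean - age_by_day$mean_age
print(age_by_day)
```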

With that in mind, we can see that during non-peak, and pretty much non-ideal, riding weather the riders are a bit older than the mean*. What does this look like? Let’s take 1/27/2014 as an example. The mean temperature for the day was 0 degrees F. That’s cold. There were 406 riders that day. The mean age was 39, but the age range was 19-65! Nobody would expect to see a child on a Divvy on a 0-degree day, so the lack of young riders is no surprise, but 65 is well above the mean age, especially for someone riding on such a cold day. As I have mentioned, there is a ton more to be done with this data, and I would enjoy seeing what other people come up with.

*I am aware that the overall mean age is misleading as a baseline: the large number of rides during the summer months drives it down relative to the rest of the year, and the age itself is only an approximation.