Fun with Dendrograms

A dendrogram is a tree-style visualization that is commonly used to show the results of hierarchical clustering.

In the past few months I have started a new job and written a couple of fun tutorials, but most exciting of all, I have seen research I worked on during my time at the University of Nebraska published in Digital Scholarship in the Humanities (Viewable here). In this article we use various stylometric techniques to develop what linguists often refer to as an author's voice by analyzing high-frequency function words. (A brief and unjust description; more can be learned about stylometrics here.)

Dendrograms came into play when we wrote up the initial report, where they helped visualize the clustering of pieces of text. In my own quest, and that of many others, to enhance data visualizations (sometimes unsuccessfully on my part), I wanted to experiment with different ways to represent this dendrogram data.

In R, the standard dendrogram with the beloved mtcars dataset looks a little something like this.


hc <- hclust(dist(mtcars))
plot(hc)

This is the canned style of dendrogram, and it is fairly boring looking. It clearly illustrates the steps, distances, and labels, so why don't we spruce it up a bit?
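To see where the steps, distances, and labels in that plot come from, it can help to peek inside the object that hclust returns. This is only a minimal sketch; it rebuilds hc so the snippet runs on its own.

```r
# Rebuild the clustering so this snippet stands alone
hc <- hclust(dist(mtcars))

# Labels: one per leaf, taken from the row names of mtcars
head(hc$labels)

# Steps: each row of `merge` is one agglomeration step.
# Negative entries are single observations, positive entries
# refer to clusters formed at an earlier step.
head(hc$merge)

# Distance: the height at which each merge happens,
# which is what the vertical axis of the default plot shows
head(hc$height)
```

With n observations there are n - 1 merge steps, so mtcars (32 cars) gives 31 rows in merge and 31 heights.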


library(ape)  # provides as.phylo() and the plot method used here
plot(as.phylo(hc), type = "radial", cex = 0.50, cex.lab = .25)

If your dendrogram only needs to show the specific relationships, and not the distance and height values, you can try this method. The cex values dictate the size of the graph elements and help balance the size of the labels. (The ape package is needed for this example.)

mypal <- c("#696969", "#3333CC", "#1B676B", "#FF6B6B")
# cut the dendrogram into 4 clusters
clus18 <- cutree(hc, 4)
# plot, coloring each label by its cluster
op <- par(bg = "#FFFFFF")
plot(as.phylo(hc), type = "fan", tip.color = mypal[clus18], label.offset = .3, cex = 1, col = "red")
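The coloring works because cutree hands back one cluster number per car, and indexing the palette by those numbers gives one color per label. Here is a small sketch of that step on its own, using only base R; the names clus and tip_cols are just for illustration.

```r
# Rebuild the pieces so this snippet stands alone
hc <- hclust(dist(mtcars))
mypal <- c("#696969", "#3333CC", "#1B676B", "#FF6B6B")

# cutree() returns one cluster id (1 to 4) per car,
# in the same order as the rows of mtcars
clus <- cutree(hc, 4)
head(clus)

# Indexing the palette by cluster id yields one color per label
tip_cols <- mypal[clus]
head(tip_cols)

# How many cars fall into each of the 4 clusters
table(clus)
```

The table at the end is a quick way to check cluster sizes before deciding how many colors the palette needs.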

This final version is, I believe, the one that keeps the emphasis on data interpretation while providing more design elements than the canned dendrogram. I could not provide the actual data used in the aforementioned research, but I can provide the resulting dendrograms. These last two are the very same examples I have listed here, just with a bit more data than the mtcars dataset provides.


These two visuals do not put much emphasis on the height values, which can be very important. In that regard they are not the strongest presentations of the data, but if the visualization is meant only to represent the size of the clusters, this can be an option. Plus, I think they look pretty great as art.