‘Digging’ into archaeology + leaflet

What do you call a really old joke? Pre-hysterical

I was talking with a friend recently, who is still doing graduate work, about some projects he has always wanted to do. We chatted about a few things and then he mentioned he always wanted to make a map of the archaeology sites in North America. ‘Of course it should be interactive’ he said, thinking that it would be great to see how close someone was to the nearest one. We talked about the data, which he said was probably going to be hard to find, and what the end goal could look like. Then as if to prod me into trying it he said, ‘… oh you like making maps and stuff don’t you?’. Yes. Yes I do. Also here is a link to the code or you can grab it from below.



Data cleaning and prep

First of all we need the data and originally it comes in KML form from the AIA website. KML in the past has given me lots of trouble such as tags being nested at different levels to the same data represented across multiple tags. I had tried everything from the xml package, followed Stack Overflow religiously, but couldn’t figure it out. So I just decided to turn it into a format I could deal with.



library(XML)
library(dplyr)
library(leaflet)
library(htmltools)
kml.text <- readLines("http://www.archaeological.org/Archaeology%20of%20America%20and%20Canada.kml")  
list<-xmlToList(kml.text)
raw<-unlist(list)
raw.df<-data.frame(var=as.character(attr(raw,'')),data=as.character(raw),stringsAsFactors = F)

Now that we have a dataframe we can search it for the data and associated tags. The resulting dataframe should look like something similar to the image below. Then it is as simple as finding examples of the data we want and using dplyr::subset to get them.



names<-subset(raw.df,var=='Document.Folder.Folder.Placemark.name')
#clean coords
coords<-subset(raw.df,var=='Document.Folder.Folder.Placemark.Point.coordinates')
coords$long<-as.numeric(lapply(strsplit(coords$data,','),"[",1))
coords$lat<-as.numeric(lapply(strsplit(coords$data,','),"[",2))
data<-data.frame(place=names$data,long=coords$long,lat=coords$lat,stringsAsFactors = F)
desc<-subset(raw.df,var=='Document.Folder.Folder.Placemark.ExtendedData.Data.value')

After running this we can see that we have ~496 individual sites. The names and coords pieces came out just fine, but the desc element is a mess! This is because the specific var that houses the description is used for other elements as well. We will have to clean those up too.



#clean desc
not<-grep('/|_',desc$data,value=T)
final<-desc[!desc$data %in% not,]
final$len<-sapply(gregexpr("\\W+", final$data), length) + 1
final<-subset(final,len >7)
#image
imgs<-subset(these,var=='Document.Folder.Folder.Placemark.ExtendedData.Data.value')
noti<-grep('wikipedia',imgs$data,value=T)
imgs<-desc[imgs$data %in% noti,]

What we did here was strip out any that were URLs. Then calculated the length, and used the length to subset the group to those that are most likely descriptions. Once those are gathered we loop through the list to match places, to descriptions, and try and grab some of the URLs as well. This next section is not as elegant as I would like it and I am sure there are ways. If you have any suggestions please shoot them my way.



agg<-NULL
for(i in 1:nrow(data)){
  id<-data$place[i]
  temp.desc<-grep(id,final$data,value=T)
  urls<-strsplit(tolower(imgs$data),'/')
  img<-grep(id,imgs$data,value = T,fixed = T)
  len<-length(temp.desc)>0
  leni<-length(img)>0
  temp.desc<-ifelse(len==T,temp.desc,'NA')
  img<-ifelse(leni,img,'NA')
  found<-data.frame(place=id,desc=temp.desc,img=img,stringsAsFactors = F)
  agg<-rbind(found,agg)
}
all<-merge(data,agg,by=c('place'))
all<-all[!duplicated(all$place),]
all$color<-ifelse(all$desc=='NA','#444444','#84bc5b')
all$info<-ifelse(all$desc=='NA','Info Unavailable','Available Info')

After the loop we merge the data, remove some that may have been duplicated, and assign labels and colors so we can plot. The following leaflet may look a bit complex but I wanted to provide links to images for the site when my matching identified one. So the popup element is generated all the text and formatting you see in the example and when you click on an individual site.



mound<-subset(all,place=='Moundville')
     leaflet(all) %>% addProviderTiles("CartoDB.Positron")%>%
       setView(lng = -99.29, lat = 41.1987, zoom = 4)%>%
       addCircleMarkers(~long, ~lat,radius=ifelse(all$desc=='NA',2,5),color=~color,popup = paste(sep="
",ifelse(all$img!='NA',paste0("',all$place,'',""),paste("",all$place,"")),paste('(',all$long,',',all$lat,')'),all$desc),group=~info)%>% addPopups(mound$long, mound$lat, popup = paste(sep="
",ifelse(mound$img!='NA',paste0("',mound$place,'',""),paste("",mound$place,"")),paste('(',mound$long,',',mound$lat,')'),mound$desc))%>% addLayersControl( overlayGroups = ~info, options = layersControlOptions(collapsed = FALSE) )