Blog

Plane Crash Data - Part 1: Web Scraping

01.08.2017 08:50
by

Several months ago I stumbled across the Kaggle data set Airplane Crashes Since 1908. Since I couldn't find the data source, I searched the web for historical plane crash data and quickly found the web page http://www.planecrashinfo.com. On this site you can find various tables inside tables with lots of information on aviation accidents of the last century. What you also find is a nice exercise on web scraping to collect the data on the web by your own.

This is what I'll do in part one of the plane crash series. The second part treats google maps API requests, here I'm going to retrieve the geocoordinates for the locations in the data set on plane crashes. In the last part of the series I will visualize the flight routes and the crash locations using the leaflet package in R.

# load required packages
library("dplyr")
library("xml2")
library("rvest")
library("tidyr")
library("magrittr")

In order to understand the structure of the website we are going to query, I advise you to go ahaed and visit http://www.planecrashinfo.com, http://www.planecrashinfo.com/2016/2016 and for example http://www.planecrashinfo.com/2016/2016-1. The first site is the landing page. You find recent plane crashes and incidents there:

 

http://www.planecrashinfo.com

 

If you enter the second page, you can find data on all plane crashes happening in 2016:

 

Overview of accidents in 2016 (planecrashinfo.com/2016/2016)

 

By adding a -1 to the url, you access the first entry in the table and get detailed information on that:

 

Details about first crash in 2016 (planecrashinfo.com/2016/2016-1)

 

So, we first generate the URLs for the years 2014, 2015, and 2016.

# select from which years to get plane crash data and create URL for each
years <- c("2014", "2015", "2016")
urlYears <- paste0("http://www.planecrashinfo.com/", years, 
                   "/", years, ".htm") 

Next, we enter the table of the respective year, retrieve the number of entries in the table, and generate specific URLs for all entries.

# get URL for each plane crash table from each year selected

urls <- mapply(function(urlYear, year) {
  rowNumber <- urlYear %>% 
    read_html %>% 
    html_table(header = TRUE) %>% 
    .[[1]] %>% nrow
  
  urlsSpec <- paste0("http://www.planecrashinfo.com/", year, 
                     "/", year, "-", 1:rowNumber, ".htm")
},
urlYears,
years
) %>% unlist 

Now we can access the specific tables and retrieve the information.

# retrieve information 

data <- lapply(urls, function(url) {
  url %>% 
    # read each crash table
    read_html %>% 
    html_table %>% 
    data.frame %>%  
    setNames(c("Vars", "Vals")) %>%
    # header is a colunm and values are in a column -> tidy up
    spread(Vars, Vals) %>% 
    .[-1]
})

We put together the list of data frames to one data set and choose appropriate variable names. We remove the colon and use lower case letters for the variable names.

# data list to data.frame and set appropriate variable names
data %<>% 
  bind_rows %>% 
  setNames(gsub(":", "", colnames(.)) %>% tolower)

Now we choose only the variables relevant for our analysis.

# pick relevant variables
data %<>% select(date, location, route, operator, aboard, fatalities) 

Next, we want to clean up the strings contained in the location variable.

# clean location variable
data %<>% mutate(location = gsub("Near | near", "", location)) %>% 
  mutate(location = gsub(",.*,", ",", location))

Finally, we extract the number of people aboard by splitting the strings of the variable aboard and chosing the first element. We do the same with the variable fatalities in order to extract the number of people who died. Lastly, we calculate a ratio, giving us the percentage of people who died.

# extract total number of people aboard and fatalities 
data$aboard <- strsplit(data$aboard, " ") %>% 
  lapply("[[", 1)  %>% 
  as.numeric

data$fatalities <- strsplit(data$fatalities, " ") %>% 
  lapply("[[", 1) %>% 
  as.numeric

data$deathRate <- round((data$fatalities/data$aboard) * 100, 2)

Now the data is ready to be complemented with geocoordinates and then visualised.

Further parts of the Plane Crash series:

Go back