--- title: 'Stats 101: Better flight experiences with data (airline delays in New York City)' author: "Nicholas J. Horton (Amherst College) and Ben Baumer (Smith College)" date: "December 7, 2015" output: pdf_document: fig_height: - 3.5 fig_width: - 7 keep_tex: - yes html_document: fig_height: - 3 fig_width: - 5 word_document: fig_height: - 3 fig_width: - 5 --- ```{r include=FALSE} # Don't delete this chunk if you are using the mosaic package # This loads the mosaic and dplyr packages library(mosaic); library(lubridate) ``` ```{r include=FALSE} # Some customization. You can alter or delete as desired (if you know what you are doing). # This changes the default colors in lattice plots. trellis.par.set(theme=theme.mosaic()) options(digits=3) # knitr settings to control how R chunks work. require(knitr) opts_chunk$set( tidy=FALSE, # display code as typed size="small" # slightly smaller font for code ) ``` Statistics students (and instructors) need experience wrestling with large, messy, complex, challenging data sets, for which there is no obvious goal or specially-curated statistical method. In this example, we consider a case study from a subset of the 180 million record Airline Delays dataset (see http://stat-computing.org/dataexpo/2009) that includes n=336,776 domestic commercial flights originating in New York City area airports (Newark, JFK, and LaGuardia) in 2013. These data are made available as a series of comma separated variable (CSV) files or through Hadley Wickham's `nycflights13` package on CRAN and allow students to explore a variety of statistical questions related to flights from NYC airports. These five separate datasets can easily be merged (see the appendix for a list of the first few observations in each of these tables.) More details and extended examples can be found at http://www.amherst.edu/~nhorton/precursors. ```{r, echo=FALSE, eval=TRUE, message=FALSE} require(mosaic); require(nycflights13) ``` ```{r echo=FALSE} # derive variables of interest... len <- nchar(flights$dep_time) hour <- as.numeric(substring(flights$dep_time, 1, len-2)) min <- as.numeric(substring(flights$dep_time, len-1, len)) flights <- mutate(flights, deptime = hour+min/60) flights <- mutate(flights, realdelay = ifelse(is.na(arr_delay), 240, arr_delay)) ``` Students can use this dataset to address questions that they find real and relevant. (It is not hard to find motivation for investigating patterns of flight delays. Ask students: have you ever been stuck in an airport because your flight was delayed or cancelled and wondered if you could have predicted the delay if you'd had more data?) This dataset is attractive because it is more similar to what analysts actually see in the wild than what is typically taught in the introductory statistics classroom. #### Flights to San Francisco Bay We start with an analysis focused on three airports in the San Francisco Bay area (OAK, SFO, and SJC) for flights that depart from New York City airports. ```{r warning=FALSE, message=FALSE, echo=FALSE, results="asis"} require(xtable) options(xtable.comment = FALSE) xtab <- xtable(filter(airports, faa %in% c('SFO', 'OAK', 'SJC'))) print(xtab) ``` How many flights are there to each airport in January, 2013? ```{r echo=FALSE, results="asis"} airportcounts <- flights %>% filter(dest %in% c('SFO', 'OAK', 'SJC')) %>% group_by(year, month, dest) %>% summarise(count = n()) xtable(filter(airportcounts, month==1)) ``` Almost all are to San Francisco International (SFO). Let's take a closer look at what carriers service this route. ```{r echo=FALSE, results="asis", message=FALSE} airlines <- mutate(airlines, name=as.character(name), carrier=as.character(carrier)) sfoflights <- inner_join(filter(flights, dest=="SFO"), airlines) tab1 <- tally(~ name, margins=TRUE, data=sfoflights) tab2 <- tally(~ name, format="percent", margins=TRUE, data=sfoflights) tab <- cbind(count=tab1, percent=tab2) xtable(tab) ``` United is the largest carrier (it accounts for more than half of the flights). #### Are there different delays by carrier? Each of the carriers has at least a thousand flights, so it's likely that estimates of arrival delays may be reasonable to estimate. Let's calculate summary statistics of the arrival delay for the flights to SFO by carrier. ```{r echo=FALSE, warning=FALSE, results="asis"} xtable(favstats(arr_delay ~ name, data=sfoflights)) ``` The "average" results (as provided by the median) is that flights arrive a few minutes early for each of these carriers. And even the 3rd quartile or the mean are relatively modest delays (all less than 20 minutes after the scheduled arrival time). But the maximum delays can be large (e.g., more than 10 hours for Virgin America and American Airlines). We also observe that a number of flights are missing their arrival delay. Those missing values are likely cancelled flights. We might be interested in which month they occurred? ```{r echo=FALSE} tab <- t(tally(month ~ 1, margin=TRUE, data=filter(sfoflights, is.na(arr_delay)))) tab <- tab[,-13] barchart(tab, stack=FALSE, xlab="Number of cancellations per month") ``` Cancelled flights seem to be most common in July, February, and December. How should the cancelled flights be handled? (Note that there were excluded from the calculation of the summary statistics displayed earlier.) One option might be to recode these as 4 hour (240 minute) delays, since it's likely that if a flight is cancelled the `expected` delay might be of that duration on average. (This is an arbitrary choice: students might be asked what other options are reasonable. More sophisticated approaches could implement a "hurdle" method with a model for the probability of not being cancelled along with a model for the "average" delay for those flights that were not cancelled.) Let's revisit the distribution of real delays (accounting for cancellations) by carrier. ```{r echo=FALSE, warning=FALSE, results="asis"} xtable(favstats(realdelay ~ name, data=sfoflights)) ``` A parallel graphical description of the flights delays to San Francisco airport can be used to judge the airlines. ```{r echo=FALSE, warning=FALSE} densityplot(~ realdelay, group=name, auto.key=TRUE, xlim=c(-120, 300), xlab="delay (including cancelled flights)", data=sfoflights) bwplot(name ~ realdelay, xlim=c(-120, 300), xlab="delay (including cancelled flights)", data=sfoflights) ``` Note that the distributions have been rescaled so that only those flights between 2 hours early and 5 hours late are displayed (this excludes some of the extreme outliers). The distributions appear to be somewhat symmetrically distributed around zero delays but with extremely long right tails. Different information is conveyed in the two representations: the overlapping density plots provide a clear sense of the shape of the distributions but are somewhat crowded. The boxplots make it easy to compare airline reliability, and to see the quantiles. #### Is there seasonality to the number of flights? We can consider whether the number of flights changes month by month. ```{r, message=FALSE, echo=FALSE} sfocounts <- filter(airportcounts, dest=="SFO") %>% mutate(Date = ymd(paste(year, "-", month, "-01", sep=""))) xyplot(count ~ Date, type=c("p","l"), lwd=2, xlab="Month", ylab="Number of flights\nto SFO per month", data=sfocounts) ``` We observe that there are some interesting patterns over the course of the year for SFO: the number of flights in January, February, and March is considerably less than the other nine months. #### Predictors of delays How is month of flight associated with delays? ```{r, echo=FALSE} sfocounts <- sfoflights %>% mutate(Date = ymd(paste(year, "-", month, "-01", sep=""))) %>% group_by(Date) %>% summarise(count = n(), avgdelay = mean(realdelay)) xyplot(avgdelay ~ Date, type=c("p","l"), lwd=2, xlab="Month", ylab="Average delay of flights\nto SFO per month", data=sfocounts) ``` We see that the largest average delays occur in the summer (with both June and July having an average above 20 minutes). Is there an association between the number of flights in a month and the average delay? ```{r, echo=FALSE} panel.labels <- function(x, y, labels='x',...) { panel.text(x, y, labels, cex=0.8, ...) } xyplot(avgdelay ~ count, type=c("p"), lwd=2, panel=panel.labels, labels=month(sfocounts$Date, label=TRUE), xlab="Number of flights to SFO in that month", ylab="Average delay of flights\nto SFO per month", data=sfocounts) ``` There is not much of a pattern, but the delays seem to be more variable on months with more flights. Another question that travelers might consider is whether the departure time matter as a predictor of flight delays? ```{r, echo=FALSE} xyplot(realdelay ~ deptime, cex=0.3, type=c("p"), xlab="hour (since midnight)", ylab="Flight delay (with recoded cancelled flights)", ylim=c(-60, 250), data=sfoflights) ``` A number of observations can be made from this graphical display. Very few flights depart between midnight and 5:30am. Most flights are on time, but there does appear to be a pattern that more delays occur for flights that are scheduled to depart later in the day. We can improve the display by zooming in and adding a scatterplot smooth. ```{r, echo=FALSE} xyplot(realdelay ~ deptime, cex=0.2, type=c("p", "smooth"), xlab="hour (since midnight)", ylab="Flight delay (with recoded cancelled flights)", ylim=c(-30, 30), data=sfoflights) ``` While there is some indication that delays tend to be more common (and slightly longer) as the day proceeds, the effect is modest for flights to San Francisco. #### Weather Other factors affect airline delays. This might include the weather. The `nycflights13` package in R includes other data scraped from the Internet (in this case detailed weather information). We can display the temperature (in degrees Fahrenheit) on New Year's Day, 2013. ```{r echo=FALSE, results="asis"} xtable(select(weather, hour, hour, dewp, humid, wind_speed, wind_dir, precip, pressure) %>% head()) ``` ```{r echo=FALSE} xyplot(temp ~ hour, type=c("p", "l"), ylab="Temperature (degrees F) on\nNew Year's Day", data=filter(weather, month==1 & day==1)) ``` Let's take a look at daily averages for delays as well as total precipation and maximum wind speed. First we undertake the merge and display a set of values. ```{r, echo=FALSE, results="asis"} avgdelay <- flights %>% group_by(month, day) %>% filter(month < 13) %>% summarise(avgdelay = mean(realdelay, na.rm=TRUE)) precip <- weather %>% group_by(month, day) %>% filter(month < 13) %>% summarise(totprecip = sum(precip), maxwind = max(wind_speed)) precip <- mutate(precip, anyprecip = ifelse(totprecip==0, "No", "Yes")) merged <- left_join(avgdelay, precip, by=c("day", "month")) xtable(head(merged)) ``` A dramatic outlier is immediately spotted: windspeeds of 1000 mph are not common! This must be an error. ```{r, echo=FALSE, results="asis"} xtable(favstats(~ maxwind, data=merged)) ``` ```{r, echo=FALSE, results="asis"} xtable(filter(merged, maxwind > 1000)) ``` Let's remove this outlier and consider the association between any precipiation and average delays. ```{r, echo=FALSE} merged <- filter(merged, maxwind < 1000) bwplot(avgdelay ~ anyprecip, main="Association of delay with any precipitation", data=merged) ``` Precipitation seems to be associated with delays: ```{r, echo=FALSE} xyplot(avgdelay ~ maxwind, type=c("p", "smooth"), xlab="Max wind speed", ylab="Average daily delay\n(including cancellations)", data=merged) ``` Max windspeed also seems to be associated with delays. ```{r, echo=FALSE} xyplot(avgdelay ~ maxwind, groups=anyprecip, auto.key=TRUE, type=c("p", "smooth"), ylab="Average daily delay\n(including cancellations)", data=merged) ``` After stratifying by precipitation status, we see that windspeed does not appear to be a major determinant of delays. Precipitation seems to be the issue. ## Closing thoughts and further resources The dataset (as a series of comma separated variable files), copies of the R Markdown and formatted files for these analyses (to allow replication of the analyses) along with further background on the Airline Delays dataset can be found at http://www.amherst.edu/~nhorton/precursors. Horton, N.J., Baumer, B.S., and Wichham H. (2015) Setting the stage for data science: integration of data management skills in introductory and second courses in statistics", *CHANCE*, 28(2):40-50, http://chance.amstat.org/2015/04/setting-the-stage. Kane, M. Strategies for analyzing a 12-gigabyte data set: airline flight delays (2015) in *Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving*, Nolan D. and Temple Lang D, CRC Press. Wickham, H. (2011). ASA 2009 Data Expo, *Journal of Computational and Graphical Statistics*, 20(2):281-283. #### Appendix In this appendix, the first few rows of each of the datasets is displayed. ```{r, echo=TRUE, eval=TRUE, message=FALSE} airlines airports planes flights weather ``` ```{r echo=FALSE} write.csv(airlines, file="stats101-airlines.csv") write.csv(airports, file="stats101-airports.csv") write.csv(planes, file="stats101-planes.csv") write.csv(flights, file="stats101-flights.csv") write.csv(weather, file="stats101-weather.csv") write.csv(sfoflights, file="stats101-sfoflights.csv") ```