---
title: "SDM4 in R: Multiple Regression Wisdom (Chapter 29)"
author: "Nicholas Horton (nhorton@amherst.edu), Patrick Frenett, and Sarah McDonald"
date: "June 21, 2018"
output: 
  pdf_document:
    fig_height: 2.8
    fig_width: 7
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---


```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
require(mosaic)
options(digits = 4)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
  tidy = FALSE,     # display code as typed
  size = "small"    # slightly smaller font for code
)
```

## Introduction and background 

This document is intended to help describe how to undertake analyses introduced 
as examples in the Fourth Edition of *Stats: Data and Models* (2014) by De Veaux, Velleman, and Bock.
More information about the book can be found at http://wps.aw.com/aw_deveaux_stats_series.  This
file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/sdm4.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

## Chapter 29: Multiple Regression Wisdom


```{r message = FALSE}
library(mosaic)
library(readr)
Coasters <- read_csv("http://nhorton.people.amherst.edu/sdm4/data/Roller_coasters_2014.csv")
glimpse(Coasters)
tally(~ Speed, data = Coasters)
filter(Coasters, Speed == "*100" | Speed == "*56")
```
Note that two of the observations for speed have special characters in them.
We first need to address these data irregularities:
```{r}
library(readr)
Coasters <- mutate(Coasters, Speed = parse_number(Speed))
```
To match the output we need to remove these two coasters.

```{r}
dim(Coasters)
Coasters <- filter(Coasters, 
  !(Name %in% c("Tower of Terror", "Wild Beast")))
dim(Coasters)
```


Figure 29.1 (page 860) displays the scatterplot of duration as a function
of speed.

```{r warning = FALSE}
gf_point(Duration ~ Speed, xlab = "Speed (mph)",
  ylab = "Duration (sec)", data = Coasters) %>%
  gf_lm()
```

We can also replicate the regression model (and regression diagnostics)
on the bottom of page 860.

```{r}
mod1 <- lm(Duration ~ Speed, data = Coasters)
msummary(mod1)
gf_point(resid(mod1) ~ fitted(mod1), xlab = "Predicted (sec)",
  ylab = "Residuals (sec)") %>%
  gf_lm()
```

### Section 29.1: Indicators

The model displayed on the bottom of page 861 is restricted to 
roller coasters with no inversions.
```{r}
noinversion <- filter(Coasters, Inversions == 0)  # 0 is No (or FALSE)
mod2 <- lm(Duration ~ Speed, data = noinversion)
msummary(mod2)
```

We can do the same for coasters with inversions:
```{r}
mod3 <- lm(Duration ~ Speed, data = filter(Coasters, Inversions == 1))
msummary(mod3)
```

The model at the top of page 862 adds the Inversion indicator to the
model.

```{r}
mod4 <- lm(Duration ~ Speed + Inversions, data = Coasters)
msummary(mod4)
mod4fun <- makeFun(mod4)
```

We can generate predicted values from this model.
```{r}
mod4fun(Speed = 55, Inversions = 0)  # Hayabusa
mod4fun(Speed = 60.3, Inversions = 1)  # Hangman
```

On page 864 the Burger King example is introduced with a subset of items from
the menu.
```{r}
BK <- read.csv("http://nhorton.people.amherst.edu/sdm4/data/BK_items.csv", 
               stringsAsFactors = FALSE)
BKmod <- lm(Calories ~ Carbs.g. + Meat. + Carbs.g.*Meat., data = BK)
msummary(BKmod)
```
The same predicted values can be calculated using this model.

```{r}
BKfun <- makeFun(BKmod)
BKfun(Carbs.g = 53, Meat = "Y")
BKfun(Carbs.g = 43, Meat = "N")
```

### Section 29.2: Diagnosing Multiple Regression Models

Below are 6 plots made using the `mplot` function. These all work to better inform decisions when interpreting extreme values / the appropriateness of the model. Whilst the normal quantile and residual plots test the assumptions of the multiple regression model, the plots involving Cook's Distance and leverage help us identify possible outliers. 

```{r message = FALSE}
BK <- mutate(BK, resid = resid(BKmod), fitted = fitted(BKmod),
             student = rstudent(BKmod), CooksD = cooks.distance(BKmod))
mplot(BKmod, which = 1)
mplot(BKmod, which = 2)
mplot(BKmod, which = 3)
mplot(BKmod, which = 4)
favstats(~ CooksD, data = BK)
filter(BK, Carbs.g. > 66)
mplot(BKmod, which = 5)
mplot(BKmod, which = 6)
```

### Section 29.3: Building Multiple Regression Models

Using the Infant Mortality data used in the step by step example on page 874, we can create a linear model of all the variables except `State`. To do this we can use `.` to specify adding all the variables to the model and then `-State` to remove `State` from the model.  (Note that the dataset made available with the textbook does not correspond to the analyses in the book, so it is not possible to completely replicate the analyses in this section.)

```{r message=FALSE}
InfantMortality <- read_csv("http://nhorton.people.amherst.edu/sdm4/data/Infant_Mortality.csv")

lmInfant <- lm(Infantmort ~ . -State, data = InfantMortality)
msummary(lmInfant)
gf_point(rstudent(lmInfant) ~ fitted(lmInfant))
gf_dhistogram(~ rstudent(lmInfant), binwidth = 0.5, center = 0.25)
```

The `summary` output shows that only two of the variables are significant at the 5% level. This could be due to multicollinearity. A correlation matrix will help identify variables that are strongly correlated.

```{r}
select(InfantMortality, -State) %>%
  cor()
```

From this we see that `HSdrop`, `TeenBirths` and `TeenDeaths` are correlated. They are also the three least significant variables in the full model so it might not be unreasonable to remove them from the model.

```{r}
lmInfant2 <- lm(Infantmort ~ . -State -TeenDeaths -TeenBirths -HSdrop, data = InfantMortality)
msummary(lmInfant2)
```

Notice that the $R^2$ of this final model is 0.689: this is different from that reported in the book. 

There are a number of approaches when choosing the 'best' model and that different models may be better or worse given the constraints/specifications of the problem: much of this is left for more advanced courses.