Introduction and background

This document is intended to help describe how to undertake analyses introduced as examples in the Fourth Edition of (2014) by De Veaux, Velleman, and Bock. More information about the book can be found at http://wps.aw.com/aw_deveaux_stats_series. This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/sdm4.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the mosaic package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic).

Chapter 29: Multiple Regression Wisdom

library(mosaic); library(readr)
Coasters <- read_csv("http://nhorton.people.amherst.edu/sdm4/data/Roller_coasters_2014.csv")
glimpse(Coasters)
## Observations: 198
## Variables: 10
## $ Name       <chr> "Top Thrill Dragster", "Superman The Escap", "Mille...
## $ Park       <chr> "Cedar Point", "Six Flags Magic Mountain", "Cedar P...
## $ Track      <chr> "Steel", "Steel", "Steel", "Steel", "Steel", "Steel...
## $ Speed      <chr> "120", "100", "93", "85", "85", "82", "82", "80", "...
## $ Height     <dbl> 420.0, 415.0, 310.0, 235.0, 245.0, 160.0, 205.0, 20...
## $ Drop       <dbl> 400.0, 328.1, 300.0, 255.0, 255.0, 228.0, 130.0, 22...
## $ Length     <dbl> 2800, 1235, 6595, 4500, 5312, 3200, 2202, 5843, 156...
## $ Duration   <int> NA, NA, 165, 180, 210, NA, 62, 163, NA, 240, NA, NA...
## $ Inversions <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, ...
## $ Opened     <int> 2003, 1997, 2000, 2000, 2001, 2001, 2002, 1909, 200...
tally(~ Speed, data=Coasters)
## Speed
##  *100   *56   100   120 149.1  16.2  17.4  21.8  22.4  23.6  25.7  26.7 
##     1     1     1     1     1     1     1     1     2     1     3     1 
##    28  28.6  29.1  32.9  34.2    35    36    37  37.3  38.5   4.5    40 
##     5     1     3     1     1     1     1     1     2     1     1     2 
##  40.4    41    42  43.5  44.7    45    46  46.6    47  47.9  49.7    50 
##     1     1     1     6     2     3     1     3     6     2    11     6 
##  50.1    52  52.8    53  53.7  54.7    55  55.9    56    59    60  60.3 
##     1     3     6     2     1     1    11     3     2     1     1     2 
##    62  62.1    63  64.8    65  65.2  65.3  65.6    66  66.3    67  67.1 
##     4     8     4     1    10     1     7     4     2     1     4     2 
##    68    70  71.5    72    73    74  74.6    75    76    77  78.4    80 
##     4     3     1     1     4     1     1     4     1     1     1     5 
##  80.8    82    83  83.3    85    90    93    95 
##     2     2     1     1     2     1     1     1
filter(Coasters, Speed=="*100" | Speed=="*56")
## # A tibble: 2 × 10
##              Name                          Park Track Speed Height  Drop
##             <chr>                         <chr> <chr> <chr>  <dbl> <dbl>
## 1 Tower of Terror                    Dreamworld Steel  *100  377.3 328.1
## 2      Wild Beast Paramount Canada's Wonderland  Wood   *56  415.0  78.0
## # ... with 4 more variables: Length <dbl>, Duration <int>,
## #   Inversions <int>, Opened <int>

Note that two of the observations for speed have special characters in them. We first need to address these data irregularities:

library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:Matrix':
## 
##     expand
Coasters <- mutate(Coasters, Speed = extract_numeric(Speed))
## extract_numeric() is deprecated: please use readr::parse_numeric() instead

To match the output we need to remove these two coasters.

dim(Coasters)
## [1] 198  10
Coasters <- filter(Coasters, 
  !(Name %in% c("Tower of Terror", "Wild Beast")))
dim(Coasters)
## [1] 196  10

Figure 29.1 (page 860) displays the scatterplot of duration as a function of speed.

xyplot(Duration ~ Speed, type=c("p", "r"), xlab="Speed (mph)",
  ylab="Duration (sec)", data=Coasters)

We can also replicate the regression model (and regression diagnostics) on the bottom of page 860.

mod1 <- lm(Duration ~ Speed, data=Coasters)
msummary(mod1)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.474     12.925    1.58     0.12    
## Speed          1.823      0.219    8.34  8.6e-14 ***
## 
## Residual standard error: 36.4 on 132 degrees of freedom
##   (62 observations deleted due to missingness)
## Multiple R-squared:  0.345,  Adjusted R-squared:  0.34 
## F-statistic: 69.6 on 1 and 132 DF,  p-value: 8.59e-14
xyplot(resid(mod1) ~ fitted(mod1), xlab="Predicted (sec)",
  ylab="Residuals (sec)", type=c("p", "r"))

Section 29.1: Indicators

The model displayed on the bottom of page 861 is restricted to roller coasters with no inversions.

noinversion <- filter(Coasters, Inversions==0)  # 0 is No (or FALSE)
mod2 <- lm(Duration ~ Speed, data=noinversion)
msummary(mod2)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   33.945     13.313    2.55    0.014 *  
## Speed          1.795      0.214    8.38  2.8e-11 ***
## 
## Residual standard error: 31.4 on 53 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.57,   Adjusted R-squared:  0.562 
## F-statistic: 70.2 on 1 and 53 DF,  p-value: 2.79e-11

We can do the same for coasters with inversions:

mod3 <- lm(Duration ~ Speed, data=filter(Coasters, Inversions==1))
msummary(mod3)
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   22.773     27.712    0.82   0.4137   
## Speed          1.635      0.486    3.36   0.0012 **
## 
## Residual standard error: 37.6 on 77 degrees of freedom
##   (32 observations deleted due to missingness)
## Multiple R-squared:  0.128,  Adjusted R-squared:  0.117 
## F-statistic: 11.3 on 1 and 77 DF,  p-value: 0.00121

The model at the top of page 862 adds the Inversion indicator to the model.

mod4 <- lm(Duration ~ Speed + Inversions, data=Coasters)
msummary(mod4)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   35.997     13.343    2.70   0.0079 ** 
## Speed          1.760      0.212    8.31    1e-13 ***
## Inversions   -20.273      6.187   -3.28   0.0013 ** 
## 
## Residual standard error: 35.1 on 131 degrees of freedom
##   (62 observations deleted due to missingness)
## Multiple R-squared:  0.395,  Adjusted R-squared:  0.385 
## F-statistic: 42.7 on 2 and 131 DF,  p-value: 5.24e-15
mod4fun <- makeFun(mod4)

We can generate predicted values from this model.

mod4fun(Speed=55, Inversions=0)  # Hayabusa
##     1 
## 132.8
mod4fun(Speed=60.3, Inversions=1)  # Hangman
##     1 
## 121.9

On page 864 the Burger King example is introduced with a subset of items from the menu.

BK <- read.csv("http://nhorton.people.amherst.edu/sdm4/data/BK_items.csv", 
               stringsAsFactors = FALSE)
BKmod <- lm(Calories ~ Carbs.g. + Meat. + Carbs.g.*Meat., data=BK)
msummary(BKmod)
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       137.40      58.72    2.34   0.0267 * 
## Carbs.g.            3.93       1.11    3.53   0.0014 **
## Meat.Y            -26.16      98.48   -0.27   0.7925   
## Carbs.g.:Meat.Y     7.88       2.18    3.61   0.0012 **
## 
## Residual standard error: 106 on 28 degrees of freedom
## Multiple R-squared:  0.781,  Adjusted R-squared:  0.757 
## F-statistic: 33.2 on 3 and 28 DF,  p-value: 2.32e-09

The same predicted values can be calculated using this model.

BKfun <- makeFun(BKmod)
BKfun(Carbs.g=53, Meat="Y")
##     1 
## 737.1
BKfun(Carbs.g=43, Meat="N")
##     1 
## 306.5

Section 29.2: Diagnosing Multiple Regression Models

Below are 6 plots made using the mplot function. These all work to better inform decisions when interpreting extreme values / the appropriateness of the model. Whilst the normal quantile and residual plots test the assumptions of the multiple regression model, the plots involving Cook’s Distance and leverage help us identify possible outliers.

BK <- mutate(BK, resid = resid(BKmod), fitted=fitted(BKmod),
             student=rstudent(BKmod), CooksD=cooks.distance(BKmod))
mplot(BKmod, which=1)
## [[1]]

mplot(BKmod, which=2)
## [[1]]

mplot(BKmod, which=3)
## [[1]]

mplot(BKmod, which=4)
## [[1]]

favstats(~ CooksD, data=BK)
##        min       Q1  median     Q3    max   mean      sd  n missing
##  6.114e-06 0.001238 0.01197 0.0325 0.4865 0.0426 0.09613 32       0
BK[11,]   # or filter(BK, Carbs.g. > 0.486)
##           Item Calories Protein Total.Fat Carbs.g.   Na.S Meat.  resid
## 11 BK Big Fish      710      24        38       67 4.5627     Y -192.4
##    fitted student CooksD
## 11  902.4  -2.323 0.4865
mplot(BKmod, which=5)
## [[1]]

mplot(BKmod, which=6)
## [[1]]

Section 29.3: Building Multiple Regression Models

Using the Infant Mortality data used in the step by step example on page 874, we can create a linear model of all the variables exept State. To do this we can use . to specify adding all the variables to the model and then -State to remove State from the model.

InfantMortality <- read_csv("http://nhorton.people.amherst.edu/sdm4/data/Infant_Mortality.csv")

lmInfant <- lm(Infantmort ~ . -State, data = InfantMortality)
msummary(lmInfant)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.63168    0.91240    1.79    0.081 .  
## CDR          0.03123    0.01385    2.25    0.029 *  
## HSdrop      -0.09971    0.06105   -1.63    0.110    
## lowBW        0.66103    0.11891    5.56  1.5e-06 ***
## TeenBirths   0.01357    0.02380    0.57    0.571    
## TeenDeaths   0.00556    0.01128    0.49    0.624    
## 
## Residual standard error: 0.752 on 44 degrees of freedom
## Multiple R-squared:  0.713,  Adjusted R-squared:  0.68 
## F-statistic: 21.8 on 5 and 44 DF,  p-value: 6.31e-11
xyplot(rstudent(lmInfant) ~ fitted(lmInfant))

histogram(~ rstudent(lmInfant), width = 0.5, center = 0.25)

The summary output shows that only two of the variables are significant at the 5% level. This could be due to multicollinearity. A correlation matrix will help identify variables that are strongly correlated to one and other.

cor(InfantMortality[,-1])
##            Infantmort    CDR HSdrop  lowBW TeenBirths TeenDeaths
## Infantmort     1.0000 0.6522 0.2536 0.7583     0.5417     0.5214
## CDR            0.6522 1.0000 0.4195 0.4662     0.5917     0.8116
## HSdrop         0.2536 0.4195 1.0000 0.3761     0.7593     0.2982
## lowBW          0.7583 0.4662 0.3761 1.0000     0.6215     0.3160
## TeenBirths     0.5417 0.5917 0.7593 0.6215     1.0000     0.4456
## TeenDeaths     0.5214 0.8116 0.2982 0.3160     0.4456     1.0000

From this we see that HSdrop, TeenBirths and TeenDeaths are correlated. They are also the 3 least significant variables in the full model so it might not be unreasonable to remove them from the model.

lmInfant2 <- lm(Infantmort ~ . -State -TeenDeaths -TeenBirths -HSdrop, data = InfantMortality)
summary(lmInfant2)
## 
## Call:
## lm(formula = Infantmort ~ . - State - TeenDeaths - TeenBirths - 
##     HSdrop, data = InfantMortality)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.587 -0.442 -0.138  0.435  2.435 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.43612    0.71027    2.02  0.04890 *  
## CDR          0.03378    0.00814    4.15  0.00014 ***
## lowBW        0.64630    0.10239    6.31    9e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.757 on 47 degrees of freedom
## Multiple R-squared:  0.689,  Adjusted R-squared:  0.676 
## F-statistic: 52.1 on 2 and 47 DF,  p-value: 1.2e-12

Notice that the \(R^2\) of this final model is 0.689: this is different that in the book. This illustrates how there are a number of approaches when choosing the ‘best’ model and that different models may be better or worse given the constraints/specifications of the problem.