--- title: "SDM4 in R: Multiple Regression (Chapter 28)" author: "Nicholas Horton (nhorton@amherst.edu) and Sarah McDonald" date: "June 21, 2018" output: pdf_document: fig_height: 3 fig_width: 6 html_document: fig_height: 3 fig_width: 5 word_document: fig_height: 4 fig_width: 6 --- ```{r, include = FALSE} # Don't delete this chunk if you are using the mosaic package # This loads the mosaic and dplyr packages require(mosaic) require(GGally) ``` ```{r, include = FALSE} # knitr settings to control how R chunks work. require(knitr) opts_chunk$set( tidy = FALSE, # display code as typed size = "small" # slightly smaller font for code ) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fourth Edition of *Stats: Data and Models* (2014) by De Veaux, Velleman, and Bock. More information about the book can be found at http://wps.aw.com/aw_deveaux_stats_series. This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/sdm4. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. ## Chapter 28: Multiple Regression ### Section 28.1: What is multiple regression? The table on page 818 displays the results from the multiple regression model. ```{r message = FALSE} library(mosaic) library(readr) options(digits = 3) BodyFat <- read_csv("http://nhorton.people.amherst.edu/sdm4/data/Body_fat_complete.csv") BodyFatmod <- lm(PctBF ~ waist + Height, data = BodyFat) msummary(BodyFatmod) ``` We can use this model to generate predicted values. ```{r} BodyFatfun <- makeFun(BodyFatmod) BodyFatfun(waist = 0, Height = 0) # returns intercept BodyFatfun(waist = 30, Height = 70) -3.101 + 1.773*30 - 0.602*70 ``` ### Section 28.2: Interpreting multiple regression coefficients Figure 28.1 on page 819 displays the scatterplot of percent body fat against height. ```{r} gf_point(PctBF ~ Height, data = BodyFat) %>% gf_lm() ``` Figure 28.2 (page 820) displays the scatterplot for a subset of the data (men with waist sizes between 36 and 38 inches). ```{r} gf_point(PctBF ~ Height, data = filter(BodyFat, waist > 36, waist < 38)) %>% gf_lm() ``` Figure 28.3 (page 820) displays the partial regression plot for weight. ```{r} BodyFatwaist <- lm(PctBF ~ waist, data = BodyFat) BodyFatheight <- lm(Height ~ waist, data = BodyFat) gf_point(resid(BodyFatwaist) ~ resid(BodyFatheight), ylab = "% body fat residuals", xlab = "Height residuals") %>% gf_lm() ``` ### Section 28.3: The multiple regression model (assumptions and conditions) Figure 28.4 (page 822) displays scatterplots of residuals vs. height and waist, respectively. ```{r} gf_point(resid(BodyFatmod) ~ Height, data = BodyFat) %>% gf_lm() gf_point(resid(BodyFatmod) ~ waist, data = BodyFat) %>% gf_lm() ``` Figure 28.5 (page 823) displays histogram and qq plot of the residuals. ```{r} gf_dhistogram(~ resid(BodyFatmod), data = BodyFat) %>% gf_fitdistr(dist = "dnorm") gf_qq(~ resid(BodyFatmod)) ``` ### Section 28.4: Multiple regression inference Figure 28.6 (page 829) displays the scatterplot matrix infant mortality data. ```{r, fig.height = 6, message = FALSE} InfantMortality <- read_csv("http://nhorton.people.amherst.edu/sdm4/data/Infant_Mortality.csv") pairs(select(InfantMortality, - State)) ``` In addition, we display a scatterplot matrix for the motivating example from the chapter (BodyFat) using the `GGally` package. ```{r, message = FALSE, fig.height = 6} subsetBodyFat <- select(BodyFat, PctBF, Height, waist) library(GGally) ggpairs(subsetBodyFat) ``` ### Section 28.5: Comparing multiple regression models We may want to compare which of our models provides the most parsimonious fit to these data. ```{r} msummary(BodyFatheight) msummary(BodyFatwaist) msummary(BodyFatmod) ``` The adjusted R-squared value of 0.711 is considerably higher for the model with both predictors (though the model with just waist has an adjusted R-squared value of 0.677).