--- title: "IS5 in R: Inferences for Regression (Chapter 20)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "December 13, 2020" output: pdf_document: fig_height: 3 fig_width: 5 html_document: fig_height: 3 fig_width: 5 word_document: fig_height: 4 fig_width: 6 --- ```{r, include = FALSE} # Don't delete this chunk if you are using the mosaic package # This loads the mosaic and dplyr packages library(mosaic) library(readr) library(janitor) ``` ```{r, include = FALSE} # knitr settings to control how R chunks work. require(knitr) opts_chunk$set( tidy = FALSE, # display code as typed size = "small" # slightly smaller font for code ) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. ## Chapter 20: Inferences for Regression ```{r message = FALSE} library(mosaic) library(readr) library(janitor) BodyFat <- read_csv("http://nhorton.people.amherst.edu/is5/data/Bodyfat.csv") %>% janitor::clean_names() ``` By default, `read_csv()` prints the variable names. These messages have been suppressed using the `message=FALSE` code chunk option to save space and improve readability. Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). ```{r} # Figure 20.1, page 642 gf_point(pct_bf ~ waist, data = BodyFat) %>% gf_smooth() %>% # to show linear relationship gf_labs(x = "Waist (in.)", y = "% Body Fat") ``` ### Section 20.1: The Regression Model ```{r} lm(pct_bf ~ waist, data = BodyFat) ``` ```{r} # Figure 20.2 gf_histogram(~pct_bf, data = BodyFat, binwidth = 2.5, center = 1.25) %>% gf_labs(x = "% Body Fat", y = "# of Men") # Figure 20.3 (reinterpreted with points) BodyFat <- BodyFat %>% mutate(roundedwaist = cut(waist, breaks = c(0, 30, 35, 40, 45, Inf), labels = c(1:5))) gf_point(pct_bf ~ jitter(as.numeric(roundedwaist)), data = BodyFat) %>% gf_lm() %>% gf_labs(y = "% Body Fat", x = "Waist") ``` #### Random Matters: Slopes Vary ```{r} numsamp <- 25 # It's too messy to do any more than 25 slopesdata <- do(numsamp) * lm(pct_bf ~ waist, data = resample(BodyFat)) ``` For more information about `resample()`, refer to the [resample vignette in mosaic](https://cran.r-project.org/web/packages/mosaic/vignettes/Resampling.pdf). 
```{r fig.height = 8, fig.width = 7}
slopesdata <- slopesdata %>%
  mutate(
    at27 = Intercept + waist * 27,
    at50 = Intercept + waist * 50,
    color = as.factor(1:25)
  )
# Figure 20.4, page 644
gf_point(pct_bf ~ waist, data = BodyFat) %>%
  gf_segment(at27 + at50 ~ 27 + 50, data = slopesdata, color = ~ color) %>%
  gf_labs(color = "Sample", x = "Waist (in.)", y = "Percent Body Fat")
```

```{r}
numsamp <- 1000 # To see the shape of the histogram
slopesdata <- do(numsamp) * lm(pct_bf ~ waist, data = resample(BodyFat))
# Figure 20.5
gf_histogram(~ waist, data = slopesdata, binwidth = .05, center = .025) %>%
  gf_labs(x = "Slope", y = "# of Trials")
```

For the histogram, we use 1,000 trials.

### Section 20.2: Assumptions and Conditions

```{r}
# Figure 20.6 is the same as Figure 20.1
# Figure 20.7 (page 645)
bodyfatlm <- lm(pct_bf ~ waist, data = BodyFat)
gf_point(resid(bodyfatlm) ~ fitted(bodyfatlm)) %>%
  gf_labs(x = "Predicted", y = "Residuals")
```

```{r message = FALSE}
Diamonds <- read_csv("http://nhorton.people.amherst.edu/is5/data/Diamonds.csv") %>%
  janitor::clean_names()
```

Here we fit `price` by `carat_size` for diamonds with the color `E`.

```{r}
diamondlm <- lm(price ~ carat_size, data = filter(Diamonds, color == "E"))
# Figure 20.8, page 646
gf_point(resid(diamondlm) ~ fitted(diamondlm)) %>%
  gf_labs(x = "Predicted Values", y = "Residuals")
# Figure 20.9
gf_histogram(~ resid(bodyfatlm), binwidth = 2, center = 1) %>%
  gf_labs(x = "Residuals", y = "Count")
```

Figure 20.10 is intended to convey the same idea as Figure 20.3 (page 643).

#### Example 20.1: Checking Assumptions and Conditions

Note that points are removed to match the results in the textbook!

```{r message = FALSE}
Craters <- read_csv("http://nhorton.people.amherst.edu/is5/data/Craters.csv") %>%
  janitor::clean_names() %>%
  filter(log_age <= 1.5) # Removed points to match the textbook
```

```{r}
gf_point(log_diam ~ log_age, data = Craters) %>%
  gf_lm() %>%
  gf_labs(x = "Log(Age)", y = "Log(Diameter)")
craterlm <- lm(log_diam ~ log_age, data = Craters)
gf_point(resid(craterlm) ~ fitted(craterlm)) %>%
  gf_lm() %>%
  gf_labs(x = "Predicted", y = "Residuals")
gf_qq(~ resid(craterlm)) %>%
  gf_qqline(linetype = "solid", color = "red") %>%
  gf_labs(x = "Nscores", y = "Residuals")
gf_histogram(~ resid(craterlm), binwidth = .5, center = 0.25) %>%
  gf_labs(x = "Residuals", y = "# of Craters")
```

#### Step-By-Step Example: Regression Inference

The scatterplot for this example matches Figure 20.1; here we display the normal probability plot of the residuals and the regression output.

```{r}
gf_qq(~ resid(bodyfatlm)) %>%
  gf_qqline(linetype = "solid", color = "red") %>%
  gf_labs(x = "Normal Scores", y = "Residuals (% body fat)")
msummary(bodyfatlm)
```

### Section 20.3: Regression Inference and Intuition

See the displays on pages 650 and 651.
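The intuition behind those displays can also be checked numerically: the standard error of the slope grows with the spread of the residuals and shrinks with the spread of the predictor and the sample size. Here is a sketch (not part of the textbook example) that computes `SE(b1) = s_e / (s_x * sqrt(n - 1))` by hand for the body fat regression; the result can be compared with the slope standard error reported by `msummary(bodyfatlm)`.

```{r}
n <- nrow(BodyFat)
s_e <- sqrt(sum(resid(bodyfatlm)^2) / (n - 2)) # residual standard error
s_x <- sd(~ waist, data = BodyFat) # spread of the predictor
s_e / (s_x * sqrt(n - 1)) # should match the SE for waist in msummary(bodyfatlm)
```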
#### Example 20.2: Confidence Interval and Hypothesis Test for a Slope

```{r}
msummary(bodyfatlm)
mean <- 1.70 # estimated slope for waist
se <- .074 # standard error of the slope
tstats <- qt(p = c(.025, .975), df = 248) # critical values with n - 2 = 248 df
tstats
mean + tstats * se # 95% confidence interval for the slope
t <- (mean - 0.00) / se # t statistic for testing H0: slope = 0
t
```

### Section 20.4: The Regression Table

```{r}
# Table 20.1, page 654
msummary(bodyfatlm)
```

### Section 20.5: Multiple Regression Inference

```{r}
# Table 20.2, page 655
bodyfatmlm <- lm(pct_bf ~ waist + height, data = BodyFat)
msummary(bodyfatmlm)
```

#### Just Checking

```{r message = FALSE}
Mouth <- read_csv("http://nhorton.people.amherst.edu/is5/data/Mouth_volume.csv")
mouthlm <- lm(Mouth_Volume ~ Height, data = Mouth) # simple linear model
df_stats(~ Mouth_Volume, data = Mouth)
msummary(mouthlm)
mouthmlm <- lm(Mouth_Volume ~ Age + Height, data = Mouth) # multiple linear model
msummary(mouthmlm)
```

#### Collinearity

```{r message = FALSE}
Coasters <- read_csv("http://nhorton.people.amherst.edu/is5/data/Coasters_2015.csv")
Coasters <- Coasters %>%
  # Removed Tower of Terror and Xcelerator (artificially accelerated coasters)
  filter(Name != "Tower of Terror", Name != "Xcelerator") %>%
  filter(!is.na(Drop), !is.na(Duration)) %>% # drop coasters with missing values
  mutate(Inversions = as.factor(Inversions))
coasterlm <- lm(Duration ~ Drop, data = Coasters) # simple linear model
msummary(coasterlm)
coastermlm <- lm(Duration ~ Drop + Speed, data = Coasters) # multiple linear regression model
msummary(coastermlm)
```

```{r}
gf_point(Speed ~ Drop, data = Coasters) %>%
  gf_lm()
```

### Section 20.6: Confidence and Prediction Intervals

```{r}
# Figure 20.16, page 659
gf_point(pct_bf ~ waist, data = BodyFat) %>%
  gf_lm(interval = "confidence", fill = "red") %>%
  gf_lm(interval = "prediction", fill = "navy") %>%
  gf_labs(x = "Waist (in.)", y = "% Body Fat")
```

### Section 20.7: Logistic Regression

```{r message = FALSE}
PimaIndians <- read_csv("http://nhorton.people.amherst.edu/is5/data/Pima_indians.csv")
PimaIndians <- PimaIndians %>%
  filter(BMI != 0) # remove records with implausible BMI values of zero
# Figure 20.17, page 661
PimaIndians %>%
  mutate(Diabetes = ifelse(Diabetes == 1, "Diabetic", "Not Diabetic")) %>%
  gf_boxplot(BMI ~ as.factor(Diabetes), xlab = "Diabetes")
# Figure 20.21, page 663
gf_point(Diabetes ~ BMI, data = PimaIndians, ylab = "Diabetes (1 = yes)") %>%
  gf_smooth(method = "glm", method.args = list(family = "binomial"))
```

### Section 20.8: More About Regression