--- title: "IS5 in R: Regression Wisdom (Chapter 8)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "December 19, 2020" output: pdf_document: fig_height: 3 fig_width: 5 html_document: fig_height: 3 fig_width: 5 word_document: fig_height: 4 fig_width: 6 --- ```{r, include = FALSE} # Don't delete this chunk if you are using the mosaic package # This loads the mosaic and dplyr packages library(mosaic) library(readr) library(janitor) ``` ```{r, include = FALSE} # knitr settings to control how R chunks work. require(knitr) opts_chunk$set( tidy = FALSE, # display code as typed size = "small" # slightly smaller font for code ) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. ## Chapter 8: Regression Wisdom ```{r} library(mosaic) library(readr) library(janitor) ``` ### Section 8.1: Examining Residuals #### Getting the "Bends": When the Residuals Aren't Straight We begin by reading in the data. ```{r message = FALSE} Penguins <- read_csv("http://nhorton.people.amherst.edu/is5/data/Penguins.csv") %>% janitor::clean_names() ``` By default, `read_csv()` prints the variable names. These messages have been suppressed using the `message=FALSE` code chunk option to save space and improve readability. Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). ```{r } # Figure 8.1, page 234 gf_point(dive_heart_rate ~ duration_min, data = Penguins) %>% gf_labs(x = "Duration (min)", y = "Dive Heart Rate (bpm)") penguinlm <- lm(dive_heart_rate ~ duration_min, data = Penguins) # Figure 8.2 gf_point(resid(penguinlm) ~ duration_min, data = Penguins) %>% gf_labs(x = "Duration (min)", y = "Residuals") ``` #### Sifting Residuals for Groups We begin by reading in the data. ```{r message = FALSE} Cereal <- read_csv("http://nhorton.people.amherst.edu/is5/data/Cereals.csv") cereallm <- lm(calories ~ sugars, data = Cereal) # Figure 8.3, page 235 gf_histogram(~ resid(cereallm), binwidth = 7.5, center = 7.5 / 2) %>% gf_labs(x = "Residuals (calories)", y = "# of Cereals") Cereal <- Cereal %>% mutate( residuals = resid(cereallm), fitted = fitted(cereallm) ) %>% mutate(resid_cat = ifelse(residuals >= -30 & residuals <= 25, "mid", ifelse(residuals > 25, "high", "low") )) # For color categories # Figure 8.4 gf_point(jitter(residuals) ~ jitter(fitted), color = ~resid_cat, data = Cereal) ``` Jitter adds some random noise to allow easier observation of values that are shared by more than one type of breakfast cereal. ```{r} tally(~shelf, data = Cereal) Cereal <- mutate(Cereal, shelfgroup = recode(shelf, `1` = "Bottom Shelf", `2` = "Middle Shelf", `3` = "Top Shelf" ) ) ``` The `recode()` function allows for efficient mutation of levels. ```{r} tally(~shelfgroup, data = Cereal) # Figure 8.5 gf_point(jitter(calories) ~ sugars, color = ~shelfgroup, data = Cereal) %>% gf_lm() %>% gf_labs(x = "Sugar (g)", y = "Calories") ``` ### Section 8.2: Extrapolation: Reaching Beyond the Data See displays on page 237 and 238. #### Example 8.1: Extrapolation: Reaching Beyond the Data We begin by reading in the data. ```{r message = FALSE} MarriageAge <- read_csv("http://nhorton.people.amherst.edu/is5/data/Marriage_age_2015.csv") MarriageAge <- MarriageAge %>% mutate(timeperiod = ifelse(Year <= 1940, "Section1", ifelse(Year >= 1980, "Section3", "Section2"))) gf_point(Men ~ Year, color = ~timeperiod, data = MarriageAge, ylab = "Men's Age at First Marriage") + guides(color = FALSE) ``` The `ifelse()` function works similarly to `recode()` when only two levels exist. ### Section 8.3: Outliers, Leverage, and Influence We begin by reading in the data. ```{r message = FALSE} Election2000 <- read_csv("http://nhorton.people.amherst.edu/is5/data/Election_2000.csv") withlm <- lm(Buchanan ~ Nader, data = Election2000) msummary(withlm) withoutlm <- lm(Buchanan ~ Nader, data = filter(Election2000, Buchanan <= 3000)) msummary(withoutlm) # Figure 8.10, page 241 gf_point(Buchanan ~ Nader, data = Election2000) %>% gf_lm(color = "red") %>% gf_labs(x = "Nader (votes)", y = "Buchanan (votes)") %>% gf_fun(withoutlm, color = "blue") # adds line for model without outlier ``` See page 242 for example of high-leverage point. ### Section 8.4: Lurking Variables with Causation The Doctors data can help with thinking in a multivariate way. ```{r message = FALSE, warning = FALSE} Doctors <- read_csv("http://nhorton.people.amherst.edu/is5/data/Doctors_and_life_expectancy.csv") %>% janitor::clean_names() # Figure 8.13, page 243 gf_point(life_exp ~ sqrt_doctors_person, data = Doctors) %>% gf_labs(x = "Sqrt (Doctors/Person)", y = "Life Expectancy (yr)") # Figure 8.14 gf_point(life_exp ~ sqrt_tv_person, data = Doctors) %>% gf_labs(x = "Sqrt (TVs/Person)", y = "Live Expectancy (yr)") ``` #### Example 8.2: Using Several of These Methods Together We can use these approaches together, as seen in the following example. ```{r message = FALSE} # page 244 DirtBikes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Dirt_bikes_2014.csv") gf_point(MSRP ~ Displacement, data = DirtBikes) bikeslm <- lm(MSRP ~ Displacement, data = DirtBikes) gf_point(resid(bikeslm) ~ fitted(bikeslm)) %>% gf_labs(x = "Predicted Values", y = "Residuals") ``` ```{r, fig.width=7} DirtBikes %>% filter(Cooling != "NA") %>% mutate(Cooling = ifelse(Cooling == "Air-Cooled", "Air-Cooled", "LiquidCooled")) %>% gf_point(MSRP ~ (Displacement)^(1 / 3), color = ~Cooling) %>% gf_lm() ``` ### Section 8.5: Working with Summary Values We can also work with summary values. ```{r message = FALSE} HeightsWeights <- read_csv("http://nhorton.people.amherst.edu/is5/data/Heights_and_weights.csv") # Figure 8.15, page 246 gf_point(Weight ~ Height, data = HeightsWeights) %>% gf_lm() # Figure 8.16 HeightWeightSum <- df_stats(Weight ~ Height, data = HeightsWeights) head(HeightWeightSum) gf_point(mean ~ Height, data = HeightWeightSum) %>% gf_lm() %>% gf_labs(x = "Height", y = "Mean Weights") ``` Alternately, we can use `group_by()` and `summarise()` together to find summary values of data: ```{r} HeightsWeights %>% group_by(Height) %>% summarise(MeanWeight = mean(Weight), .groups = "drop") %>% gf_point(MeanWeight ~ Height) %>% gf_lm() ``` ### Section 8.6: Straightening Scatterplots--The Three Goals We begin by reading in the data. ```{r message = FALSE} FuelEfficiency <- read_csv("http://nhorton.people.amherst.edu/is5/data/Fuel_efficiency.csv") %>% janitor::clean_names() # Figure 8.17 gf_point(city_mpg ~ weight, data = filter(FuelEfficiency, city_mpg <= 40)) %>% gf_labs(x = "Weight (lb)", y = "Fuel Efficiency (mpg)") fuellm <- lm(city_mpg ~ weight, data = filter(FuelEfficiency, city_mpg <= 40)) gf_point(resid(fuellm) ~ fitted(fuellm)) %>% gf_labs(x = "Predicted (mpg)", y = "Residuals (mpg)") ``` ```{r warning = FALSE} FuelEfficiency <- FuelEfficiency %>% mutate(fuel_consumption = (1 / city_mpg) * 100) # Figure 8.19, page 247 gf_point(fuel_consumption ~ weight, data = FuelEfficiency) %>% gf_labs(x = "Weight (lb)", y = "Fuel Consumption (gal/100 mi)") fuellm2 <- lm(fuel_consumption ~ weight, data = FuelEfficiency) gf_point(resid(fuellm2) ~ fitted(fuellm2)) %>% gf_labs(x = "Predicted (gal/100 mi)", y = "Residuals (gal/100 mi)") ``` #### Goals of Re-Expression for Regression We use the hand dexterity data to illustrate re-expressions. ```{r, fig.width=7, message = FALSE, warning = FALSE} HandDexterity <- read_csv("http://nhorton.people.amherst.edu/is5/data/Hand_dexterity.csv") %>% janitor::clean_names() %>% mutate(dominant = ifelse(dominant == 0, "Dominant hand", "Non-dominant hand")) %>% mutate(dominant = as.factor(dominant)) # Figure 8.20, page 248 gf_point(time_sec ~ age_yr, color = ~dominant, data = HandDexterity) %>% gf_labs(x = "Age (yr)", y = "Time (sec)", color = "") HandDexterity <- HandDexterity %>% mutate(speed = 1 / time_sec) # Figure 8.21 gf_point(speed ~ age_yr, color = ~dominant, data = HandDexterity) %>% gf_lm() %>% gf_labs(x = "Age (yr)", y = "Speed", color = "") ``` ```{r} handlm <- lm(speed ~ age_yr, data = HandDexterity) # Figure 8.22 gf_histogram(~ resid(handlm), binwidth = .0075, center = .0075 / 2) %>% gf_labs(x = "Residuals", y = "") ``` ### Section 8.7: Finding a Good Re-expression See the table and Figure 8.23 on page 250. #### Step-By-Step Example: Re-Expressing to Straighten a Scatterplot We can explore different re-expressions. ```{r} gf_point(dive_heart_rate ~ duration_min, data = Penguins, xlab = "Duration", ylab = "DHR") gf_point(resid(penguinlm) ~ fitted(penguinlm), data = Penguins) %>% gf_labs(x = "Predicted Values", y = "Residuals") gf_histogram(~ resid(penguinlm), binwidth = 10, center = 5) %>% gf_labs(x = "Residuals", y = "# of Residuals") ``` Mutating with the square root: ```{r} gf_point((dive_heart_rate)^(1 / 2) ~ duration_min, data = Penguins) %>% gf_labs(x = "Duration", y = "sqrt(DHR)") ``` Mutating with a log: ```{r} gf_point(log(dive_heart_rate) ~ duration_min, data = Penguins) %>% gf_labs(x = "Duration", y = "Log (DHR)") penguinlm2 <- lm(log(dive_heart_rate) ~ duration_min, data = Penguins) gf_point(resid(penguinlm2) ~ fitted(penguinlm2)) %>% gf_labs(x = "Predicted Values", y = "Residuals") gf_histogram(~ resid(penguinlm2), binwidth = 0.1, center = .025) %>% gf_labs(x = "Residuals", y = "# of Residuals") ``` Mutating with a (negative) reciprocal: ```{r} gf_point(-1 / (dive_heart_rate) ~ duration_min, data = Penguins) %>% gf_labs(x = "Duration", y = "-1/DHR") penguinlm3 <- lm(-1 / (dive_heart_rate) ~ duration_min, data = Penguins) gf_point(resid(penguinlm3) ~ fitted(penguinlm3)) %>% gf_labs(x = "Predicted Values", y = "Residuals") gf_histogram(~ resid(penguinlm3), binwidth = 0.0025, center = 0.00125) %>% gf_labs(x = "Residuals", y = "# of Residuals") ``` The $-1/\sqrt{DHR}$ transformation follows the same process on pages 253-254.