--- title: "IS5 in R: Multiple Regression (Chapter 9)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "December 17, 2020" output: pdf_document: fig_height: 3 fig_width: 5 html_document: fig_height: 3 fig_width: 5 word_document: fig_height: 4 fig_width: 6 --- ```{r, include = FALSE} # Don't delete this chunk if you are using the mosaic package # This loads the mosaic and dplyr packages library(mosaic) library(readr) library(janitor) ``` ```{r, include = FALSE} # knitr settings to control how R chunks work. require(knitr) opts_chunk$set( tidy = FALSE, # display code as typed size = "small" # slightly smaller font for code ) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. ## Chapter 9: Multiple Regression ```{r message = FALSE} library(mosaic) library(readr) library(janitor) library(broom) # We'll use this for augment() later BodyFat <- read_csv("http://nhorton.people.amherst.edu/is5/data/Bodyfat.csv") %>% janitor::clean_names() ``` By default, `read_csv()` prints the variable names. These messages have been suppressed using the `message=FALSE` code chunk option to save space and improve readability. Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). ```{r} # Figure 9.1, page 276 gf_point(pct_bf ~ waist, data = BodyFat) %>% gf_labs(x = "Waist", y = "% Body Fat") %>% gf_smooth() ``` We've added `gf_smooth()` to demonstrate how to add a smoother. ### Section 9.1: What is Multiple Regression? ```{r} # Table 9.1, page 277 multiplereg <- lm(pct_bf ~ waist + height, data = BodyFat) summary(multiplereg) ``` The `summary()` function provides the multiple R-squared along with the regression coefficients. #### Example 9.1: Modeling Home Prices ```{r message = FALSE} RealEstate <- read_csv("http://nhorton.people.amherst.edu/is5/data/Real_Estate.csv") %>% janitor::clean_names() realestatelm <- lm(price ~ living_area + bedrooms, data = RealEstate) summary(realestatelm) ``` ```{r} # Predicted Values realestatefn <- makeFun(realestatelm) # Making a function to find predicted values # Predicted price for a home with 2800 sq ft living area and 5 bedrooms realestatefn(living_area = 2800, bedrooms = 5) # Predicted price for a home with 2801 sq ft living area and 5 bedrooms realestatefn(living_area = 2801, bedrooms = 5) # If we subtract predicted values one value apart, we get the slope realestatefn(living_area = 2801, bedrooms = 5) - realestatefn(living_area = 2800, bedrooms = 5) ``` ### Section 9.2: Interpreting Multiple Regression Coefficients ```{r message = FALSE} # Figure 9.2, page 279 gf_point(pct_bf ~ height, data = BodyFat) %>% gf_smooth() %>% # Added a smoother to assess linearity gf_labs(x = "Height (in)", y = "% Body Fat") ``` A message about the default smoother option was suppressed by adding `message = FALSE` as a code chunk option. ```{r} # Figure 9.3 BodyFat %>% filter(waist >= 36 & waist <= 38) %>% # Just plotting waist sizes between 36 and 38 inches gf_point(pct_bf ~ height) %>% gf_labs(x = "Height (in)", y = "% Body Fat") %>% gf_lm() ``` ```{r, fig.width=7} # Plotting all points BodyFat %>% mutate(waistsize = ifelse(waist >= 36 & waist <= 38, "Between 36 and 38 inches", "Not Between 36 and 38 inches" )) %>% # Subsetting gf_point(pct_bf ~ height, shape = ~waistsize, color = ~waistsize) %>% gf_labs(x = "Height (in)", y = "% Body Fat", shape = "Waist size", color = "Waist size") %>% gf_lm() ``` ### Section 9.3: The Multiple Regression Model--Assumptions and Conditions #### Linearity Assumption #### Equal Variance Assumption We can assess the equal variance assumption in several ways. The simplest is through a scatterplot of residuals vs. fitted values. ```{r} bodyfatlm <- lm(pct_bf ~ waist + height, data = BodyFat) # Figure 9.4, page 282 gf_point(resid(bodyfatlm) ~ fitted(bodyfatlm)) %>% gf_lm() %>% gf_labs(x = "Predicted", y = "Residuals") ``` #### Check the Residuals It's important to look at the residuals to see if the "Nearly Normal" condition is reasonable to assume. ```{r} # Figure 9.5 gf_histogram(~ resid(bodyfatlm), binwidth = 1.5, center = 0.75) %>% gf_labs(x = "Residuals", y = "Counts") gf_qq(~ resid(bodyfatlm)) %>% gf_qqline(linetype = "solid", color = "red") %>% gf_labs(x = "Normal Scores", y = "Residuals (% Body Fat)") ``` #### Step-By-Step Example: Multiple Regression We begin by reading in the data for the step-by-step example. ```{r message = FALSE} HousingPrices <- read_csv("http://nhorton.people.amherst.edu/is5/data/Housing_prices.csv") %>% janitor::clean_names() gf_point(price ~ living_area, data = HousingPrices) %>% gf_smooth() %>% gf_labs(x = "Living Area", y = "Price") gf_boxplot(price ~ as.factor(bedrooms), data = HousingPrices) %>% gf_labs(x = "Bedrooms", y = "Price") ``` ```{r} gf_point(price ~ age, data = HousingPrices) %>% gf_smooth() %>% gf_labs(x = "Age", y = "Price") gf_point(price ~ log(age + 1), data = HousingPrices) %>% gf_smooth() %>% gf_labs(x = "LogAge", y = "Price") housinglm <- lm(price ~ log(age + 1), data = HousingPrices) gf_point(resid(housinglm) ~ fitted(housinglm)) %>% gf_smooth() %>% gf_labs(x = "Predicted Values", y = "Residuals") housinglm2 <- lm(price ~ living_area + log(age + 1) + bedrooms, data = HousingPrices) msummary(housinglm2) ``` ### Section 9.4: Partial Regression Plots ```{r} # Figure 9.6 (instructions on 287) # Step 1 otherthanheightlm <- lm(pct_bf ~ waist, data = BodyFat) # Step 2 residualsoflm <- resid(otherthanheightlm) # Step 3 yheightlm <- lm(height ~ waist, data = BodyFat) # Step 4 residualsoflm2 <- resid(yheightlm) # Step 5 gf_point(residualsoflm ~ residualsoflm2) %>% gf_lm() %>% gf_labs(x = "Height Residuals (in.)", y = "% Body Fat Residuals") ``` #### Just Checking ```{r message = FALSE} Hurricanes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Hurricanes_2015.csv") %>% janitor::clean_names() hurricanelm <- lm(max_wind_speed_kts ~ year + central_pressure_mb, data = Hurricanes) msummary(hurricanelm) ``` ### Section 9.5: Indicator Variables ```{r coaster, message = FALSE} Coasters <- read_csv("http://nhorton.people.amherst.edu/is5/data/Coasters_2015.csv") # Table 9.2, page 288 head(Coasters) # Figure 9.7 # Tower of Terror isn't included by the book Coasters <- Coasters %>% filter(Name != "Tower of Terror") %>% mutate(Inversions = as.factor(Inversions)) ``` ```{r warning = FALSE} gf_point(Duration ~ Drop, data = Coasters) %>% gf_lm() ``` ```{r error} coasterlm <- lm(Duration ~ Drop, data = Coasters) gf_point(resid(coasterlm) ~ fitted(coasterlm)) %>% gf_labs(x = "Predicted", y = "Residuals") msummary(coasterlm) # Figure 9.8 gf_point(Duration ~ Drop, color = ~Inversions, data = Coasters) %>% gf_lm() %>% gf_labs(color = "Inversions") ``` ```{r} coasterlm2 <- lm(Duration ~ Drop + Inversions, data = Coasters) msummary(coasterlm2) ``` ```{r} coasterlm2asdata <- augment(coasterlm2) glance(coasterlm2) %>% data.frame() ``` ```{r} gf_point(.resid ~ .fitted, color = ~Inversions, data = coasterlm2asdata) ``` The `augment()` function creates a data frame from a linear model that includes a column for residuals, fitted values, etc. Here we use `names()` to check out the column names and `glance()` to view the structure of the data set. #### Example 9.3: Using Indicator Variables We can explore the use of indicator variables to model categorical variables. ```{r message = FALSE} DirtBikes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Dirt_bikes_2014.csv") DirtBikes <- DirtBikes %>% filter(Cooling != "NA") %>% mutate(Cooling = ifelse(Cooling == "Air-Cooled", "Air-Cooled", "LiquidCooled")) gf_point(MSRP ~ (Displacement)^(1 / 3), color = ~Cooling, data = DirtBikes) %>% gf_lm() bikeslm <- lm(MSRP ~ I(Displacement^(1 / 3)) + Cooling, data = DirtBikes) msummary(bikeslm) ``` The `I()` function is used to keep the class of an object the same. Here we use it to keep the variable `Displacement` "as is" to prevent an error. #### Adjusting for Different Slopes We can fit a model with different slopes. ```{r message = FALSE} BurgerKing <- read_csv("http://nhorton.people.amherst.edu/is5/data/Burger_King_items.csv") %>% clean_names() # Figure 9.9, page 292 gf_point(calories ~ carbs_g, data = BurgerKing) %>% gf_labs(x = "Carbs (g)", y = "Calories") # Figure 9.10 gf_point(calories ~ carbs_g, color = ~ as.factor(meat), data = BurgerKing) %>% gf_labs(x = "Carbs (g)", y = "Calories", color = "Meat") %>% gf_lm() msummary(lm(calories ~ carbs_g * as.factor(meat), data = BurgerKing)) ``` #### One, Two, Many We can also consider three level variables. ```{r cereal1, message = FALSE} Cereal <- read_csv("http://nhorton.people.amherst.edu/is5/data/Cereals.csv") cereallm <- lm(sugars ~ sodium + as.factor(shelf), data = Cereal) gf_point(sugars ~ sodium, color = ~ as.factor(shelf), data = Cereal) %>% gf_lm() %>% gf_labs(x = "Sodium", y = "Sugars", color = "Shelf") msummary(cereallm) ``` #### Example 9.4: Indicators for Variables with Several Levels We will read in the diamonds data. ```{r diamonds, message = FALSE, warning = FALSE} Diamonds <- read_csv("http://nhorton.people.amherst.edu/is5/data/Diamonds.csv") %>% clean_names() # Parallel Slopes diamondlm <- lm(sqrt(price) ~ carat_size + color, data = Diamonds) msummary(diamondlm) diamondpredict <- makeFun(diamondlm) diamonddata <- augment(diamondlm) %>% # To get fitted values clean_names() str(diamonddata) gf_point(sqrt_price ~ carat_size, color = ~color, data = diamonddata) %>% gf_line(fitted ~ carat_size) %>% gf_labs(x = "Carat Size", y = "sqrt(Price)") + ylim(30, 100) ``` ```{r warning = FALSE} # With interaction diamondlm2 <- lm(sqrt(price) ~ carat_size * color, data = Diamonds) msummary(diamondlm2) gf_point(sqrt(price) ~ carat_size, color = ~color, data = Diamonds) %>% gf_lm() %>% gf_labs(x = "Carat Size", y = "sqrt(Price)") + ylim(30, 100) ```