--- title: "IS5 in R: Multiple Regression (Chapter 9)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "2025-01-20" date-format: iso format: pdf toc: true editor: source --- ```{r} #| label: setup #| include: false library(mosaic) library(tidyverse) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated Quarto reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. We begin by loading packages that will be required for our analyses. ```{r} library(mosaic) library(tidyverse) library(broom) # We'll use this for augment() later ``` ## Chapter 9: Multiple Regression ```{r} #| message: false BodyFat <- read_csv("http://nhorton.people.amherst.edu/is5/data/Bodyfat.csv") |> janitor::clean_names() ``` By default, `read_csv()` prints the variable names. These messages have been suppressed using the `message: false` code chunk option to save space and improve readability. Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). ```{r} #| message: false # Figure 9.1, page 276 gf_point(pct_bf ~ waist, data = BodyFat) |> gf_labs(x = "Waist", y = "% Body Fat") |> gf_smooth() ``` We've added `gf_smooth()` to demonstrate how to add a smoother. ### Section 9.1: What is Multiple Regression? ```{r} # Table 9.1, page 277 multiplereg <- lm(pct_bf ~ waist + height, data = BodyFat) summary(multiplereg) ``` The `summary()` function provides the multiple R-squared along with the regression coefficients. ```{r} msummary(multiplereg) ``` The `msummary()` function in the `mosaic` package provides a pruned version of the same output. ```{r} broom::tidy(multiplereg) ``` The `tidy()` function in the `broom` package provides similar information as a tibble/dataframe. #### Example 9.1: Modeling Home Prices ```{r} #| message: false RealEstate <- read_csv("http://nhorton.people.amherst.edu/is5/data/Real_Estate.csv") |> janitor::clean_names() realestatelm <- lm(price ~ living_area + bedrooms, data = RealEstate) summary(realestatelm) ``` Here we demonstrate how to create a function in R that can be used to calculate predicted values from a regression model. ```{r} # Predicted Values realestatefn <- makeFun(realestatelm) # Making a function to find predicted values # Predicted price for a home with 2800 sq ft living area and 5 bedrooms realestatefn(living_area = 2800, bedrooms = 5) # Predicted price for a home with 2801 sq ft living area and 5 bedrooms realestatefn(living_area = 2801, bedrooms = 5) # If we subtract predicted values one value apart, we get the slope realestatefn(living_area = 2801, bedrooms = 5) - realestatefn(living_area = 2800, bedrooms = 5) ``` ### Section 9.2: Interpreting Multiple Regression Coefficients ```{r} #| message: false # Figure 9.2, page 279 gf_point(pct_bf ~ height, data = BodyFat) |> gf_smooth() |> # Added a smoother to assess linearity gf_labs(x = "Height (in)", y = "% Body Fat") ``` A message about the default smoother option was suppressed by adding `message: false` as a code chunk option. ```{r} #| warning: false # Figure 9.3 BodyFat |> filter(waist >= 36 & waist <= 38) |> # Just plotting waist sizes between 36 and 38 inches gf_point(pct_bf ~ height) |> gf_labs(x = "Height (in)", y = "% Body Fat") |> gf_lm() ``` ```{r, fig.width=7} # Plotting all points BodyFat |> mutate(waistsize = ifelse(waist >= 36 & waist <= 38, "Between 36 and 38 inches", "Not Between 36 and 38 inches" )) |> # Subsetting gf_point(pct_bf ~ height, shape = ~ waistsize, color = ~ waistsize) |> gf_labs( x = "Height (in)", y = "% Body Fat", shape = "Waist size", color = "Waist size" ) |> gf_lm() ``` ### Section 9.3: The Multiple Regression Model--Assumptions and Conditions #### Linearity Assumption #### Equal Variance Assumption We can assess the equal variance assumption in several ways. The simplest is through a scatterplot of residuals vs. fitted values. ```{r} bodyfatlm <- lm(pct_bf ~ waist + height, data = BodyFat) # Figure 9.4, page 282 gf_point(resid(bodyfatlm) ~ fitted(bodyfatlm)) |> gf_lm() |> gf_labs(x = "Predicted", y = "Residuals") ``` #### Check the Residuals It's important to look at the residuals to see if the "Nearly Normal" condition is reasonable. ```{r} # Figure 9.5 gf_histogram(~ resid(bodyfatlm), binwidth = 1.5, center = 0.75) |> gf_labs(x = "Residuals", y = "Counts") gf_qq(~ resid(bodyfatlm)) |> gf_qqline(linetype = "solid", color = "red") |> gf_labs(x = "Normal Scores", y = "Residuals (% Body Fat)") ``` #### Step-By-Step Example: Multiple Regression We begin by reading in the data for the step-by-step example. ```{r} #| message: false HousingPrices <- read_csv("http://nhorton.people.amherst.edu/is5/data/Housing_prices.csv") |> janitor::clean_names() gf_point(price ~ living_area, data = HousingPrices) |> gf_smooth() |> gf_labs(x = "Living Area", y = "Price") ``` For this and other plots the y axis labels would be far easier to read if the values were rescaled. Here we demonstrate this but continue to mirror the book output for the other displays. ```{r} #| message: false HousingRescaled <- HousingPrices |> mutate(price1000 = price / 1000) gf_point(price1000 ~ living_area, data = HousingRescaled) |> gf_smooth() |> gf_labs(x = "Living Area (sf)", y = "Price (in thousands)") ``` ```{r} #| message: false gf_boxplot(price ~ as.factor(bedrooms), data = HousingPrices) |> gf_labs(x = "Bedrooms", y = "Price") ``` ```{r} #| message: false gf_point(price ~ age, data = HousingPrices) |> gf_smooth() |> gf_labs(x = "Age", y = "Price") gf_point(price ~ log(age + 1), data = HousingPrices) |> gf_smooth() |> gf_labs(x = "LogAge", y = "Price") housinglm <- lm(price ~ log(age + 1), data = HousingPrices) gf_point(resid(housinglm) ~ fitted(housinglm)) |> gf_smooth() |> gf_labs(x = "Predicted Values", y = "Residuals") housinglm2 <- lm(price ~ living_area + log(age + 1) + bedrooms, data = HousingPrices) msummary(housinglm2) ``` ### Section 9.4: Partial Regression Plots ```{r} # Figure 9.6 (instructions on 287) # Step 1 otherthanheightlm <- lm(pct_bf ~ waist, data = BodyFat) # Step 2 residualsoflm <- resid(otherthanheightlm) # Step 3 yheightlm <- lm(height ~ waist, data = BodyFat) # Step 4 residualsoflm2 <- resid(yheightlm) # Step 5 gf_point(residualsoflm ~ residualsoflm2) |> gf_lm() |> gf_labs(x = "Height Residuals (in.)", y = "% Body Fat Residuals") ``` #### Just Checking ```{r} #| message: false Hurricanes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Hurricanes_2015.csv") |> janitor::clean_names() hurricanelm <- lm(max_wind_speed_kts ~ year + central_pressure_mb, data = Hurricanes) msummary(hurricanelm) ``` ### Section 9.5: Indicator Variables ```{r} #| label: coaster #| message: false Coasters <- read_csv("http://nhorton.people.amherst.edu/is5/data/Coasters_2015.csv") # Table 9.2, page 288 head(Coasters) # Figure 9.7 # Tower of Terror isn't included by the book, so we need to drop it Coasters <- Coasters |> filter(Name != "Tower of Terror") |> mutate(Inversions = as.factor(Inversions)) # turn the variable into a factor ``` ```{r} #| warning: false gf_point(Duration ~ Drop, data = Coasters) |> gf_lm() ``` ```{r error} coasterlm <- lm(Duration ~ Drop, data = Coasters) gf_point(resid(coasterlm) ~ fitted(coasterlm)) |> gf_labs(x = "Predicted", y = "Residuals") msummary(coasterlm) # Figure 9.8 gf_point(Duration ~ Drop, color = ~Inversions, data = Coasters) |> gf_lm() |> gf_labs(color = "Inversions") ``` Here it would be appropriate to add `warning: false` as a code chunk option once we've verified that there are indeed 150 observations missing ```{r} coasterlm2 <- lm(Duration ~ Drop + Inversions, data = Coasters) msummary(coasterlm2) ``` ```{r} coasterlm2asdata <- broom::augment(coasterlm2) # another helpful function broom::glance(coasterlm2) |> data.frame() ``` ```{r} gf_point(.resid ~ .fitted, color = ~ Inversions, data = coasterlm2asdata) ``` The `augment()` function from the `broom` package creates a data frame from a linear model that includes a column for residuals, fitted values, etc. Here we use `names()` to check out the column names and `glance()` to view the structure of the data set. #### Example 9.3: Using Indicator Variables We can explore the use of indicator variables to model categorical variables. ```{r} #| message: false DirtBikes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Dirt_bikes_2014.csv") DirtBikes <- DirtBikes |> filter(Cooling != "NA") |> mutate(Cooling = ifelse(Cooling == "Air-Cooled", "Air-Cooled", "LiquidCooled")) gf_point(MSRP ~ (Displacement)^(1 / 3), color = ~ Cooling, data = DirtBikes) |> gf_lm() bikeslm <- lm(MSRP ~ I(Displacement^(1 / 3)) + Cooling, data = DirtBikes) msummary(bikeslm) ``` The `I()` function is used to keep the class of an object the same. Here we use it to keep the variable `Displacement` "as is" to prevent an error. #### Adjusting for Different Slopes We can fit a model with different slopes. ```{r} #| message: false BurgerKing <- read_csv("http://nhorton.people.amherst.edu/is5/data/Burger_King_items.csv") |> janitor::clean_names() # Figure 9.9, page 292 gf_point(calories ~ carbs_g, data = BurgerKing) |> gf_labs(x = "Carbs (g)", y = "Calories") # Figure 9.10 gf_point(calories ~ carbs_g, color = ~ as.factor(meat), data = BurgerKing) |> gf_labs(x = "Carbs (g)", y = "Calories", color = "Meat") |> gf_lm() msummary(lm(calories ~ carbs_g * as.factor(meat), data = BurgerKing)) ``` The output here is a bit ugly: it would be straightforward to create the new variables using `mutate()` to provide easier to read output. #### One, Two, Many We can also consider three level variables. ```{r} #| label: cereal1 #| message: false Cereal <- read_csv("http://nhorton.people.amherst.edu/is5/data/Cereals.csv") cereallm <- lm(sugars ~ sodium + as.factor(shelf), data = Cereal) gf_point(sugars ~ sodium, color = ~ as.factor(shelf), data = Cereal) |> gf_lm() |> gf_labs(x = "Sodium", y = "Sugars", color = "Shelf") msummary(cereallm) ``` #### Example 9.4: Indicators for Variables with Several Levels We will read in the diamonds data. ```{r} #| label: diamonds #| message: false #| warning: false Diamonds <- read_csv("http://nhorton.people.amherst.edu/is5/data/Diamonds.csv") |> janitor::clean_names() # Parallel Slopes diamondlm <- lm(sqrt(price) ~ carat_size + color, data = Diamonds) msummary(diamondlm) diamondpredict <- makeFun(diamondlm) diamonddata <- augment(diamondlm) |> # To get fitted values janitor::clean_names() glimpse(diamonddata) gf_point(sqrt_price ~ carat_size, color = ~ color, data = diamonddata) |> gf_line(fitted ~ carat_size) |> gf_labs(x = "Carat Size", y = "sqrt(Price)") + ylim(30, 100) ``` ```{r warning = FALSE} # With interaction diamondlm2 <- lm(sqrt(price) ~ carat_size * color, data = Diamonds) msummary(diamondlm2) gf_point(sqrt(price) ~ carat_size, color = ~ color, data = Diamonds) |> gf_lm() |> gf_labs(x = "Carat Size", y = "sqrt(Price)") + ylim(30, 100) ```