---
title: "IS5 in R: Multiple Regression (Chapter 9)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "2025-03-07"
date-format: iso
format: pdf
toc: true
editor: source
---

```{r}
#| label: setup
#| include: false
library(mosaic)   
library(tidyverse)
```


## Introduction and background 

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock.
This file as well as the associated Quarto reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.
In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses.
A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

We begin by loading packages that will be required for our analyses.

```{r}
library(mosaic)
library(tidyverse)
library(broom) # We'll use this for augment() later
```


## Chapter 9: Multiple Regression

```{r}
#| message: false
BodyFat <- read_csv("http://nhorton.people.amherst.edu/is5/data/Bodyfat.csv") |>
  janitor::clean_names()
```

By default, `read_csv()` prints the variable names.
These messages have been suppressed using the `message: false` code chunk option to save space and improve readability. 
Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace).   

```{r}
#| message: false
# Figure 9.1, page 276
gf_point(pct_bf ~ waist, data = BodyFat) |>
  gf_labs(x = "Waist", y = "% Body Fat") |>
  gf_smooth()
```

We've added `gf_smooth()` to demonstrate how to add a smoother.  

### Section 9.1: What is Multiple Regression?

```{r}
# Table 9.1, page 277
multiplereg <- lm(pct_bf ~ waist + height, data = BodyFat)
summary(multiplereg)
```

The `summary()` function provides the multiple R-squared along with the regression coefficients.  

```{r}
msummary(multiplereg)
```

The `msummary()` function in the `mosaic` package provides a pruned version of the same output.

```{r}
broom::tidy(multiplereg)
```

The `tidy()` function in the `broom` package provides similar information as a tibble/dataframe.


#### Example 9.1: Modeling Home Prices

```{r}
#| message: false
RealEstate <- read_csv("http://nhorton.people.amherst.edu/is5/data/Real_Estate.csv") |>
  janitor::clean_names()
realestatelm <- lm(price ~ living_area + bedrooms, data = RealEstate)
summary(realestatelm)
```

Here we demonstrate how to create a function in R that can be used to calculate predicted values from a regression model.

```{r}
# Predicted Values
realestatefn <- makeFun(realestatelm) # Making a function to find predicted values
# Predicted price for a home with 2800 sq ft living area and 5 bedrooms
realestatefn(living_area = 2800, bedrooms = 5)
# Predicted price for a home with 2801 sq ft living area and 5 bedrooms
realestatefn(living_area = 2801, bedrooms = 5)
# If we subtract predicted values one value apart, we get the slope
realestatefn(living_area = 2801, bedrooms = 5) - 
  realestatefn(living_area = 2800, bedrooms = 5)
```

### Section 9.2: Interpreting Multiple Regression Coefficients

```{r}
#| message: false
# Figure 9.2, page 279
gf_point(pct_bf ~ height, data = BodyFat) |>
  gf_smooth() |> # Added a smoother to assess linearity
  gf_labs(x = "Height (in)", y = "% Body Fat")
```

A message about the default smoother option was suppressed by adding `message: false` as a code chunk option.

```{r}
#| warning: false
# Figure 9.3
BodyFat |>
  filter(waist >= 36 & waist <= 38) |> # Just plotting waist sizes between 36 and 38 inches
  gf_point(pct_bf ~ height) |>
  gf_labs(x = "Height (in)", y = "% Body Fat") |>
  gf_lm()
```

```{r, fig.width=7}
# Plotting all points
BodyFat |>
  mutate(waistsize = ifelse(waist >= 36 & waist <= 38, "Between 36 and 38 inches",
    "Not Between 36 and 38 inches"
  )) |> # Subsetting
  gf_point(pct_bf ~ height, shape = ~ waistsize, color = ~ waistsize) |>
  gf_labs(
    x = "Height (in)", 
    y = "% Body Fat", 
    shape = "Waist size", 
    color = "Waist size"
  ) |>
  gf_lm()
```

### Section 9.3: The Multiple Regression Model--Assumptions and Conditions

#### Linearity Assumption

#### Equal Variance Assumption

We can assess the equal variance assumption in several ways.
The simplest is through a scatterplot of residuals vs. fitted values.

```{r}
#| message: false
bodyfatlm <- lm(pct_bf ~ waist + height, data = BodyFat)
mosaic::msummary(bodyfatlm)
# Figure 9.4, page 282
gf_point(resid(bodyfatlm) ~ fitted(bodyfatlm)) |>
  gf_lm(linetype = 2, color = "black") |>
  gf_smooth(color = "green") |>
  gf_labs(x = "Predicted", y = "Residuals")
```

#### Check the Residuals

It's important to look at the residuals to see if the "Nearly Normal" condition is reasonable.

```{r}
# Figure 9.5
gf_dhistogram(~ resid(bodyfatlm), binwidth = 1.5, center = 0.75) |>
  gf_dist("norm", color = "red", sd = 4.46) |> # see residual SE
  gf_labs(x = "Residuals", y = "Counts")
gf_qq(~ resid(bodyfatlm)) |>
  gf_qqline(linetype = "solid", color = "red") |>
  gf_labs(x = "Normal Scores", y = "Residuals (% Body Fat)")
```

Alternatively we can generate the QQ plot using the `mplot()` function.

```{r}
#| message: false
mplot(bodyfatlm, which = 2)
```

#### Step-By-Step Example: Multiple Regression

We begin by reading in the data for the step-by-step example.

```{r}
#| message: false
HousingPrices <- 
  read_csv("http://nhorton.people.amherst.edu/is5/data/Housing_prices.csv") |>
  janitor::clean_names()
gf_point(price ~ living_area, data = HousingPrices) |>
  gf_smooth() |>
  gf_labs(x = "Living Area", y = "Price")
```

For this and other plots the y axis labels would be far easier to read if the values were rescaled.
Here we demonstrate this but continue to mirror the book output for the other displays.

```{r}
#| message: false
HousingRescaled <- HousingPrices |>
  mutate(price1000 = price / 1000)
gf_point(price1000 ~ living_area, data = HousingRescaled) |>
  gf_smooth() |>
  gf_labs(x = "Living Area (sf)", y = "Price (in thousands)")
```


```{r}
#| message: false
gf_boxplot(price ~ as.factor(bedrooms), data = HousingPrices) |>
  gf_labs(x = "Bedrooms", y = "Price")
```

```{r}
#| message: false
gf_point(price ~ age, data = HousingPrices) |>
  gf_smooth() |>
  gf_labs(x = "Age", y = "Price")
gf_point(price ~ log(age + 1), data = HousingPrices) |>
  gf_smooth() |>
  gf_labs(x = "LogAge", y = "Price")
```


```{r}
#| message: false
housinglm <- lm(price ~ log(age + 1), data = HousingPrices)
gf_point(resid(housinglm) ~ fitted(housinglm)) |>
  gf_smooth() |>
  gf_labs(x = "Predicted Values", y = "Residuals")
```

A similar plot can be generated using `mplot()`:

```{r}
#| message: false
mplot(housinglm, which = 1)
```

Let's fit the model described on page 285.

```{r}
housinglm2 <- lm(price ~ living_area + log(age + 1) + bedrooms, data = HousingPrices)
msummary(housinglm2)
```

```{r}
#| message: false
mplot(housinglm2, which = 1)
```

```{r}
#| message: false
mplot(housinglm2, which = 2)
```

We see that the NEARLY NORMAL condition isn't well satisfied: there are heavy tails particularly for higher priced homes.

### Section 9.4: Partial Regression Plots

```{r}
# Figure 9.6 (instructions on 287)
# Step 1
otherthanheightlm <- lm(pct_bf ~ waist, data = BodyFat)
# Step 2
residualsoflm <- resid(otherthanheightlm)
# Step 3
yheightlm <- lm(height ~ waist, data = BodyFat)
# Step 4
residualsoflm2 <- resid(yheightlm)
# Step 5
gf_point(residualsoflm ~ residualsoflm2) |>
  gf_lm() |>
  gf_labs(x = "Height Residuals (in.)", y = "% Body Fat Residuals")
```

#### Just Checking

```{r}
#| message: false
Hurricanes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Hurricanes_2015.csv") |>
  janitor::clean_names()
hurricanelm <- lm(max_wind_speed_kts ~ year + central_pressure_mb, data = Hurricanes)
msummary(hurricanelm)
```

### Section 9.5: Indicator Variables

```{r}
#| label: coaster
#| message: false 
Coasters <- read_csv("http://nhorton.people.amherst.edu/is5/data/Coasters_2015.csv")
# Table 9.2, page 288
head(Coasters)
# Figure 9.7
# Tower of Terror isn't included by the book, so we need to drop it
Coasters <- Coasters |>
  filter(Name != "Tower of Terror") |>
  mutate(Inversions = as.factor(Inversions)) # turn the variable into a factor
```

```{r}
#| warning: false
gf_point(Duration ~ Drop, data = Coasters) |>
  gf_lm()
```

```{r error}
coasterlm <- lm(Duration ~ Drop, data = Coasters)
gf_point(resid(coasterlm) ~ fitted(coasterlm)) |>
  gf_labs(x = "Predicted", y = "Residuals")
msummary(coasterlm)
# Figure 9.8
gf_point(Duration ~ Drop, color = ~Inversions, data = Coasters) |>
  gf_lm() |>
  gf_labs(color = "Inversions")
```

Here it would be appropriate to add `warning: false` as a code chunk option once we've verified that there are indeed 150 observations missing 

```{r}
coasterlm2 <- lm(Duration ~ Drop + Inversions, data = Coasters)
msummary(coasterlm2)
```

```{r}
coasterlm2asdata <- broom::augment(coasterlm2) # another helpful function
broom::glance(coasterlm2) |> data.frame()
```

```{r}
gf_point(.resid ~ .fitted, color = ~ Inversions, data = coasterlm2asdata)
```

The `augment()` function from the `broom` package creates a data frame from a linear model that includes a column for residuals, fitted values, etc. 
Here we use `names()` to check out the column names and `glance()` to view the structure of the data set.  

#### Example 9.3: Using Indicator Variables

We can explore the use of indicator variables to model categorical variables.

```{r}
#| message: false
DirtBikes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Dirt_bikes_2014.csv")
DirtBikes <- DirtBikes |>
  filter(Cooling != "NA") |>
  mutate(Cooling = ifelse(Cooling == "Air-Cooled", "Air-Cooled", "LiquidCooled"))
gf_point(MSRP ~ (Displacement)^(1 / 3), color = ~ Cooling, data = DirtBikes) |>
  gf_lm()
bikeslm <- lm(MSRP ~ I(Displacement^(1 / 3)) + Cooling, data = DirtBikes)
msummary(bikeslm)
```

The `I()` function is used to keep the class of an object the same. Here we use it to keep the variable `Displacement` "as is" to prevent an error.  

#### Adjusting for Different Slopes

We can fit a model with different slopes.

```{r}
#| message: false
BurgerKing <- 
  read_csv("http://nhorton.people.amherst.edu/is5/data/Burger_King_items.csv") |>
  janitor::clean_names()
# Figure 9.9, page 292
gf_point(calories ~ carbs_g, data = BurgerKing) |>
  gf_labs(x = "Carbs (g)", y = "Calories")
# Figure 9.10
gf_point(calories ~ carbs_g, color = ~ as.factor(meat), data = BurgerKing) |>
  gf_labs(x = "Carbs (g)", y = "Calories", color = "Meat") |>
  gf_lm()
msummary(lm(calories ~ carbs_g * as.factor(meat), data = BurgerKing))
```

The output here is a bit ugly: it would be straightforward to create the new variables using `mutate()` to provide easier to read output.

#### One, Two, Many

We can also consider three level variables.

```{r}
#| label: cereal1
#| message: false
Cereal <- read_csv("http://nhorton.people.amherst.edu/is5/data/Cereals.csv")
cereallm <- lm(sugars ~ sodium + as.factor(shelf), data = Cereal)
gf_point(sugars ~ sodium, color = ~ as.factor(shelf), data = Cereal) |>
  gf_lm() |>
  gf_labs(x = "Sodium", y = "Sugars", color = "Shelf")
msummary(cereallm)
```

#### Example 9.4: Indicators for Variables with Several Levels

We will read in the diamonds data.

```{r}
#| label: diamonds
#| message: false
#| warning: false
Diamonds <- read_csv("http://nhorton.people.amherst.edu/is5/data/Diamonds.csv") |>
  janitor::clean_names()

# Parallel Slopes
diamondlm <- lm(sqrt(price) ~ carat_size + color, data = Diamonds)
msummary(diamondlm)
diamondpredict <- makeFun(diamondlm)

diamonddata <- augment(diamondlm) |> # To get fitted values
  janitor::clean_names()
glimpse(diamonddata)
gf_point(sqrt_price ~ carat_size, color = ~ color, data = diamonddata) |>
  gf_line(fitted ~ carat_size) |>
  gf_labs(x = "Carat Size", y = "sqrt(Price)") +
  ylim(30, 100)
```

```{r}
#| warning: false
# With interaction
diamondlm2 <- lm(sqrt(price) ~ carat_size * color, data = Diamonds)
msummary(diamondlm2)
gf_point(sqrt(price) ~ carat_size, color = ~ color, data = Diamonds) |>
  gf_lm() |>
  gf_labs(x = "Carat Size", y = "sqrt(Price)") +
  ylim(30, 100)
```