---
title: "IS5 in R: Multiple Regression (Chapter 9)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "December 17, 2020"
output: 
  pdf_document:
    fig_height: 3
    fig_width: 5
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---


```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
library(mosaic)
library(readr)
library(janitor)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
  tidy = FALSE, # display code as typed
  size = "small" # slightly smaller font for code
)
```

## Introduction and background 

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock.
This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.
In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses.
A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

## Chapter 9: Multiple Regression

```{r message = FALSE}
library(mosaic)
library(readr)
library(janitor)
library(broom) # We'll use this for augment() later
BodyFat <- read_csv("http://nhorton.people.amherst.edu/is5/data/Bodyfat.csv") %>%
  janitor::clean_names()
```

By default, `read_csv()` prints the variable names.
These messages have been suppressed using the `message=FALSE` code chunk option to save space and improve readability. 
Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace).   

```{r}
# Figure 9.1, page 276
gf_point(pct_bf ~ waist, data = BodyFat) %>%
  gf_labs(x = "Waist", y = "% Body Fat") %>%
  gf_smooth()
```

We've added `gf_smooth()` to demonstrate how to add a smoother.  

### Section 9.1: What is Multiple Regression?

```{r}
# Table 9.1, page 277
multiplereg <- lm(pct_bf ~ waist + height, data = BodyFat)
summary(multiplereg)
```

The `summary()` function provides the multiple R-squared along with the regression coefficients.  

#### Example 9.1: Modeling Home Prices

```{r message = FALSE}
RealEstate <- read_csv("http://nhorton.people.amherst.edu/is5/data/Real_Estate.csv") %>%
  janitor::clean_names()
realestatelm <- lm(price ~ living_area + bedrooms, data = RealEstate)
summary(realestatelm)
```

```{r}
# Predicted Values
realestatefn <- makeFun(realestatelm) # Making a function to find predicted values
# Predicted price for a home with 2800 sq ft living area and 5 bedrooms
realestatefn(living_area = 2800, bedrooms = 5)
# Predicted price for a home with 2801 sq ft living area and 5 bedrooms
realestatefn(living_area = 2801, bedrooms = 5)
# If we subtract predicted values one value apart, we get the slope
realestatefn(living_area = 2801, bedrooms = 5) - realestatefn(living_area = 2800, bedrooms = 5)
```

### Section 9.2: Interpreting Multiple Regression Coefficients

```{r message = FALSE}
# Figure 9.2, page 279
gf_point(pct_bf ~ height, data = BodyFat) %>%
  gf_smooth() %>% # Added a smoother to assess linearity
  gf_labs(x = "Height (in)", y = "% Body Fat")
```

A message about the default smoother option was suppressed by adding `message = FALSE` as a code chunk option.

```{r}
# Figure 9.3
BodyFat %>%
  filter(waist >= 36 & waist <= 38) %>% # Just plotting waist sizes between 36 and 38 inches
  gf_point(pct_bf ~ height) %>%
  gf_labs(x = "Height (in)", y = "% Body Fat") %>%
  gf_lm()
```

```{r, fig.width=7}
# Plotting all points
BodyFat %>%
  mutate(waistsize = ifelse(waist >= 36 & waist <= 38, "Between 36 and 38 inches",
    "Not Between 36 and 38 inches"
  )) %>% # Subsetting
  gf_point(pct_bf ~ height, shape = ~waistsize, color = ~waistsize) %>%
  gf_labs(x = "Height (in)", y = "% Body Fat", shape = "Waist size", color = "Waist size") %>%
  gf_lm()
```

### Section 9.3: The Multiple Regression Model--Assumptions and Conditions

#### Linearity Assumption

#### Equal Variance Assumption

We can assess the equal variance assumption in several ways.
The simplest is through a scatterplot of residuals vs. fitted values.

```{r}
bodyfatlm <- lm(pct_bf ~ waist + height, data = BodyFat)
# Figure 9.4, page 282
gf_point(resid(bodyfatlm) ~ fitted(bodyfatlm)) %>%
  gf_lm() %>%
  gf_labs(x = "Predicted", y = "Residuals")
```

#### Check the Residuals

It's important to look at the residuals to see if the "Nearly Normal" condition is reasonable to assume.

```{r}
# Figure 9.5
gf_histogram(~ resid(bodyfatlm), binwidth = 1.5, center = 0.75) %>%
  gf_labs(x = "Residuals", y = "Counts")
gf_qq(~ resid(bodyfatlm)) %>%
  gf_qqline(linetype = "solid", color = "red") %>%
  gf_labs(x = "Normal Scores", y = "Residuals (% Body Fat)")
```

#### Step-By-Step Example: Multiple Regression

We begin by reading in the data for the step-by-step example.

```{r message = FALSE}
HousingPrices <- read_csv("http://nhorton.people.amherst.edu/is5/data/Housing_prices.csv") %>%
  janitor::clean_names()
gf_point(price ~ living_area, data = HousingPrices) %>%
  gf_smooth() %>%
  gf_labs(x = "Living Area", y = "Price")
gf_boxplot(price ~ as.factor(bedrooms), data = HousingPrices) %>%
  gf_labs(x = "Bedrooms", y = "Price")
```

```{r}
gf_point(price ~ age, data = HousingPrices) %>%
  gf_smooth() %>%
  gf_labs(x = "Age", y = "Price")
gf_point(price ~ log(age + 1), data = HousingPrices) %>%
  gf_smooth() %>%
  gf_labs(x = "LogAge", y = "Price")
housinglm <- lm(price ~ log(age + 1), data = HousingPrices)
gf_point(resid(housinglm) ~ fitted(housinglm)) %>%
  gf_smooth() %>%
  gf_labs(x = "Predicted Values", y = "Residuals")
housinglm2 <- lm(price ~ living_area + log(age + 1) + bedrooms, data = HousingPrices)
msummary(housinglm2)
```

### Section 9.4: Partial Regression Plots

```{r}
# Figure 9.6 (instructions on 287)
# Step 1
otherthanheightlm <- lm(pct_bf ~ waist, data = BodyFat)
# Step 2
residualsoflm <- resid(otherthanheightlm)
# Step 3
yheightlm <- lm(height ~ waist, data = BodyFat)
# Step 4
residualsoflm2 <- resid(yheightlm)
# Step 5
gf_point(residualsoflm ~ residualsoflm2) %>%
  gf_lm() %>%
  gf_labs(x = "Height Residuals (in.)", y = "% Body Fat Residuals")
```

#### Just Checking

```{r message = FALSE}
Hurricanes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Hurricanes_2015.csv") %>%
  janitor::clean_names()
hurricanelm <- lm(max_wind_speed_kts ~ year + central_pressure_mb, data = Hurricanes)
msummary(hurricanelm)
```

### Section 9.5: Indicator Variables

```{r coaster, message = FALSE}
Coasters <- read_csv("http://nhorton.people.amherst.edu/is5/data/Coasters_2015.csv")
# Table 9.2, page 288
head(Coasters)
# Figure 9.7
# Tower of Terror isn't included by the book
Coasters <- Coasters %>%
  filter(Name != "Tower of Terror") %>%
  mutate(Inversions = as.factor(Inversions))
```

```{r warning = FALSE}
gf_point(Duration ~ Drop, data = Coasters) %>%
  gf_lm()
```

```{r error}
coasterlm <- lm(Duration ~ Drop, data = Coasters)
gf_point(resid(coasterlm) ~ fitted(coasterlm)) %>%
  gf_labs(x = "Predicted", y = "Residuals")
msummary(coasterlm)
# Figure 9.8
gf_point(Duration ~ Drop, color = ~Inversions, data = Coasters) %>%
  gf_lm() %>%
  gf_labs(color = "Inversions")
```

```{r}
coasterlm2 <- lm(Duration ~ Drop + Inversions, data = Coasters)
msummary(coasterlm2)
```

```{r}
coasterlm2asdata <- augment(coasterlm2)
glance(coasterlm2) %>% data.frame()
```

```{r}
gf_point(.resid ~ .fitted, color = ~Inversions, data = coasterlm2asdata)
```

The `augment()` function creates a data frame from a linear model that includes a column for residuals, fitted values, etc. Here we use `names()` to check out the column names and `glance()` to view the structure of the data set.  

#### Example 9.3: Using Indicator Variables

We can explore the use of indicator variables to model categorical variables.

```{r message = FALSE}
DirtBikes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Dirt_bikes_2014.csv")
DirtBikes <- DirtBikes %>%
  filter(Cooling != "NA") %>%
  mutate(Cooling = ifelse(Cooling == "Air-Cooled", "Air-Cooled", "LiquidCooled"))
gf_point(MSRP ~ (Displacement)^(1 / 3), color = ~Cooling, data = DirtBikes) %>%
  gf_lm()
bikeslm <- lm(MSRP ~ I(Displacement^(1 / 3)) + Cooling, data = DirtBikes)
msummary(bikeslm)
```

The `I()` function is used to keep the class of an object the same. Here we use it to keep the variable `Displacement` "as is" to prevent an error.  

#### Adjusting for Different Slopes

We can fit a model with different slopes.

```{r message = FALSE}
BurgerKing <- read_csv("http://nhorton.people.amherst.edu/is5/data/Burger_King_items.csv") %>%
  clean_names()
# Figure 9.9, page 292
gf_point(calories ~ carbs_g, data = BurgerKing) %>%
  gf_labs(x = "Carbs (g)", y = "Calories")
# Figure 9.10
gf_point(calories ~ carbs_g, color = ~ as.factor(meat), data = BurgerKing) %>%
  gf_labs(x = "Carbs (g)", y = "Calories", color = "Meat") %>%
  gf_lm()
msummary(lm(calories ~ carbs_g * as.factor(meat), data = BurgerKing))
```

#### One, Two, Many

We can also consider three level variables.

```{r cereal1, message = FALSE}
Cereal <- read_csv("http://nhorton.people.amherst.edu/is5/data/Cereals.csv")
cereallm <- lm(sugars ~ sodium + as.factor(shelf), data = Cereal)
gf_point(sugars ~ sodium, color = ~ as.factor(shelf), data = Cereal) %>%
  gf_lm() %>%
  gf_labs(x = "Sodium", y = "Sugars", color = "Shelf")
msummary(cereallm)
```

#### Example 9.4: Indicators for Variables with Several Levels

We will read in the diamonds data.

```{r diamonds, message = FALSE, warning = FALSE}
Diamonds <- read_csv("http://nhorton.people.amherst.edu/is5/data/Diamonds.csv") %>%
  clean_names()

# Parallel Slopes
diamondlm <- lm(sqrt(price) ~ carat_size + color, data = Diamonds)
msummary(diamondlm)
diamondpredict <- makeFun(diamondlm)

diamonddata <- augment(diamondlm) %>% # To get fitted values
  clean_names()
str(diamonddata)
gf_point(sqrt_price ~ carat_size, color = ~color, data = diamonddata) %>%
  gf_line(fitted ~ carat_size) %>%
  gf_labs(x = "Carat Size", y = "sqrt(Price)") +
  ylim(30, 100)
```

```{r warning = FALSE}
# With interaction
diamondlm2 <- lm(sqrt(price) ~ carat_size * color, data = Diamonds)
msummary(diamondlm2)
gf_point(sqrt(price) ~ carat_size, color = ~color, data = Diamonds) %>%
  gf_lm() %>%
  gf_labs(x = "Carat Size", y = "sqrt(Price)") +
  ylim(30, 100)
```