--- title: "IS5 in R: Linear Regression (Chapter 7)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "2025-01-08" date-format: iso format: pdf toc: true editor: source --- ```{r} #| label: setup #| include: false library(mosaic) library(tidyverse) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated Quarto reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. We begin by loading packages that will be required for our analyses. ```{r} library(mosaic) library(tidyverse) ``` ## Chapter 7: Linear Regression ```{r} #| message: false # Figure 7.1 BurgerKing <- read_csv("http://nhorton.people.amherst.edu/is5/data/Burger_King_items.csv") |> janitor::clean_names() ``` By default, `read_csv()` prints the variable names. These messages have been suppressed using the `message: false` code chunk option to save space and improve readability. Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). ```{r} #| message: false gf_point(fat_g ~ protein_g, data = BurgerKing) |> gf_smooth() |> gf_labs(x = "Protein (g)", y = "Fat (g)") ``` Here we add a smoother to show a clearer picture of the relationship. ### Section 7.1: Least Squares: The Line of "Best Fit" See display on page 197. We can calculate the residual for a particular value with 31 grams of protein by creating an function called `burgerfun` using the `mosaic::makeFun()` function. ```{r} burgerlm <- lm(fat_g ~ protein_g, data = BurgerKing) burgerfun <- makeFun(burgerlm) burgerfun(protein_g = 31) ``` ### Section 7.2: The Linear Model ```{r} coef(burgerlm) burgerfun(protein_g = 0) burgerfun(32) - burgerfun(31) ``` ```{r} msummary(burgerlm) ``` #### Example 7.1: A Linear Model for Hurricanes We begin by reading in the data. ```{r} #| message: false Hurricanes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Hurricanes_2015.csv") |> janitor::clean_names() gf_point(max_wind_speed_kts ~ central_pressure_mb, data = Hurricanes) |> gf_lm() |> gf_labs(x = "Central Pressure (mb)", y = "Max Wind Speed (kts)") ``` The function generates a warning because some of the data are missing: this output can (and should!) be suppressed by adding `warning: false` as an option in this code chunk. Later examples will suppress this extraneous output. 
### Section 7.3: Finding the Least Squares Line

#### Example 7.2: Finding the Regression Equation

```{r}
df_stats(~ protein_g, data = BurgerKing)
df_stats(~ fat_g, data = BurgerKing)
sx <- sd(~ protein_g, data = BurgerKing)
sx
sy <- sd(~ fat_g, data = BurgerKing)
sy
r <- cor(protein_g ~ fat_g, data = BurgerKing)
r # same as cor(fat_g ~ protein_g, data = BurgerKing)!
r * sy / sx
coef(burgerlm)[2] # pulls off the second coefficient (the slope)
```

#### Step-by-Step Example: Calculating a Regression Equation

We begin by loading the bridge dataset.

```{r}
#| message: false
TompkinsBridges <- read_csv("http://nhorton.people.amherst.edu/is5/data/Tompkins_county_bridges_2016.csv") |>
  janitor::clean_names()
gf_point(condition ~ age_at_inspection, data = TompkinsBridges) |>
  gf_smooth() |> # to show the relationship
  gf_labs(x = "Age at Inspection", y = "Condition")
```

See calculations on page 203.

### Section 7.4: Regression to the Mean

See Figure 7.4 on page 205 to visualize standard deviations.

### Section 7.5: Examining the Residuals

```{r}
msummary(burgerlm)
# Figure 7.5, page 207
gf_point(resid(burgerlm) ~ protein_g, data = BurgerKing) |>
  gf_lm() |>
  gf_labs(x = "Protein (g)", y = "Residuals (g fat)")
# Figure 7.6
gf_histogram(~ resid(burgerlm), binwidth = 10, center = 5) |>
  gf_labs(x = "Residuals (g fat)", y = "# of Residuals")
```

### Section 7.6: $R^2$: The Proportion of Variation Accounted for by the Model

```{r}
rsquared(burgerlm)
```

### Section 7.7: Regression Assumptions and Conditions
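The conditions for regression (quantitative variables, a straight-enough relationship, no extreme outliers, and consistent spread around the line) are typically checked with plots like those above. As a minimal sketch (this figure is our addition, not one of the book's displays), a residual-versus-fitted plot for the BurgerKing model can check the linearity and equal-spread conditions:

```{r}
# Residuals versus fitted values: we hope to see no bends (linearity)
# and roughly constant vertical spread (equal variance); illustrative sketch
gf_point(resid(burgerlm) ~ fitted(burgerlm)) |>
  gf_hline(yintercept = ~ 0, linetype = "dashed") |>
  gf_labs(x = "Fitted values (g fat)", y = "Residuals (g fat)")
```

For a simple regression this mirrors Figure 7.5: the fitted values are a linear function of protein, so the two plots display the same pattern.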