---
title: "IS5 in R: Linear Regression (Chapter 7)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "December 19, 2020"
output:
  pdf_document:
    fig_height: 3
    fig_width: 5
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---

```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
library(mosaic)
library(readr)
library(janitor)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
  tidy = FALSE, # display code as typed
  size = "small" # slightly smaller font for code
)
```

## Introduction and background

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science, and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

## Chapter 7: Linear Regression

```{r message = FALSE}
library(mosaic)
library(readr)
library(janitor)
# Figure 7.1
BurgerKing <- read_csv("http://nhorton.people.amherst.edu/is5/data/Burger_King_items.csv") %>%
  janitor::clean_names()
```

By default, `read_csv()` prints the variable names.
These messages have been suppressed using the `message = FALSE` code chunk option to save space and improve readability. Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace).

```{r message = FALSE}
gf_point(fat_g ~ protein_g, data = BurgerKing) %>%
  gf_smooth() %>%
  gf_labs(x = "Protein (g)", y = "Fat (g)")
```

Here we add a smoother to show a clearer picture of the relationship.

### Section 7.1: Least Squares: The Line of "Best Fit"

See display on page 197.

We can calculate the residual for an item with 31 grams of protein by creating a function called `burgerfun` using the `mosaic::makeFun()` function.

```{r}
burgerlm <- lm(fat_g ~ protein_g, data = BurgerKing)
burgerfun <- makeFun(burgerlm)
burgerfun(protein_g = 31)
```

### Section 7.2: The Linear Model

```{r}
coef(burgerlm)
burgerfun(protein_g = 0)
burgerfun(32) - burgerfun(31)
```

```{r}
msummary(burgerlm)
```

#### Example 7.1: A Linear Model for Hurricanes

We begin by reading in the data.

```{r message = FALSE}
Hurricanes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Hurricanes_2015.csv") %>%
  janitor::clean_names()
gf_point(max_wind_speed_kts ~ central_pressure_mb, data = Hurricanes) %>%
  gf_lm() %>%
  gf_labs(x = "Central Pressure (mb)", y = "Max Wind Speed (kts)")
```

The function generates a warning because some of the data are missing: this output can (and should!) be suppressed by adding `warning = FALSE` as an option in this code chunk. Later examples will suppress this extraneous output.

### Section 7.3: Finding the Least Squares Line

#### Example 7.2: Finding the Regression Equation

```{r}
df_stats(~protein_g, data = BurgerKing)
df_stats(~fat_g, data = BurgerKing)
sx <- sd(~protein_g, data = BurgerKing)
sx
sy <- sd(~fat_g, data = BurgerKing)
sy
r <- cor(protein_g ~ fat_g, data = BurgerKing)
r # same as cor(fat_g ~ protein_g)!
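# As a check (not shown in the book), the intercept can be recovered from
# the means using b0 = ybar - b1 * xbar; xbar and ybar are helper names
# introduced here for illustration. The result should match coef(burgerlm)[1].
xbar <- mean(~protein_g, data = BurgerKing)
ybar <- mean(~fat_g, data = BurgerKing)
ybar - (r * sy / sx) * xbar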
r * sy / sx
coef(burgerlm)[2]
```

#### Step-by-Step Example: Calculating a Regression Equation

We begin by loading the bridge dataset.

```{r message = FALSE}
TompkinsBridges <- read_csv("http://nhorton.people.amherst.edu/is5/data/Tompkins_county_bridges_2016.csv") %>%
  janitor::clean_names()
gf_point(condition ~ age_at_inspection, data = TompkinsBridges) %>%
  gf_smooth() %>% # To show relationship
  gf_labs(x = "Age at Inspection", y = "Condition")
```

See calculations on page 203.

### Section 7.4: Regression to the Mean

See Figure 7.4 on page 205 to visualize standard deviations.

### Section 7.5: Examining the Residuals

```{r}
msummary(burgerlm)
# Figure 7.5, page 207
gf_point(resid(burgerlm) ~ protein_g, data = BurgerKing) %>%
  gf_lm() %>%
  gf_labs(x = "Protein (g)", y = "Residuals (g fat)")
# Figure 7.6
gf_histogram(~ resid(burgerlm), binwidth = 10, center = 5) %>%
  gf_labs(x = "Residuals (g fat)", y = "# of Residuals")
```

### Section 7.6: $R^2$--The Proportion of Variation Accounted for by the Model

```{r}
rsquared(burgerlm)
```

### Section 7.7: Regression Assumptions and Conditions
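The conditions for regression are typically assessed with a residual plot: the residuals should show no pattern and roughly equal spread. As a minimal sketch (not an example from the text), we can reuse the `burgerlm` model fit above and plot its residuals against the fitted values:

```{r}
# Residuals vs. fitted values for the Burger King model:
# look for no curvature and roughly constant vertical spread
gf_point(resid(burgerlm) ~ fitted(burgerlm)) %>%
  gf_lm() %>%
  gf_labs(x = "Fitted values (g fat)", y = "Residuals (g fat)")
```

This mirrors the plot of residuals against `protein_g` in Section 7.5; plotting against the fitted values gives an equivalent picture for a simple regression.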