--- title: "IS5 in R: Inferences for Regression (Chapter 20)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "December 13, 2020" output: pdf_document: fig_height: 3 fig_width: 5 html_document: fig_height: 3 fig_width: 5 word_document: fig_height: 4 fig_width: 6 --- ```{r, include = FALSE} # Don't delete this chunk if you are using the mosaic package # This loads the mosaic and dplyr packages library(mosaic) library(readr) library(janitor) ``` ```{r, include = FALSE} # knitr settings to control how R chunks work. require(knitr) opts_chunk$set( tidy = FALSE, # display code as typed size = "small" # slightly smaller font for code ) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. ## Chapter 20: Inferences for Regression ```{r message = FALSE} library(mosaic) library(readr) library(janitor) BodyFat <- read_csv("http://nhorton.people.amherst.edu/is5/data/Bodyfat.csv") %>% janitor::clean_names() ``` By default, `read_csv()` prints the variable names. These messages have been suppressed using the `message=FALSE` code chunk option to save space and improve readability. Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). ```{r} # Figure 20.1, page 642 gf_point(pct_bf ~ waist, data = BodyFat) %>% gf_smooth() %>% # to show linear relationship gf_labs(x = "Waist (in.)", y = "% Body Fat") ``` ### Section 20.1: The Regression Model ```{r} lm(pct_bf ~ waist, data = BodyFat) ``` ```{r} # Figure 20.2 gf_histogram(~pct_bf, data = BodyFat, binwidth = 2.5, center = 1.25) %>% gf_labs(x = "% Body Fat", y = "# of Men") # Figure 20.3 (reinterpreted with points) BodyFat <- BodyFat %>% mutate(roundedwaist = cut(waist, breaks = c(0, 30, 35, 40, 45, Inf), labels = c(1:5))) gf_point(pct_bf ~ jitter(as.numeric(roundedwaist)), data = BodyFat) %>% gf_lm() %>% gf_labs(y = "% Body Fat", x = "Waist") ``` #### Random Matters: Slopes Vary ```{r} numsamp <- 25 # It's too messy to do any more than 25 slopesdata <- do(numsamp) * lm(pct_bf ~ waist, data = resample(BodyFat)) ``` For more information about `resample()`, refer to the [resample vignette in mosaic](https://cran.r-project.org/web/packages/mosaic/vignettes/Resampling.pdf). 
```{r fig.height = 8, fig.width = 7}
slopesdata <- slopesdata %>%
  mutate(
    at27 = Intercept + waist * 27,
    at50 = Intercept + waist * 50,
    color = as.factor(1:25)
  )
# Figure 20.4, page 644
gf_point(pct_bf ~ waist, data = BodyFat) %>%
  gf_segment(at27 + at50 ~ 27 + 50, data = slopesdata, color = ~ color) %>%
  gf_labs(color = "Sample", x = "Waist (in.)", y = "Percent Body Fat")
```

```{r}
numsamp <- 1000 # To see the shape of the histogram
slopesdata <- do(numsamp) * lm(pct_bf ~ waist, data = resample(BodyFat))
# Figure 20.5
gf_histogram(~ waist, data = slopesdata, binwidth = .05, center = .025) %>%
  gf_labs(x = "Slope", y = "# of Trials")
```

For the histogram, we use 1,000 trials.

### Section 20.2: Assumptions and Conditions

```{r}
# Figure 20.6 is the same as Figure 20.1
# Figure 20.7 (page 645)
bodyfatlm <- lm(pct_bf ~ waist, data = BodyFat)
gf_point(resid(bodyfatlm) ~ fitted(bodyfatlm)) %>%
  gf_labs(x = "Predicted", y = "Residuals")
```

```{r message = FALSE}
Diamonds <- read_csv("http://nhorton.people.amherst.edu/is5/data/Diamonds.csv") %>%
  janitor::clean_names()
```

Here we fit `price` by `carat_size` for diamonds with the color `E`.

```{r}
diamondlm <- lm(price ~ carat_size, data = filter(Diamonds, color == "E"))
# Figure 20.8, page 646
gf_point(resid(diamondlm) ~ fitted(diamondlm)) %>%
  gf_labs(x = "Predicted Values", y = "Residuals")
# Figure 20.9
gf_histogram(~ resid(bodyfatlm), binwidth = 2, center = 1) %>%
  gf_labs(x = "Residuals", y = "Count")
```

Figure 20.10 is intended to convey the same idea as Figure 20.3 (page 643).

#### Example 20.1: Checking Assumptions and Conditions

Note that points are removed to match the results in the textbook!

```{r message = FALSE}
Craters <- read_csv("http://nhorton.people.amherst.edu/is5/data/Craters.csv") %>%
  janitor::clean_names() %>%
  filter(log_age <= 1.5) # Removed points to match the textbook
```

```{r}
gf_point(log_diam ~ log_age, data = Craters) %>%
  gf_lm() %>%
  gf_labs(x = "Log(Age)", y = "Log(Diameter)")
craterlm <- lm(log_diam ~ log_age, data = Craters)
gf_point(resid(craterlm) ~ fitted(craterlm)) %>%
  gf_lm() %>%
  gf_labs(x = "Predicted", y = "Residuals")
gf_qq(~ resid(craterlm)) %>%
  gf_qqline(linetype = "solid", color = "red") %>%
  gf_labs(x = "Nscores", y = "Residuals")
gf_histogram(~ resid(craterlm), binwidth = .5, center = 0.25) %>%
  gf_labs(x = "Residuals", y = "# of Craters")
```

#### Step-By-Step Example: Regression Inference

The scatterplot for this example matches Figure 20.1; here we display the normal probability plot of the residuals and the regression output.

```{r}
gf_qq(~ resid(bodyfatlm)) %>%
  gf_qqline(linetype = "solid", color = "red") %>%
  gf_labs(x = "Normal Scores", y = "Residuals (% body fat)")
msummary(bodyfatlm)
```

### Section 20.3: Regression Inference and Intuition

See the displays on pages 650 and 651.
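The intuition behind those displays can also be checked numerically: the standard error of the slope grows with the spread of the residuals and shrinks with the spread of the predictor and the sample size. Here is a sketch (not part of the textbook example) that computes `SE(b1) = s_e / (s_x * sqrt(n - 1))` by hand for the body fat regression; the result can be compared with the slope standard error reported by `msummary(bodyfatlm)`.

```{r}
n <- nrow(BodyFat)
s_e <- sqrt(sum(resid(bodyfatlm)^2) / (n - 2)) # residual standard error
s_x <- sd(~ waist, data = BodyFat) # spread of the predictor
s_e / (s_x * sqrt(n - 1)) # should match the SE for waist in msummary(bodyfatlm)
```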
#### Example 20.2: Confidence Interval and Hypothesis Test for a Slope

```{r}
msummary(bodyfatlm)
mean <- 1.70 # estimated slope for waist
se <- .074 # standard error of the slope
tstats <- qt(p = c(.025, .975), df = 248) # critical values with n - 2 = 248 df
tstats
mean + tstats * se # 95% confidence interval for the slope
t <- (mean - 0.00) / se # t statistic for testing H0: slope = 0
t
```

### Section 20.4: The Regression Table

```{r}
# Table 20.1, page 654
msummary(bodyfatlm)
```

### Section 20.5: Multiple Regression Inference

```{r}
# Table 20.2, page 655
bodyfatmlm <- lm(pct_bf ~ waist + height, data = BodyFat)
msummary(bodyfatmlm)
```

#### Just Checking

```{r message = FALSE}
Mouth <- read_csv("http://nhorton.people.amherst.edu/is5/data/Mouth_volume.csv")
mouthlm <- lm(Mouth_Volume ~ Height, data = Mouth) # simple linear model
df_stats(~ Mouth_Volume, data = Mouth)
msummary(mouthlm)
mouthmlm <- lm(Mouth_Volume ~ Age + Height, data = Mouth) # multiple linear model
msummary(mouthmlm)
```

#### Collinearity

```{r message = FALSE}
Coasters <- read_csv("http://nhorton.people.amherst.edu/is5/data/Coasters_2015.csv")
Coasters <- Coasters %>%
  # Removed Tower of Terror and Xcelerator (artificially accelerated coasters)
  filter(Name != "Tower of Terror", Name != "Xcelerator") %>%
  filter(!is.na(Drop), !is.na(Duration)) %>% # drop coasters with missing values
  mutate(Inversions = as.factor(Inversions))
coasterlm <- lm(Duration ~ Drop, data = Coasters) # simple linear model
msummary(coasterlm)
coastermlm <- lm(Duration ~ Drop + Speed, data = Coasters) # multiple linear regression model
msummary(coastermlm)
```

```{r}
gf_point(Speed ~ Drop, data = Coasters) %>%
  gf_lm()
```

### Section 20.6: Confidence and Prediction Intervals

```{r}
# Figure 20.16, page 659
gf_point(pct_bf ~ waist, data = BodyFat) %>%
  gf_lm(interval = "confidence", fill = "red") %>%
  gf_lm(interval = "prediction", fill = "navy") %>%
  gf_labs(x = "Waist (in.)", y = "% Body Fat")
```

### Section 20.7: Logistic Regression

```{r message = FALSE}
PimaIndians <- read_csv("http://nhorton.people.amherst.edu/is5/data/Pima_indians.csv")
PimaIndians <- PimaIndians %>%
  filter(BMI != 0) # remove records with implausible BMI values of zero
# Figure 20.17, page 661
PimaIndians %>%
  mutate(Diabetes = ifelse(Diabetes == 1, "Diabetic", "Not Diabetic")) %>%
  gf_boxplot(BMI ~ as.factor(Diabetes), xlab = "Diabetes")
# Figure 20.21, page 663
gf_point(Diabetes ~ BMI, data = PimaIndians, ylab = "Diabetes (1 = yes)") %>%
  gf_smooth(method = "glm", method.args = list(family = "binomial"))
```

### Section 20.8: More About Regression