--- title: "IPS9 in R: Inference for Regression (Chapter 10)" author: "Bonnie Lin and Nicholas Horton (nhorton@amherst.edu)" date: "July 19, 2018" output: pdf_document: fig_height: 3 fig_width: 5 html_document: fig_height: 3 fig_width: 5 word_document: fig_height: 4 fig_width: 6 --- ```{r, include = FALSE} # Don't delete this chunk if you are using the mosaic package # This loads the mosaic and dplyr packages require(mosaic) ``` ```{r, include = FALSE} # knitr settings to control how R chunks work. knitr::opts_chunk$set( tidy = FALSE, # display code as typed size = "small" # slightly smaller font for code ) ``` ## Introduction and background These documents are intended to help describe how to undertake analyses introduced as examples in the Ninth Edition of \emph{Introduction to the Practice of Statistics} (2017) by Moore, McCabe, and Craig. More information about the book can be found [here](https://macmillanlearning.com/Catalog/product/introductiontothepracticeofstatistics-ninthedition-moore). The data used in these documents can be found under Data Sets in the [Student Site](https://www.macmillanlearning.com/catalog/studentresources/ips9e?_ga=2.29224888.526668012.1531487989-1209447309.1529940008#). This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at https://nhorton.people.amherst.edu/ips9/. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. ## Chapter 10: Inference for regression This file replicates the analyses from Chapter 10: Inference for regression. First, load the packages that will be needed for this document: ```{r load-packages} library(mosaic) library(readr) ``` ### Section 10.1: Simple linear regression ```{r eg10-01, message=FALSE} PABME <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter10/EG10-01PABMI.csv") ### Figure 10.3, page 559 gf_point(BMI ~ PA, data = PABME) %>% gf_lm() %>% gf_labs(x = "PA (Thousands of steps)", y = "BMI (kg/m^2)") # gf_lm() adds the least-squares line ## Creating the linear model with the lm() function lm_PABME <- lm(BMI ~ PA, data = PABME) ## Displaying the output with the msummary() function msummary(lm_PABME) ``` By default, the `read_csv()` function will output the types of columns, as we see above. To improve readability for future coding, we will suppress the "Parsed with column specification" message by adding `message = FALSE` at the top of the code chunks. You would interpret this output by reporting the model as $\hat{BMI} = 29.578 - 0.655 (PA)$, the same way as the textbook does on page 563. This equation also defines the straight line that was plotted in Figure 10.3, using the `gf_lm()` function. Suppose that a female college student averages 8000 steps per day. By making the linear model into a function, we can predict the BMI of the person. 
```{r eg10-4}
### Example 10.4, page 564
PABME_mod <- makeFun(lm_PABME)
PABME_mod(8)  # BMI estimate

### Figure 10.5, page 566
gf_point(resid(lm_PABME) ~ PA, data = PABME) %>%
  gf_smooth(span = 2) %>%
  gf_labs(x = "PA (thousands of steps)", y = "Residual")

### Figure 10.6, page 566
gf_qq(~ rstandard(lm_PABME)) %>%
  gf_qqline() %>%
  gf_labs(x = "Normal score", y = "Residual")
```

If the student's actual BMI is 25.66, you can find the residual by taking the difference between the observed and predicted values. In this case, the residual would be 1.317 (using the full-precision coefficients rather than the rounded equation above). Note that the Normal quantile plot of the residuals uses standardized residuals, which requires the `rstandard()` function.

```{r eg10-7}
### Figure 10.7 and 10.8, page 571
gf_point(BMI ~ PA, data = PABME) %>%
  gf_lm(interval = "confidence", fill = "red") %>%
  gf_lm(interval = "prediction", fill = "navy") %>%
  gf_labs(x = "PA (thousands of steps)", y = "BMI (kg/m^2)")
```

The red bands show the 95% confidence limits, while the navy bands show the 95% prediction limits.

```{r eg10-11, message = FALSE}
ENTRE <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter10/EG10-11ENTRE.csv")
### Figure 10.9, page 575
gf_point(INC ~ EDUC, data = ENTRE) %>%
  gf_smooth(span = 2) %>%
  gf_lm(color = "red")

### Figure 10.10, page 575
log_inc_ENTRE <- ENTRE %>%
  mutate(log_INC = log(INC))
gf_point(log_INC ~ EDUC, data = log_inc_ENTRE) %>%
  gf_smooth(span = 2) %>%
  gf_lm(color = "red") %>%
  gf_labs(y = "log(INC)")
```

On these scatterplots of income (and log income) versus education, we have plotted the smooth function (blue) and the least-squares line (red).

### Section 10.2: More details about simple linear regression

Only two columns in this dataset are not filled with NAs, so we use the `select()` and `na.omit()` functions to keep just the variables needed for the analysis.

```{r eg10-16, message = FALSE}
GADIA <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter10/EG10-16GADIA.csv") %>%
  select(Diameter, GA) %>%
  na.omit()
### Example 10.17, page 589
favstats(~ Diameter, data = GADIA)
favstats(~ GA, data = GADIA)
cor(GA ~ Diameter, data = GADIA)

### Example 10.18, 10.20, and 10.21, page 589
lm_GADIA <- lm(GA ~ Diameter, data = GADIA)
msummary(lm_GADIA)
```

From the `msummary()` output, you can read off the least-squares regression line, estimate the standard deviation about the line (the residual standard error), and test the hypothesis of zero slope using the t statistic and P-value in the `Diameter` row.

```{r eg10-16cont, message = FALSE}
### Example 10.19, page 590
gf_point(resid(lm_GADIA) ~ Diameter, data = GADIA) %>%
  gf_lm() %>%
  gf_labs(x = "Diameter", y = "Residual")

### Example 10.22, page 592
confint(lm_GADIA, 'Diameter')

### Example 10.23, page 593
new.dat <- data.frame(Diameter = 10)
predict(lm_GADIA, newdata = new.dat, interval = 'confidence')
```

The `predict()` call reports the fitted value along with a 95% confidence interval for the mean GA of trees with `Diameter = 10`.
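For comparison, `predict()` can also produce a prediction interval for the GA of a single tree with the same diameter. This is a minimal sketch using the already-fitted model (the `interval = "prediction"` option is part of base R's `predict.lm()`):

```{r eg10-23-prediction}
# Prediction interval for an individual tree with Diameter = 10;
# it is wider than the confidence interval for the mean response above
predict(lm_GADIA, newdata = new.dat, interval = "prediction")
```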