---
title: "Inferences for Regression (Chapter 23)"
author: "Patrick Frenett, Vickie Ip, and Nicholas Horton (nhorton@amherst.edu)"
date: "June 21, 2018"
output: 
  pdf_document:
    fig_height: 4
    fig_width: 6
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---
  
  
```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
require(mosaic)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
  tidy = FALSE,     # display code as typed
  size = "small",    # slightly smaller font for code
  fig.align = "center"
)
```

## Introduction and background 

This document is intended to help describe how to undertake analyses introduced 
as examples in the Fourth Edition of *Intro Stats* (2013) by De Veaux, Velleman, and Bock.
More information about the book can be found at http://wps.aw.com/aw_deveaux_stats_series.  This
file as well as the associated R Markdown reproducible analysis source file used to create it can be found at https://nhorton.people.amherst.edu/is4.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

Note that some of the figures in this document may differ slightly from those in the IS4 book due to small differences in datasets. However in all cases the analysis and techniques in R are accurate.

## Chapter 23: Inferences for Regression

### Section 23.1: The population and the sample

```{r message = FALSE}
library(mosaic)
library(readr)
BodyFat <- read_csv("https://nhorton.people.amherst.edu/sdm4/data/Body_fat_complete.csv")
```

```{r}
dim(BodyFat)
glimpse(BodyFat)
```

We can confirm the coefficients from the model on page 645.
```{r}
BodyFatmod <- lm(PctBF ~ waist, data = BodyFat)
coef(BodyFatmod)
```

### Section 23.2: Assumptions and conditions

We can regenerate the output and figures for the example on pages 647-651.

```{r}
msummary(BodyFatmod)
rsquared(BodyFatmod)
confint(BodyFatmod)    # see page 755
```

```{r message = FALSE}
# Figure 23.4
gf_point(PctBF ~ waist, data = BodyFat, alpha = 0.7) %>%
  gf_labs(x = "Waist (in.)") %>%
  gf_lm() %>%
  gf_smooth(col = "red", se = FALSE)

# Figure 23.5
gf_point(resid(BodyFatmod) ~ waist, data = BodyFat) %>%
  gf_labs(x = "Waist (in.)") %>%
  gf_lm() %>%
  gf_smooth(col = "red")
  
# equiv of Figure 23.6 note that Figure 23.6 refers to the diamonds dataset
gf_point(resid(BodyFatmod) ~ fitted(BodyFatmod), data = BodyFat) %>%
  gf_labs(x = "Predicted values", y = "Residuals") %>%
  gf_lm() %>%
  gf_smooth(col = "red")

# Figure on bottom of page 650
gf_qq(~ resid(BodyFatmod), col = "royalblue2", alpha = 0.7) %>%
  gf_qqline() %>%
  gf_labs(x = "qnorm", y = "resid(BodyFatmod)")
```


#### Section 23.6: Confidence intervals for predicted values

We can reproduce Figure 23.12 (page 662) using the `gf_lm(interval = )` function.

```{r}
gf_point(PctBF ~ waist, data = BodyFat, cex = 0.5) %>%
  gf_lm(interval = "prediction", col = "blue", fill = "green") %>%
  gf_lm(interval = "confidence", fill = "purple")
```

```{r}
Craters <- read.csv("https://nhorton.people.amherst.edu/sdm4/data/Craters.csv")
dim(Craters)
Craters <- mutate(Craters,
                  logDiam = log(Diam.km.),
                  logAge = log(age..Ma.))
Cratermod <- lm(logDiam ~ logAge, data = Craters)
df_stats(~ logAge, data = Craters) # note example in book has n=39
```

```{r}
confpred <- predict(Cratermod, interval = "confidence")
intpred <- predict(Cratermod, interval = "prediction")
select(Craters, -Name) %>% 
  head(., 3)
head(confpred, 3)
head(intpred, 3)
```

#### Section 23.7: Logistic regression

The Pima Indian dataset example is given on pages 663-667.
```{r message = FALSE}
Pima <- read_csv("https://nhorton.people.amherst.edu/sdm4/data/Pima_Indians_Diabetes.csv")
Diabetes <- filter(Pima, BMI > 0)  # get rid of missing values for BMI
gf_boxplot(BMI ~ as.factor(Diabetes), color = ~as.factor(Diabetes), data = Pima) %>%
  gf_labs(x = "Diabetes")
```

```{r fig.keep="last"}
pimamod <- glm(Diabetes ~ BMI, family = "binomial", data = Pima)
f2 <- makeFun(pimamod)
gf_point(Diabetes ~ BMI, data = Pima, alpha = 0.6, col = "red") %>%
  gf_fun(pimamod)
msummary(pimamod)
```