---
title: "IPS9 in R: Looking at Data -- Relationships (Chapter 2)"
author: "Margaret Chien and Nicholas Horton (nhorton@amherst.edu)"
date: "July 26, 2018"
output: 
  pdf_document:
    fig_height: 3
    fig_width: 5
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---


```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
require(mosaic)
require(vcd) # for the mosaicplot
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
knitr::opts_chunk$set(
  tidy = FALSE,     # display code as typed
  size = "small"    # slightly smaller font for code
)
```

## Introduction and background 

These documents are intended to help describe how to undertake analyses introduced 
as examples in the Ninth Edition of \emph{Introduction to the Practice of Statistics} (2017) by Moore, McCabe, and Craig.

More information about the book can be found [here](https://macmillanlearning.com/Catalog/product/introductiontothepracticeofstatistics-ninthedition-moore).
The data used in these documents can be found under Data Sets in the [Student Site](https://www.macmillanlearning.com/catalog/studentresources/ips9e?_ga=2.29224888.526668012.1531487989-1209447309.1529940008#). This
file as well as the associated R Markdown reproducible analysis source file used to create it can be found at https://nhorton.people.amherst.edu/ips9/.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic).  A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.
  
## Chapter 2: Looking at Data -- Relationships
This file replicates the analyses from Chapter 2: Looking at Data -- Relationships.

First, load the packages that will be needed for this document: 
```{r load-packages}
library(mosaic)
library(readr)
library(vcd)
```

### Section 2.1: Relationships 

### Section 2.2: Scatterplots

#### Example 2.8: Laundry detergents

```{r}
# page 85
Laundry <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-008LAUNDRY.csv")
```

We use `read_csv()` from the `readr` package to import data into R. We can use `message=FALSE` to suppress the warning messages for readability.    

```{r}
# 2.10: Examine the spreadsheet
favstats(~ Rating, data = Laundry)
```

The `favstats()` function summarizes useful statistics of variables.  

```{r}
# 2.11: Use the data set
gf_histogram(~ Rating, data = Laundry, binwidth = 5)
favstats(~ Price, data = Laundry)
gf_histogram(~ Price, data = Laundry, binwidth = 5)
```

#### Example 2.9: Laundry detergents

```{r}
Laundry <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-009LAUNDRY.csv")
# Figure 2.1, page 86
gf_point(Rating ~ Price, data = Laundry, xlab = "Price per load (cents)")
# Figure 2.2
gf_point(Rating ~ Price, data = Laundry, xlab = "Price per load (cents)") %>%
  gf_lims(x = c(0, 35))
```

```{r}
# 2.12: Make a scatterplot (page 87)
gf_jitter(Rating ~ Price, data = Laundry, xlab = "Price per load (cents)") %>%
  gf_lims(x = c(0, 35))
```

We can use `gf_jitter()` to add some noise into the plot to show overlapped points. We also use `gf_lims()` to set x-axis limits on the plot.  

```{r}
# 2.13: Change the units
Laundry2 <- Laundry %>%
  mutate(Price = Price/100)
favstats(~ Price, data = Laundry2)
gf_point(Rating ~ Price, data = Laundry2, xlab = "Price per load (dollars)")
```

We use `mutate()` to create new variables in a dataset.  

#### Example 2.10: Scatterplot with a straight line

```{r}
Laundry <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-010LAUND.csv")
# Figure 2.3, page 88
gf_point(Rating ~ Price, data = Laundry, xlab = "Price per load (cents)") %>%
  gf_lm() 
```

#### Example 2.11: Education spending and population: Benchmarking

```{r}
EduSpending <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-011EDSPEND.csv")
head(EduSpending)
```

We can use the `head()` function to look at the first rows of a data set.  

```{r}
# Figure 2.5, page 90
gf_point(Spending ~ Population, data = EduSpending) %>%
  gf_lm() %>%
  gf_labs(x = "Population (millions)", y = "Spending ($ billion)")
```

#### Example 2.12: Calcium retention

```{r}
Calcium <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-012CALCIUM.csv")
# Figure 2.6, page 90
gf_point(CaRetention~ CaIntake, data = Calcium) %>%
  gf_smooth(span = 1) %>%
  gf_labs(x = "Calcium intake (mg/d)", y = "Calcium retention (mg/d)")
```

#### Example 2.13: Calcium retention with logarithms

```{r}
Calcium <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-013CALCIUM.csv")
# Figure 2.7, page 91
gf_point(LogRet ~ CaIntake, data = Calcium) %>%
  gf_lm() %>%
  gf_labs(x = "Calcium intake (mg/d)", y = "Log calcium retention")
```

#### Example 2.14: Education spending and population with logarithms

```{r}
EduSpending <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-014EDSPEND.csv")
EduSpending <- EduSpending %>%
  mutate(LogPop = log(Population), LogSpend = log(Spending))
# Figure 2.8, page 92
gf_point(LogSpend ~ LogPop, data = EduSpending) %>%
  gf_lm() %>%
  gf_labs(x = "Log population", y = "Log spending")
```

#### Example 2.15: Rating versus price and type of laundry detergent

```{r}
Laundry <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-015LAUND.csv")
# Figure 2.9, page 93
gf_point(Rating ~ Price, color = ~ Type, data = Laundry, xlab = "Price per load (cents)") %>%
  gf_lm()
```

#### Example 2.16: Laundry rating versus price with a smooth fit

```{r}
Laundry <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-016LAUND.csv")
# Figure 2.10, page 94-95
gf_point(Rating ~ Price, data = Laundry) %>%
  gf_smooth(span = .5)
gf_point(Rating ~ Price, data = Laundry) %>%
  gf_smooth(span = 5)
```

#### Example 2.17: A smooth fit for education spending and population with logs

```{r}
# Figure 2.8, page 92
gf_point(LogSpend ~ LogPop, data = EduSpending) %>%
  gf_smooth() %>%
  gf_labs(x = "Log population", y = "Log spending")
```

### Section 2.3: Correlation

#### Use Your Knowledge: Laundry detergents

```{r}
# page 102
cor(Rating ~ Price, data = Laundry)
```

The `cor()` function finds the correlation of two variables.  

### Section 2.4: Least Squares Regression

#### Example 2.19: Fidgeting and fat gain

```{r}
Fidgeting <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-019FIDGET.csv")
# Figure 2.16, page 108
gf_point(Fat ~ NEA, data = Fidgeting) %>%
  gf_labs(x = "Nonexercise activity (calories)", y = "Fat gain (kilograms)")
```

#### Example 2.20: Regression line for fat gain

```{r}
Fidgeting <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-020FIDGET.csv")
# Figure 2.17, page 109
gf_point(Fat ~ NEA, data = Fidgeting) %>%
  gf_lm() %>%
  gf_labs(x = "Nonexercise activity (calories)", y = "Fat gain (kilograms)")
# Use Your Knowledge 2.61: Plot the line
gf_point(Fat ~ NEA, data = Fidgeting) %>%
  gf_abline(slope = -.00344, intercept = 4.505) %>%
  gf_labs(x = "Nonexercise activity (calories)", y = "Fat gain (kilograms)") +
  ylim(0, 5)
```

#### Example 2.21: Prediction for fat gain

```{r}
fatlm <- lm(Fat ~ NEA, data = Fidgeting)
fatfun <- makeFun(fatlm)
fatfun(NEA = 400)
```

We use `makeFun()` to create a function. Here, we make a function from our linear model, created from `lm()`, so we can find the output of a certain value of `NEA`.  

#### Example 2.24: Regression

```{r}
# page 113
msummary(fatlm)
```

The `msummary()` function shows the properties of the function.  

#### Example 2.25: Fidgeting and fat gain

```{r}
# Figure 2.20, page 115 (split into two plots)
gf_point(Fat ~ NEA, data = Fidgeting) %>%
  gf_lm() %>%
  gf_labs(x = "Nonexercise activity (calories)", y = "Fat gain (kilograms)")
gf_point(NEA ~ Fat, data = Fidgeting) %>%
  gf_lm(color = "red") %>%
  gf_labs(x = "Fat gain (kilograms)", y = "Nonexercise activity")

# Models
fatlm 
NEAlm <- lm(NEA ~ Fat, data = Fidgeting)
NEAlm 
```

### Section 2.5: Cautions about Correlation and Regression

#### Example 2.26: Residuals for fat gain

Here, we find the residual:  

```{r}
fatfun(NEA = 135)
2.7 - fatfun(NEA = 135)
```

```{r}
# Figure 2.23, page 124
gf_point(Fat ~ NEA, data = Fidgeting) %>%
  gf_lm() %>%
  gf_labs(x = "Nonexercise activity (calories)", y = "Fat gain (kilograms)")
gf_point(resid(fatlm) ~ NEA, data = Fidgeting) %>%
  gf_lm() %>%
  gf_labs(x = "Nonexercise activity (calories)", y = "Residual")
```

#### Example 2.27: Patterns in birthrate and Internet user residuals

```{r}
IntBirth <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-027INBIRTH.csv")
intbirthlm <- lm(BirthRate2011 ~ Users, data = IntBirth)
# Figure 2.24, page 126
gf_point(BirthRate2011 ~ Users, data = IntBirth) %>%
  gf_lm() %>%
  gf_labs(x = "Internet users per 100 people", y = "Births per 1000 people")
gf_point(resid(intbirthlm) ~ Users, data = IntBirth) %>%
  gf_lm() %>%
  gf_labs(x = "Internet users per 100 people", y = "Residual")
```

#### Example 2.28: Diabetes and blood sugar

```{r}
Diabetes <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-028HBA1C.csv")
diabeteslm <- lm(FPG_mg_ml ~ HbA1c_percent, data = Diabetes)
# Figure 2.25, page 127
gf_point(FPG_mg_ml ~ HbA1c_percent, data = Diabetes) %>%
  gf_lm() %>%
  gf_labs(x = "HbA1c (percent)", y = "Fasting plasma glucose (mg/dl)")
gf_point(resid(diabeteslm) ~ HbA1c_percent, data = Diabetes) %>%
  gf_lm() %>%
  gf_labs(x = "HbA1c (percent)", y = "Residual")
```

#### Example 2.29: Influential observations

We can use the `filter()` function to remove rows from a data set:  

```{r}
without15lm <- lm(FPG_mg_ml ~ HbA1c_percent, data = filter(Diabetes, FPG_mg_ml <= 300)) # model without subject 15
without18lm <- lm(FPG_mg_ml ~ HbA1c_percent, data = filter(Diabetes, HbA1c_percent <= 18)) # model without subject 18

# Figure 2.26, page 129
gf_point(FPG_mg_ml ~ HbA1c_percent, data = Diabetes) %>%
  gf_lm(color = "red") %>%
  gf_fun(without15lm, linetype = 2, color = "red") %>%
  gf_fun(without18lm, linetype = 3, color = "red") %>%
  gf_labs(x = "HbA1c (percent)", y = "Fasting plasma glucose (mg/dl)")
```

### Section 2.6: Data Analysis from Two Way Tables

#### Example 2.33: Is the calcium intake adequate?

```{r}
Calcium <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-033IOM.csv")
Calcium 
```

To create a data set with the structure we want (with each count as an observation), we can use `rbind()`.  

```{r}
# page 137

# Creating data set from counts in the data table
CalciumC <- rbind(
 do(194) * data.frame(Age = "A05to10", Met = "No"),
 do(861) * data.frame(Age = "A05to10", Met = "Yes"), 
 do(557) * data.frame(Age = "A11to13", Met = "No"),
 do(417) * data.frame(Age = "A11to13", Met = "Yes")
)

# Table
tally(Met ~ Age, data = CalciumC)
```

#### Example 2.34: Add the margins to the table

```{r}
tally(Met ~ Age, data = CalciumC, margins = TRUE)
tally(Age ~ Met, data = CalciumC, margins = TRUE)
```

#### Example 2.35: The joint distribution

```{r}
tally(~ Age + Met, data = CalciumC, format = "proportion")
```

#### Example 2.36: The marginal distribution of age

```{r}
tally(~ Age, data = CalciumC, format = "proportion")
```

#### Example 2.37: The marginal distribution of "met requirement"

```{r}
tally(~ Met, data = CalciumC, format = "proportion")
```

#### Example 2.39: Conditional distribution of "met requirement" for children aged 5 to 10

```{r}
# page 141
tally(Met ~ Age, data = CalciumC, format = "proportion")
```

#### Example 2.40: Software output

We can use the `mosaic()` function from the `vcd` package to create mosaic plots with color.  

```{r}
# Figure 2.28 mosaic plot (page 143)
vcd::mosaic(~ Met + Age, data = CalciumC, shade = TRUE)
```

#### Example 2.41: Which customer service representative is better?

```{r}
# page 143
CustomerService <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-041CUSTSER.csv")
CustomerService %>%
  select(Rep, GoalMet, Count)
```

#### Example 2.42: Look at the data more carefully

```{r}
CustomerService %>%
  select(Rep, GoalMet_1, Count_1, GoalMet_2, Count_2)
```

### Section 2.7: The Question of Causation