---
title: "IS5 in R: The Standard Deviation as a Ruler and the Normal Model (Chapter 5)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "December 19, 2020"
output: 
  pdf_document:
    fig_height: 3
    fig_width: 5
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---


```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
library(mosaic)
library(readr)
library(janitor)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
  tidy = FALSE, # display code as typed
  size = "small" # slightly smaller font for code
)
```

## Introduction and background 

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock.
This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.
In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses.
A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

## Chapter 5: The Standard Deviation as a Ruler and the Normal Model

```{r message = FALSE}
library(mosaic)
library(readr)
library(janitor)
WomenHeptathlon2016 <-
  read_csv("http://nhorton.people.amherst.edu/is5/data/Womens_Heptathlon_2016.csv") %>%
  janitor::clean_names()
```

By default, `read_csv()` prints a message listing the variable names and their types.
This message was suppressed using the `message = FALSE` code chunk option to save space and improve readability.
Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). 

```{r}
# page 123
df_stats(~long_jump, data = WomenHeptathlon2016)
df_stats(~x200m, data = WomenHeptathlon2016)
with(WomenHeptathlon2016, stem(x200m))
with(WomenHeptathlon2016, stem(long_jump))
```

### Section 5.1: Using the Standard Deviation to Standardize Values

```{r}
filter(WomenHeptathlon2016, last_name == "Thiam") %>%
  tibble()
# calculate z-score with mean and sd from df_stats
(6.58 - 6.17) / .247 # long jump
filter(WomenHeptathlon2016, last_name == "Johnson-Thompson") %>%
  tibble()
```

The `tibble()` function converts an object into a data frame (you may also see `data.frame()` used for this purpose).
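
The hand calculation above can be wrapped in a small helper. This is a minimal sketch: `z_score()` is a hypothetical name (it is not part of `mosaic`), and the inputs are the long jump value, mean, and standard deviation already used in the chunk above.

```{r}
# Hypothetical helper (not from mosaic): standardize a value given a mean and SD
z_score <- function(value, center, spread) {
  (value - center) / spread
}

# Long jump: same numbers as the hand calculation above
long_jump_z <- z_score(6.58, center = 6.17, spread = 0.247)
long_jump_z
```

The same helper could be applied to any other event once its mean and standard deviation are read off from `df_stats()`.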

### Section 5.2: Shifting and Scaling

#### Shifting to Adjust the Center

We begin by reading in the data.

```{r message = FALSE}
MenWeight <- read_csv("http://nhorton.people.amherst.edu/is5/data/Mens_Weights.csv") %>%
  janitor::clean_names()
# Figure 5.2, page 125
gf_histogram(~weight_in_kg, data = MenWeight, binwidth = 10, center = 5) %>%
  gf_labs(x = "Weight (kg)", y = "# of Men")
gf_boxplot(~weight_in_kg, data = MenWeight, xlab = "Weight (kg)")
```

```{r}
df_stats(~weight_in_kg, data = MenWeight)
# Figure 5.3
gf_histogram(~ (weight_in_kg - 74), data = MenWeight, binwidth = 10) %>%
  gf_labs(x = "Kg Above Recommended Weight", y = "# of Men")
```

#### Rescaling to Adjust the Scale

Let's review the data from the `MenWeight` dataset.

```{r message=FALSE}
df_stats(~weight_in_kg, data = MenWeight)
df_stats(~weight_in_pounds, data = MenWeight)
library(tidyr) # for pivot_longer() function

# What does pivot_longer() do?
MenWeight %>%
  head() # There are two variables: weight_in_kg and weight_in_pounds. 
# Each observation has a value for each.
nrow(MenWeight)
MenLonger <- MenWeight %>%
  pivot_longer(cols = starts_with("weight"), 
               values_to = "weight",
               names_to = "weighttype")
MenLonger %>%
  head() # The two variables are weighttype and weight. weighttype is a categorical variable that is either in kg or pounds
nrow(MenLonger) # Each observation from before is now two rows
```

Here we use the `tidyr::pivot_longer()` function to transform the dataset into the needed format, which can be seen with the `head()` function.
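
To see the reshaping on a scale small enough to check by eye, the sketch below applies the same kind of call to a made-up two-row data frame. The `toy_wide` data and its `id` column are hypothetical and are not part of the `MenWeight` dataset.

```{r}
library(tidyr)

# Hypothetical two-row dataset in "wide" form (illustration only)
toy_wide <- data.frame(
  id = c(1, 2),
  weight_in_kg = c(70, 80),
  weight_in_pounds = c(154, 176)
)

# Stack the two weight columns into a name column and a value column
toy_long <- pivot_longer(
  toy_wide,
  cols = starts_with("weight"),
  names_to = "weighttype",
  values_to = "weight"
)
toy_long
nrow(toy_long) # twice the number of rows in toy_wide
```

Each original row contributes one row per weight column, which is why the longer dataset has twice as many rows.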

```{r}
MenLonger %>%
  gf_boxplot(weight ~ weighttype) %>%
  gf_labs(x = "Weight Type", y = "")
```

The call `gf_boxplot(weight ~ weighttype)` is an example of the `goal(Y ~ X)` template, the general modeling language for two variables in the `mosaic` package.


#### Shifting, Scaling, and the *z*-Scores
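
The following is a short sketch of how shifting and rescaling affect summary statistics, and of why *z*-scores end up with mean 0 and standard deviation 1. The `weights_kg` vector is made-up illustrative data (not the `MenWeight` dataset), and 2.2 is a rounded kg-to-pounds factor assumed for the example; 74 is the shift used for Figure 5.3.

```{r}
# Hypothetical weights in kg, for illustration only
weights_kg <- c(62, 70, 74, 81, 88, 95)

# Shifting: subtracting a constant moves the mean but leaves the SD alone
shifted <- weights_kg - 74
c(mean(weights_kg), mean(shifted))
c(sd(weights_kg), sd(shifted))

# Rescaling: multiplying by a constant multiplies both the mean and the SD
weights_lb <- weights_kg * 2.2 # approximate conversion factor (assumption)
c(mean(weights_lb) / mean(weights_kg), sd(weights_lb) / sd(weights_kg))

# Standardizing: shift by the mean, then rescale by the SD
z <- (weights_kg - mean(weights_kg)) / sd(weights_kg)
c(mean(z), sd(z)) # mean is 0 (up to rounding error) and SD is 1
```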

### Section 5.3: Normal Models 

#### The 68-95-99.7 Rule

See display on page 129.  

```{r}
# Figure 5.6
# 1, 2 (1.96), and 3 SDs
xpnorm(c(-3, -1.96, -1, 1, 1.96, 3), mean = 0, sd = 1, verbose = FALSE)
# 2 (1.96) and 3 SDs
xpnorm(c(-3, -1.96, 1.96, 3), mean = 0, sd = 1, verbose = FALSE)
# 3 SDs
xpnorm(c(-3, 3), mean = 0, sd = 1, verbose = FALSE)
```
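
The percentages in the rule can also be recovered numerically. The sketch below uses base R's `pnorm()`; the helper name `within_sd()` is hypothetical and introduced only for this illustration.

```{r}
# Probability of falling within k SDs of the mean under a Normal model
within_sd <- function(k) pnorm(k) - pnorm(-k)

round(within_sd(1:3), 4)
# → 0.6827 0.9545 0.9973

# Using 1.96 in place of 2 gives almost exactly 95%
within_sd(1.96)
```

This is why the plots above mark 1.96 alongside 2: the "95" in the rule is a rounded figure for 2 SDs.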

#### Example 5.4: Using the 68-95-99.7 Rule

We begin by reading in the data.

```{r message = FALSE}
BodyFat <- read_csv("http://nhorton.people.amherst.edu/is5/data/Bodyfat.csv")
gf_histogram(~Wrist,
  data = BodyFat, binwidth = .5,
  center = -.25
) %>%
  gf_labs(x = "Wrist Circ (cm)", y = "# of Men")
```

#### Random Matters

Starts on page 133.

```{r message = FALSE}
Commute <-
  read_csv("http://nhorton.people.amherst.edu/is5/data/Population_Commute_Times.csv") %>%
  janitor::clean_names()

gf_histogram(~commute_time, data = Commute, binwidth = 10, center = 5) %>%
  gf_labs(x = "Commute Times (min)", y = "# of Employees")

set.seed(2143) # To ensure we get the same values when we run it multiple times
numsim <- 10000 # Number of simulations


mean(~commute_time, data = sample(Commute, size = 100)) # Mean of one random sample
mean(~commute_time, data = sample(Commute, size = 100)) # Mean of another random sample
```

The `mosaic::do()` command allows us to run a command multiple times, saving the result as a data frame.

```{r}
do(2) * mean(~commute_time, data = sample(Commute, size = 100))

# For the visualization, we use do() 10,000 times
Commute_sample <- do(numsim) * mean(~commute_time, data = sample(Commute, size = 100))
```

The `do()` function generates 10,000 samples of size 100 and for each calculates the sample mean.
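
For readers curious about what this repetition amounts to, here is a rough base R analogue using `replicate()`. It is a sketch under stated assumptions: the `population` vector is simulated from an exponential distribution purely for illustration (it is not the `Commute` data), and `one_sample_mean()` is a hypothetical helper. Only 1,000 repetitions are used to keep it quick.

```{r}
set.seed(2143)

# Hypothetical right-skewed "population" (simulated, not the Commute data)
population <- rexp(5000, rate = 1 / 30)

# One sample mean from a random sample of size 100
one_sample_mean <- function() mean(sample(population, size = 100))

# Base R analogue of do(n) * ...: repeat the calculation many times
sample_means <- replicate(1000, one_sample_mean())

mean(population) # center of the population
mean(sample_means) # the sample means cluster near that value
sd(population) # spread of individual values
sd(sample_means) # sample means vary much less
```

Unlike `do()`, `replicate()` returns a plain numeric vector rather than a data frame, so it would need to be wrapped in `data.frame()` before plotting with `gf_histogram()`.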

```{r}
gf_histogram(~mean, data = Commute_sample) %>%
  gf_labs(x = "Means of Samples of Size 100", y = "# of Samples")
```

### Section 5.4: Working with Normal Percentiles

The `pnorm()` function calculates normal probabilities: the area under a Normal model to the left of a given value. The `xpnorm()` function from the `mosaic` package adds a graphical depiction and additional output that may be helpful to new users.

```{r}
xpnorm(1.8, mean = 0, sd = 1)
```
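
Since `pnorm()` reports the area to the left, other areas follow by subtraction. The sketch below shows the usual patterns with base R only; the second cutoff of -1 is an arbitrary value chosen for illustration.

```{r}
# Area to the left of z = 1.8 under the standard Normal model
left_area <- pnorm(1.8, mean = 0, sd = 1)

# Area to the right: take the complement, or ask for the upper tail directly
right_area <- 1 - left_area
right_area_direct <- pnorm(1.8, lower.tail = FALSE)

# Area between two values: subtract the two left-hand areas
between_area <- pnorm(1.8) - pnorm(-1)

round(c(left_area, right_area, between_area), 4)
```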

The `qnorm()` function works in the other direction: given a probability (the area to the left), it returns the corresponding value.

```{r}
xqnorm(0.964, mean = 500, sd = 100) # inverse of pnorm()
qnorm(0.964, mean = 0, sd = 1) # what is the z-score?
```
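
The two calls above are linked: a cutoff on any Normal scale is the mean plus the *z*-score times the standard deviation. A minimal check with base R, reusing the same probability, mean, and standard deviation (the variable names are introduced here for illustration):

```{r}
# qnorm() undoes pnorm(): probability -> value -> probability
z_cut <- qnorm(0.964) # z-score with 96.4% of the area to its left
pnorm(z_cut) # recovers 0.964

# On another scale, the cutoff is mean + z * sd
cutoff <- qnorm(0.964, mean = 500, sd = 100)
all.equal(cutoff, 500 + z_cut * 100)
```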

See examples on pages 136-140.

### Section 5.5: Normal Probability Plots

We begin by reading in the data.

```{r message = FALSE}
Nissan <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nissan.csv")
# Figure 5.10, page 141
gf_histogram(~mpg, data = Nissan, binwidth = 1, center = .5)
gf_qq(~mpg, data = Nissan, xlab = "Normal Scores") %>%
  gf_qqline(linetype = "solid", color = "red")
```

```{r}
# Figure 5.11
gf_histogram(~weight_in_kg, data = MenWeight, xlab = "Weights", binwidth = 10, center = 5)
gf_qq(~weight_in_kg, data = MenWeight, xlab = "Normal Scores") %>%
  gf_qqline(linetype = "solid", color = "red")
```
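
The idea behind a normal probability plot can be sketched by hand: sort the data and pair each value with the *z*-score expected at that rank in a Normal sample of the same size. The example below uses base R and simulated data; the vector `y` is hypothetical and is not the `Nissan` or `MenWeight` data.

```{r}
set.seed(5)

# Hypothetical data: 50 draws from a Normal model (illustration only)
y <- rnorm(50, mean = 22, sd = 2)

# Normal scores: the z-scores expected for a sample of this size
normal_scores <- qnorm(ppoints(length(y)))

# Pairing sorted data with the normal scores gives the points of the plot
head(data.frame(normal_scores, sorted_y = sort(y)))

# Base R's qqnorm() uses the same normal scores
qq <- qqnorm(y, plot.it = FALSE)
all.equal(sort(qq$x), normal_scores)

# For roughly Normal data the points fall close to a straight line
cor(normal_scores, sort(y))
```

A correlation near 1 corresponds to points that hug the reference line; pronounced curvature in the plot suggests skewness or heavy tails.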