--- title: "IS5 in R: The Standard Deviation as a Ruler and the Normal Model (Chapter 5)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "2025-01-20" date-format: iso format: pdf toc: true editor: source --- ```{r} #| label: setup #| include: false library(mosaic) library(tidyverse) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated Quarto reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. We begin by loading packages that will be required for our analyses. ```{r} library(mosaic) library(tidyverse) ``` ## Chapter 5: The Standard Deviation as a Ruler and the Normal Model ```{r} #| message: false library(mosaic) library(readr) library(janitor) WomenHeptathlon2016 <- read_csv("http://nhorton.people.amherst.edu/is5/data/Womens_Heptathlon_2016.csv") |> janitor::clean_names() ``` By default, `read_csv()` prints the variable names. These messages were suppressed using the `message: false` code chunk option to save space and improve readability. Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or white space). ```{r} # page 123 df_stats(~ long_jump, data = WomenHeptathlon2016) df_stats(~ x200m, data = WomenHeptathlon2016) with(WomenHeptathlon2016, stem(x200m)) # the `stem()` function doesn't have a `data = ` option with(WomenHeptathlon2016, stem(long_jump)) ``` ### Section 5.1: Using the Standard Deviation to Standardize Values ```{r} filter(WomenHeptathlon2016, last_name == "Thiam") |> tibble() # calculate z-score with mean and sd from df_stats (6.58 - 6.17) / .247 # long jump filter(WomenHeptathlon2016, last_name == "Johnson-Thompson") |> tibble() ``` The `tibble()` function converts an object into a variant of a "data frame" (you may also see the use of `data.frame()` for this purpose.) Note the difference when we pipe the results of `filter()` into the `data.frame()` function. ```{r} filter(WomenHeptathlon2016, last_name == "Johnson-Thompson") |> data.frame() ``` ### Section 5.2: Shifting and Scaling #### Shifting to Adjust the Center We begin by reading in the data. ```{r} #| message: false MenWeight <- read_csv("http://nhorton.people.amherst.edu/is5/data/Mens_Weights.csv") |> janitor::clean_names() # Figure 5.2, page 125 gf_histogram(~ weight_in_kg, data = MenWeight, binwidth = 10, center = 5) |> gf_labs(x = "Weight (kg)", y = "# of Mean") gf_boxplot(~ weight_in_kg, data = MenWeight, xlab = "Weight (kg)") ``` As noted previously, a single boxplot is not a good way to display the data (boxplots are better for comparisons). 
```{r}
df_stats(~ weight_in_kg, data = MenWeight)
# Figure 5.3
gf_histogram(~ (weight_in_kg - 74), data = MenWeight, binwidth = 10) |>
  gf_labs(x = "Kg Above Recommended Weight", y = "# of Men")
```

#### Rescaling to Adjust the Scale

Let's review the data from the `MenWeight` dataset.

```{r}
#| message: false
df_stats(~ weight_in_kg, data = MenWeight)
df_stats(~ weight_in_pounds, data = MenWeight)
MenWeight |>
  head()
# There are two variables: weight_in_kg and weight_in_pounds.
# Each observation has a value for each.
nrow(MenWeight)
MenLonger <- MenWeight |>
  tidyr::pivot_longer(
    cols = starts_with("weight"),
    values_to = "weight",
    names_to = "weighttype"
  )
MenLonger |>
  head()
# The two variables are weighttype and weight.
# weighttype is a categorical variable indicating whether the weight is in kg or pounds.
nrow(MenLonger)  # Each observation from before is now two rows
```

Here we use the `tidyr::pivot_longer()` function to transform the dataset from wide to long format, which can be seen with the `head()` function. This is an important but more complicated data wrangling idiom that we will use to reshape datasets.

```{r}
MenLonger |>
  gf_boxplot(weight ~ weighttype) |>
  gf_labs(x = "Weight Type", y = "")
```

We see the use of `goal(Y ~ X)` as an example of the general modeling language for two variables in the `mosaic` package.

#### Shifting, Scaling, and *z*-Scores

### Section 5.3: Normal Models

#### The 68-95-99.7 Rule

See display on page 129.

```{r}
# Figure 5.6
# 1, 2 (1.96), and 3 SDs
xpnorm(c(-3, -1.96, -1, 1, 1.96, 3), mean = 0, sd = 1, verbose = FALSE)
# 2 (1.96) and 3 SDs
xpnorm(c(-3, -1.96, 1.96, 3), mean = 0, sd = 1, verbose = FALSE)
# 3 SDs
xpnorm(c(-3, 3), mean = 0, sd = 1, verbose = FALSE)
```

#### Example 5.4: Using the 68-95-99.7 Rule

We begin by reading in the data.

```{r}
#| message: false
BodyFat <- read_csv("http://nhorton.people.amherst.edu/is5/data/Bodyfat.csv")
gf_histogram(~ Wrist, data = BodyFat, binwidth = .5, center = -.25) |>
  gf_labs(x = "Wrist Circ (cm)", y = "# of Men")
```

#### Random Matters

Starts on page 133.

```{r}
#| message: false
Commute <- read_csv("http://nhorton.people.amherst.edu/is5/data/Population_Commute_Times.csv") |>
  janitor::clean_names()
gf_histogram(~ commute_time, data = Commute, binwidth = 10, center = 5) |>
  gf_labs(x = "Commute Times (min)", y = "# of Employees")
set.seed(2143)  # To ensure we get the same values when we run it multiple times
num_sim <- 10000  # Number of simulations
samp_size <- 100  # Desired sample size
mean(~ commute_time, data = sample(Commute, size = samp_size))  # Mean of one random sample
mean(~ commute_time, data = sample(Commute, size = samp_size))  # Mean of another random sample
```

The `mosaic::do()` command allows us to run a command multiple times, saving the result as a data frame.

```{r}
do(2) * mean(~ commute_time, data = sample(Commute, size = samp_size))
# For the visualization, we use do() 10,000 times
Commute_sample <- do(num_sim) * mean(~ commute_time, data = sample(Commute, size = samp_size))
```

The `do()` function generates 10,000 samples of size `samp_size` and for each calculates the sample mean.

```{r}
gf_histogram(~ mean, data = Commute_sample) |>
  gf_labs(x = "Means of Samples of Size 100", y = "# of Samples")
```

### Section 5.4: Working with Normal Percentiles

The `pnorm()` function calculates normal probabilities. The `xpnorm()` function from the mosaic package adds a graphical depiction and additional output that may be helpful to new users.
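As a quick sketch of the plain version, a `pnorm()` call returns just the probability (about 0.964 for a *z*-score of 1.8), while the annotated `xpnorm()` version shown next adds a plot and extra explanation.

```{r}
# sketch: P(Z < 1.8) for a standard Normal model, numeric output only
pnorm(1.8, mean = 0, sd = 1)
```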
```{r}
xpnorm(1.8, mean = 0, sd = 1)
```

The `qnorm()` function finds the inverse of normal probabilities.

```{r}
xqnorm(0.964, mean = 500, sd = 100)  # inverse of pnorm()
qnorm(0.964, mean = 0, sd = 1)  # what is the z-score?
```

See examples on pages 136-140.

### Section 5.5: Normal Probability Plots

We begin by reading in the data.

```{r}
#| message: false
Nissan <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nissan.csv")
# Figure 5.10, page 141
gf_histogram(~ mpg, data = Nissan, binwidth = 1, center = .5)
gf_qq(~ mpg, data = Nissan, xlab = "Normal Scores") |>
  gf_qqline(linetype = "solid", color = "red")
```

```{r}
# Figure 5.11
gf_histogram(~ weight_in_kg, data = MenWeight, xlab = "Weights", binwidth = 10, center = 5)
gf_qq(~ weight_in_kg, data = MenWeight, xlab = "Normal Scores") |>
  gf_qqline(linetype = "solid", color = "red")
```
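As a point of comparison (a simulated sketch, not data from the text; the dataset name `SimNormal` and the Normal parameters are arbitrary choices of ours), data actually generated from a Normal model produce a normal probability plot that falls close to the reference line.

```{r}
# sketch: simulate 200 values from a Normal model and check the plot
set.seed(1234)
SimNormal <- tibble(x = rnorm(200, mean = 70, sd = 10))
gf_qq(~ x, data = SimNormal, xlab = "Normal Scores") |>
  gf_qqline(linetype = "solid", color = "red")
```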