--- title: "IS5 in R: The Standard Deviation as a Ruler and the Normal Model (Chapter 5)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "2025-01-20" date-format: iso format: pdf toc: true editor: source --- ```{r} #| label: setup #| include: false library(mosaic) library(tidyverse) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated Quarto reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. We begin by loading packages that will be required for our analyses. ```{r} library(mosaic) library(tidyverse) ``` ## Chapter 5: The Standard Deviation as a Ruler and the Normal Model ```{r} #| message: false library(mosaic) library(readr) library(janitor) WomenHeptathlon2016 <- read_csv("http://nhorton.people.amherst.edu/is5/data/Womens_Heptathlon_2016.csv") |> janitor::clean_names() ``` By default, `read_csv()` prints the variable names. These messages were suppressed using the `message: false` code chunk option to save space and improve readability. Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or white space). ```{r} # page 123 df_stats(~ long_jump, data = WomenHeptathlon2016) df_stats(~ x200m, data = WomenHeptathlon2016) with(WomenHeptathlon2016, stem(x200m)) # the `stem()` function doesn't have a `data = ` option with(WomenHeptathlon2016, stem(long_jump)) ``` ### Section 5.1: Using the Standard Deviation to Standardize Values ```{r} filter(WomenHeptathlon2016, last_name == "Thiam") |> tibble() # calculate z-score with mean and sd from df_stats (6.58 - 6.17) / .247 # long jump filter(WomenHeptathlon2016, last_name == "Johnson-Thompson") |> tibble() ``` The `tibble()` function converts an object into a variant of a "data frame" (you may also see the use of `data.frame()` for this purpose.) Note the difference when we pipe the results of `filter()` into the `data.frame()` function. ```{r} filter(WomenHeptathlon2016, last_name == "Johnson-Thompson") |> data.frame() ``` ### Section 5.2: Shifting and Scaling #### Shifting to Adjust the Center We begin by reading in the data. ```{r} #| message: false MenWeight <- read_csv("http://nhorton.people.amherst.edu/is5/data/Mens_Weights.csv") |> janitor::clean_names() # Figure 5.2, page 125 gf_histogram(~ weight_in_kg, data = MenWeight, binwidth = 10, center = 5) |> gf_labs(x = "Weight (kg)", y = "# of Mean") gf_boxplot(~ weight_in_kg, data = MenWeight, xlab = "Weight (kg)") ``` As noted previously, a single boxplot is not a good way to display the data (boxplots are better for comparisons). 
```{r}
df_stats(~ weight_in_kg, data = MenWeight)
# Figure 5.3
gf_histogram(~ (weight_in_kg - 74), data = MenWeight, binwidth = 10) |>
  gf_labs(x = "Kg Above Recommended Weight", y = "# of Men")
```

#### Rescaling to Adjust the Scale

Let's review the data from the `MenWeight` dataset.

```{r}
#| message: false
df_stats(~ weight_in_kg, data = MenWeight)
df_stats(~ weight_in_pounds, data = MenWeight)
MenWeight |>
  head()
# There are two variables: weight_in_kg and weight_in_pounds.
# Each observation has a value for each.
nrow(MenWeight)
MenLonger <- MenWeight |>
  tidyr::pivot_longer(
    cols = starts_with("weight"),
    values_to = "weight",
    names_to = "weighttype"
  )
MenLonger |>
  head()
# The two variables are weighttype and weight.
# weighttype is a categorical variable indicating whether the weight is in kg or pounds.
nrow(MenLonger)  # Each observation from before is now two rows
```

Here we use the `tidyr::pivot_longer()` function to transform the dataset from wide to long format, which can be seen with the `head()` function. This is an important but more complicated data wrangling idiom that we will use to reshape datasets.

```{r}
MenLonger |>
  gf_boxplot(weight ~ weighttype) |>
  gf_labs(x = "Weight Type", y = "")
```

We see the use of `goal(Y ~ X)` as an example of the general modeling language for two variables in the `mosaic` package.

#### Shifting, Scaling, and *z*-Scores

### Section 5.3: Normal Models

#### The 68-95-99.7 Rule

See display on page 129.

```{r}
# Figure 5.6
# 1, 2 (1.96), and 3 SDs
xpnorm(c(-3, -1.96, -1, 1, 1.96, 3), mean = 0, sd = 1, verbose = FALSE)
# 2 (1.96) and 3 SDs
xpnorm(c(-3, -1.96, 1.96, 3), mean = 0, sd = 1, verbose = FALSE)
# 3 SDs
xpnorm(c(-3, 3), mean = 0, sd = 1, verbose = FALSE)
```

#### Example 5.4: Using the 68-95-99.7 Rule

We begin by reading in the data.

```{r}
#| message: false
BodyFat <- read_csv("http://nhorton.people.amherst.edu/is5/data/Bodyfat.csv")
gf_histogram(~ Wrist, data = BodyFat, binwidth = .5, center = -.25) |>
  gf_labs(x = "Wrist Circ (cm)", y = "# of Men")
```

#### Random Matters

Starts on page 133.

```{r}
#| message: false
Commute <- read_csv("http://nhorton.people.amherst.edu/is5/data/Population_Commute_Times.csv") |>
  janitor::clean_names()
gf_histogram(~ commute_time, data = Commute, binwidth = 10, center = 5) |>
  gf_labs(x = "Commute Times (min)", y = "# of Employees")
set.seed(2143)  # To ensure we get the same values when we run it multiple times
num_sim <- 10000  # Number of simulations
samp_size <- 100  # Desired sample size
mean(~ commute_time, data = sample(Commute, size = samp_size))  # Mean of one random sample
mean(~ commute_time, data = sample(Commute, size = samp_size))  # Mean of another random sample
```

The `mosaic::do()` command allows us to run a command multiple times, saving the result as a data frame.

```{r}
do(2) * mean(~ commute_time, data = sample(Commute, size = samp_size))
# For the visualization, we use do() 10,000 times
Commute_sample <- do(num_sim) * mean(~ commute_time, data = sample(Commute, size = samp_size))
```

The `do()` function generates 10,000 samples of size `samp_size` and for each calculates the sample mean.

```{r}
gf_histogram(~ mean, data = Commute_sample) |>
  gf_labs(x = "Means of Samples of Size 100", y = "# of Samples")
```

### Section 5.4: Working with Normal Percentiles

The `pnorm()` function calculates normal probabilities. The `xpnorm()` function from the mosaic package adds a graphical depiction and additional output that may be helpful to new users.
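As a quick sketch of the plain version, a `pnorm()` call returns just the probability (about 0.964 for a *z*-score of 1.8), while the annotated `xpnorm()` version shown next adds a plot and extra explanation.

```{r}
# sketch: P(Z < 1.8) for a standard Normal model, numeric output only
pnorm(1.8, mean = 0, sd = 1)
```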
```{r}
xpnorm(1.8, mean = 0, sd = 1)
```

The `qnorm()` function finds the inverse of normal probabilities.

```{r}
xqnorm(0.964, mean = 500, sd = 100)  # inverse of pnorm()
qnorm(0.964, mean = 0, sd = 1)  # what is the z-score?
```

See examples on pages 136-140.

### Section 5.5: Normal Probability Plots

We begin by reading in the data.

```{r}
#| message: false
Nissan <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nissan.csv")
# Figure 5.10, page 141
gf_histogram(~ mpg, data = Nissan, binwidth = 1, center = .5)
gf_qq(~ mpg, data = Nissan, xlab = "Normal Scores") |>
  gf_qqline(linetype = "solid", color = "red")
```

```{r}
# Figure 5.11
gf_histogram(~ weight_in_kg, data = MenWeight, xlab = "Weights", binwidth = 10, center = 5)
gf_qq(~ weight_in_kg, data = MenWeight, xlab = "Normal Scores") |>
  gf_qqline(linetype = "solid", color = "red")
```
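As a point of comparison (a simulated sketch, not data from the text; the dataset name `SimNormal` and the Normal parameters are arbitrary choices of ours), data actually generated from a Normal model produce a normal probability plot that falls close to the reference line.

```{r}
# sketch: simulate 200 values from a Normal model and check the plot
set.seed(1234)
SimNormal <- tibble(x = rnorm(200, mean = 70, sd = 10))
gf_qq(~ x, data = SimNormal, xlab = "Normal Scores") |>
  gf_qqline(linetype = "solid", color = "red")
```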