--- title: "IS5 in R: The Standard Deviation as a Ruler and the Normal Model (Chapter 5)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "December 19, 2020" output: pdf_document: fig_height: 3 fig_width: 5 html_document: fig_height: 3 fig_width: 5 word_document: fig_height: 4 fig_width: 6 --- ```{r, include = FALSE} # Don't delete this chunk if you are using the mosaic package # This loads the mosaic and dplyr packages library(mosaic) library(readr) library(janitor) ``` ```{r, include = FALSE} # knitr settings to control how R chunks work. require(knitr) opts_chunk$set( tidy = FALSE, # display code as typed size = "small" # slightly smaller font for code ) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. ## Chapter 5: The Standard Deviation as a Ruler and the Normal Model ```{r message = FALSE} library(mosaic) library(readr) library(janitor) WomenHeptathlon2016 <- read_csv("http://nhorton.people.amherst.edu/is5/data/Womens_Heptathlon_2016.csv") %>% janitor::clean_names() ``` By default, `read_csv()` prints the variable names. These messages were suppressed using the `message = FALSE` code chunk option to save space and improve readability. Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). ```{r} # page 123 df_stats(~long_jump, data = WomenHeptathlon2016) df_stats(~x200m, data = WomenHeptathlon2016) with(WomenHeptathlon2016, stem(x200m)) with(WomenHeptathlon2016, stem(long_jump)) ``` ### Section 5.1: Using the Standard Deviation to Standardize Values ```{r} filter(WomenHeptathlon2016, last_name == "Thiam") %>% tibble() # calculate z-score with mean and sd from df_stats (6.58 - 6.17) / .247 # long jump filter(WomenHeptathlon2016, last_name == "Johnson-Thompson") %>% tibble() ``` The `tibble()` function converts an object into a data frame (you may also see the use of `data.frame()` for this purpose.) ### Section 5.2: Shifting and Scaling #### Shifting to Adjust the Center We begin by reading in the data. ```{r message = FALSE} MenWeight <- read_csv("http://nhorton.people.amherst.edu/is5/data/Mens_Weights.csv") %>% janitor::clean_names() # Figure 5.2, page 125 gf_histogram(~weight_in_kg, data = MenWeight, binwidth = 10, center = 5) %>% gf_labs(x = "Weight (kg)", y = "# of Mean") gf_boxplot(~weight_in_kg, data = MenWeight, xlab = "Weight (kg)") ``` ```{r} df_stats(~weight_in_kg, data = MenWeight) # Figure 5.3 gf_histogram(~ (weight_in_kg - 74), data = MenWeight, binwidth = 10) %>% gf_labs(x = "Kg Above Recommended Weight", y = "# of Men") ``` #### Rescaling to Adjust the Scale Let's review the data from the `MenWeight` dataset. ```{r message=FALSE} df_stats(~weight_in_kg, data = MenWeight) df_stats(~weight_in_pounds, data = MenWeight) library(tidyr) # for gather() function # What does gather() do? MenWeight %>% head() # There are two variables: weight_in_kg and weight_in_pounds. # Each observation has a value for each. nrow(MenWeight) MenLonger <- MenWeight %>% pivot_longer(cols = starts_with("weight"), values_to = "weight", names_to = "weighttype") MenLonger %>% head() # The two variables are weighttype and weight. weighttype is a categorical variable that is either in kg or pounds nrow(MenLonger) # Each observation from before is now two rows ``` Here we use the `tidyr::pivot_wider()` function to transform the dataset into the needed format, which can be seen with the `head()` function. ```{r} MenLonger %>% gf_boxplot(weight ~ weighttype) %>% gf_labs(x = "Weight Type", y = "") ``` We see the use of `goal(Y ~ X)` as an example of the general modeling language for two variables in the `mosaic` package. #### Shifting, Scaling, and the *z*-Scores ### Section 5.3: Normal Models #### The 68-95-99.7 Rule See display on page 129. ```{r} # Figure 5.6 # 1, 2 (1.96), and 3 SD's xpnorm(c(-3, -1.96, -1, 1, 1.96, 3), mean = 0, sd = 1, verbose = FALSE) # 2 (1.96) and 3 SD's xpnorm(c(-3, -1.96, 1.96, 3), mean = 0, sd = 1, verbose = FALSE) # 3 SD's xpnorm(c(-3, 3), mean = 0, sd = 1, verbose = FALSE) ``` #### Example 5.4: Using the 68-95-99.7 Rule We begin by reading in the data. ```{r message = FALSE} BodyFat <- read_csv("http://nhorton.people.amherst.edu/is5/data/Bodyfat.csv") gf_histogram(~Wrist, data = BodyFat, binwidth = .5, center = -.25 ) %>% gf_labs(x = "Wrist Circ (cm)", y = "# of Men") ``` #### Random Matters Starts on page 133. ```{r message = FALSE} Commute <- read_csv("http://nhorton.people.amherst.edu/is5/data/Population_Commute_Times.csv") %>% janitor::clean_names() gf_histogram(~commute_time, data = Commute, binwidth = 10, center = 5) %>% gf_labs(x = "Commute Times (min)", y = "# of Employees") set.seed(2143) # To ensure we get the same values when we run it multiple times numsim <- 10000 # Number of simulations mean(~commute_time, data = sample(Commute, size = 100)) # Mean of one random sample mean(~commute_time, data = sample(Commute, size = 100)) # Mean of another random sample ``` The `mosaic::do()` command allows us to run a command multiple times, saving the result as a data frame. ```{r} do(2) * mean(~commute_time, data = sample(Commute, size = 100)) # For the visualization, we use do() 10,000 times Commute_sample <- do(numsim) * mean(~commute_time, data = sample(Commute, size = 100)) ``` The `do()` function generates 10,000 samples of size 100 and for each calculates the sample mean. ```{r} gf_histogram(~mean, data = Commute_sample) %>% gf_labs(x = "Means of Samples of Size 100", y = "# of Samples") ``` ### Section 5.4: Working with Normal Percentiles The `pnorm()` function calculates normal probabilities. The `xpnorm()` function from the mosaic package adds a graphical depiction and additional output that may be helpful to new users. ```{r} xpnorm(1.8, mean = 0, sd = 1) ``` The `qnorm()` function finds the inverse of normal probabilities. ```{r} xqnorm(0.964, mean = 500, sd = 100) # inverse of pnorm() qnorm(0.964, mean = 0, sd = 1) # what is the z-score? ``` See examples on pages 136-140. ### Section 5.5: Normal Probability Plots We begin by reading in the data. ```{r message = FALSE} Nissan <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nissan.csv") # Figure 5.10, page 141 gf_histogram(~mpg, data = Nissan, binwidth = 1, center = .5) gf_qq(~mpg, data = Nissan, xlab = "Normal Scores") %>% gf_qqline(linetype = "solid", color = "red") ``` ```{r} # Figure 5.11 gf_histogram(~weight_in_kg, data = MenWeight, xlab = "Weights", binwidth = 10, center = 5) gf_qq(~weight_in_kg, data = MenWeight, xlab = "Normal Scores") %>% gf_qqline(linetype = "solid", color = "red") ```