--- title: "IS5 in R: Understanding and Comparing Distributions (Chapter 4)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "2025-01-08" date-format: iso format: pdf toc: true editor: source --- ```{r} #| label: setup #| include: false library(mosaic) library(tidyverse) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated Quarto reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. We begin by loading packages that will be required for our analyses. ```{r} library(mosaic) library(tidyverse) ``` ## Chapter 4: Understanding and Comparing Distributions ```{r} #| message: false library(mosaic) library(readr) library(janitor) HopkinsForest <- read_csv("http://nhorton.people.amherst.edu/is5/data/Hopkins_Forest.csv") |> janitor::clean_names() names(HopkinsForest) ``` By default, `read_csv()` prints the variable names. We suppressed these using the `message = FALSE` code chunk option to save space and improve readability. Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). You can use the `names()` function to check the cleaned names. ```{r} # Figure 4.1, page 96 gf_histogram(~ avg_wind_mph, data = HopkinsForest, xlab = "Average Wind Speed (mph)", ylab = "# of Days", binwidth = 0.5, center = 0.25 ) df_stats(~ avg_wind_mph, data = HopkinsForest) # an alternative version of "favstats()" ``` ### Section 4.1: Displays for Comparing Groups #### Histograms We began by creating a new month to categorize the dates. ```{r} HopkinsForest <- HopkinsForest |> mutate(catmonth = ifelse( month <= 9 & month >= 4, "Spring/Summer", "Fall/Winter") ) ``` ```{r} # Figure 4.2, page 96 gf_histogram(~ avg_wind_mph, data = HopkinsForest, binwidth = 0.5, center = 0.25, xlab = "Average Wind Speed (mph)", ylab = "# of Days" ) |> gf_facet_wrap(~ catmonth) df_stats(avg_wind_mph ~ catmonth, data = HopkinsForest) ``` #### Example 4.1: Comparing Groups with Stem-And-Leaf We begin by reading in the data. ```{r} #| message: false # Figure 4.1, page 97 NestEgg <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nest_Egg_Index.csv") |> janitor::clean_names() with(NestEgg, stem(nest_egg_index)) # or stem(NestEgg$nest_egg_index) ``` #### Boxplots As noted in the book, boxplots are most useful to compare distributions. Below, we have replicated the single boxplot from page 98. ```{r} # Step 4 on page 98 gf_boxplot(~ avg_wind_mph, data = HopkinsForest) |> # or gf_boxplot(X ~ 1) gf_labs(y = "Daily Average Wind Speed (mph)") ``` The use of single boxplots isn't recommended. Instead, one can make comparisons more easily by placing boxplots side by side with the following code: ```{r} # Figure 4.3, page 99 gf_boxplot(avg_wind_mph ~ as.factor(month), data = HopkinsForest) |> gf_labs(x = "Month", y = "Average wind speed (mph)") ``` We use the `as.factor()` function to convert a variable into a factor. We also use `gf_labs()` to clean up the code for the first line and improve readability. Here we use the mosaic modeling language to specify the variables. The `~` symbol is used to separate the response variable from the explanatory variable. As a general form, `GOAL(Y ~ X)` carries out a specific goal for Y as a function of X. #### Example 4.2: Comparing Groups with Boxplots We begin by reading in the data. ```{r} #| message: false # Example 4.2, page 99 Coasters <- read_csv("http://nhorton.people.amherst.edu/is5/data/Coasters_2015.csv") gf_boxplot(Speed ~ Track, data = Coasters) ``` #### Step-By-Step Example: Comparing Groups We begin by reading in the data. ```{r} #| message: false Cups <- read_csv("http://nhorton.people.amherst.edu/is5/data/Cups.csv") df_stats(Difference ~ Container, data = Cups) # Mechanics, page 101 gf_boxplot(Difference ~ Container, data = Cups, ylab = "Temp Change in F") ``` #### Just Checking We begin by reading in the data. ```{r} #| warning: false #| message: false Flights <- read_csv("http://nhorton.people.amherst.edu/is5/data/Flights_on_time_2016.csv") |> janitor::clean_names() # Let's improve the ordering of the months (by default they are alphabetical!) Flights <- Flights |> mutate(month = forcats::fct_relevel( month, "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December" ) ) ``` Here we use the `fct_relevel()` function from the `forcats` package to reorder the months in the dataset (the default is that the months are ordered alphabetically, which isn't very helpful). This function is a very useful idiom to remember when you want to reorder factor levels. ```{r} #| warning: false #| message: false #| fig.width: 7 # Bureau of Transportation Statistics, page 101 gf_histogram(~ ontime_pct, data = Flights, binwidth = 2, center = 1) |> gf_labs(x = "Ontime %", y = "Number of Months") gf_boxplot(~ ontime_pct, data = Flights) gf_boxplot(ontime_pct ~ month, data = Flights) # now they are in order! ``` #### Random Matters We begin by reading in the data. ```{r} #| message: false # Figure 4.4, page 102 CarSpeeds <- read_csv("http://nhorton.people.amherst.edu/is5/data/Car_speeds.csv") gf_boxplot(speed ~ direction, data = CarSpeeds) ``` ### Section 4.3: Re-Expressing Data: A First Look #### Re-Expressing to Improve Symmetry We begin by reading in the data. ```{r} #| message: false CEOComp <- read_csv("http://nhorton.people.amherst.edu/is5/data/CEO_Compensation_2014.csv") |> janitor::clean_names() ``` ```{r} # Figure 4.6, page 105 gf_histogram(~ ceo_compensation_m, data = CEOComp, binwidth = 2.5, center = 2.5 / 2) |> gf_labs(x = "Compensation (M$)", y = "Millions of $") gf_boxplot(~ ceo_compensation_m, data = CEOComp) |> gf_labs(x = "Compensation (M$)", y = "Millions of $") # Figure 4.7, page 106 gf_histogram(~ log(ceo_compensation_m), data = CEOComp, binwidth = 0.224, center = 0.112) |> gf_labs(x = "Log (compensation)", y = "# of CEOs") ``` Here we needed to pick magic numbers for the binwidth (e.g., 0.224) and centering of the histogram so that it matched the results from the book. #### Re-Expression to Equalize Spread Across Groups We begin by reading in the data. ```{r} #| message: false PassiveSmoke <- read_csv("http://nhorton.people.amherst.edu/is5/data/Passive_smoke.csv") ``` ```{r} # Figure 4.8, page 107 gf_boxplot(cotinine ~ smoke_exposure, data = PassiveSmoke) |> gf_labs(x = "Smoke Exposure", y = "Cotinine (ng/ml)") # Figure 4.9 gf_boxplot(log(cotinine) ~ smoke_exposure, data = PassiveSmoke) |> gf_labs(x = "Smoke Exposure", y = "Log(cotinine)") ```