---
title: "IS5 in R: Understanding and Comparing Distributions (Chapter 4)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "2025-01-08"
date-format: iso
format: pdf
toc: true
editor: source
---

```{r}
#| label: setup
#| include: false
library(mosaic)   
library(tidyverse)
```


## Introduction and background 

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock.
This file as well as the associated Quarto reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.
In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses.
A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

We begin by loading packages that will be required for our analyses.

```{r}
library(mosaic)
library(tidyverse)
```


## Chapter 4: Understanding and Comparing Distributions

```{r}
#| message: false
library(mosaic)
library(readr)
library(janitor)
HopkinsForest <- 
  read_csv("http://nhorton.people.amherst.edu/is5/data/Hopkins_Forest.csv") |>
  janitor::clean_names()
names(HopkinsForest)
```

By default, `read_csv()` prints the variable names.
We suppressed these using the `message = FALSE` code chunk option to save space and improve readability.
Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). You can use the `names()` function to check the cleaned names.

```{r}
# Figure 4.1, page 96
gf_histogram(~ avg_wind_mph,
  data = HopkinsForest,
  xlab = "Average Wind Speed (mph)", 
  ylab = "# of Days", 
  binwidth = 0.5, 
  center = 0.25
)
df_stats(~ avg_wind_mph, data = HopkinsForest)  # an alternative version of "favstats()"
```

### Section 4.1: Displays for Comparing Groups

#### Histograms

We began by creating a new month to categorize the dates.

```{r}
HopkinsForest <- HopkinsForest |>
  mutate(catmonth = ifelse(
    month <= 9 & month >= 4, 
    "Spring/Summer", 
    "Fall/Winter")
  )
```

```{r}
# Figure 4.2, page 96
gf_histogram(~ avg_wind_mph,
  data = HopkinsForest, binwidth = 0.5, center = 0.25,
  xlab = "Average Wind Speed (mph)", ylab = "# of Days"
) |>
  gf_facet_wrap(~ catmonth)
df_stats(avg_wind_mph ~ catmonth, data = HopkinsForest)
```

#### Example 4.1: Comparing Groups with Stem-And-Leaf

We begin by reading in the data.

```{r}
#| message: false
# Figure 4.1, page 97
NestEgg <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nest_Egg_Index.csv") |>
  janitor::clean_names()
with(NestEgg, stem(nest_egg_index)) # or stem(NestEgg$nest_egg_index)
```

#### Boxplots

As noted in the book, boxplots are most useful to compare distributions.
Below, we have replicated the single boxplot from page 98.

```{r}
# Step 4 on page 98
gf_boxplot(~ avg_wind_mph, data = HopkinsForest) |> # or gf_boxplot(X ~ 1)
  gf_labs(y = "Daily Average Wind Speed (mph)") 
```

The use of single boxplots isn't recommended.
Instead, one can make comparisons more easily by placing boxplots side by side with the following code: 

```{r}
# Figure 4.3, page 99
gf_boxplot(avg_wind_mph ~ as.factor(month), data = HopkinsForest) |>
  gf_labs(x = "Month", y = "Average wind speed (mph)")
```

We use the `as.factor()` function to convert a variable into a factor.  
  
We also use `gf_labs()` to clean up the code for the first line and improve readability.  

Here we use the mosaic modeling language to specify the variables.
The `~` symbol is used to separate the response variable from the explanatory variable.

As a general form, `GOAL(Y ~ X)` carries out a specific goal for Y as a function of X.

#### Example 4.2: Comparing Groups with Boxplots

We begin by reading in the data.

```{r}
#| message: false
# Example 4.2, page 99
Coasters <- read_csv("http://nhorton.people.amherst.edu/is5/data/Coasters_2015.csv")
gf_boxplot(Speed ~ Track, data = Coasters)
```

#### Step-By-Step Example: Comparing Groups

We begin by reading in the data.

```{r}
#| message: false
Cups <- read_csv("http://nhorton.people.amherst.edu/is5/data/Cups.csv")
df_stats(Difference ~ Container, data = Cups)
# Mechanics, page 101
gf_boxplot(Difference ~ Container, data = Cups, ylab = "Temp Change in F")
```

#### Just Checking

We begin by reading in the data.

```{r}
#| warning: false
#| message: false
Flights <- 
  read_csv("http://nhorton.people.amherst.edu/is5/data/Flights_on_time_2016.csv") |>
  janitor::clean_names()
# Let's improve the ordering of the months (by default they are alphabetical!)
Flights <- Flights |>
  mutate(month = forcats::fct_relevel(
    month,
    "January", "February", "March", "April",
    "May", "June", "July", "August",
    "September", "October", "November", "December"
    )
  )
```

Here we use the `fct_relevel()` function from the `forcats` package to reorder the months in the dataset (the default is that the months are ordered alphabetically, which isn't very helpful).
This function is a very useful idiom to remember when you want to reorder factor levels.

```{r}
#| warning: false
#| message: false
#| fig.width: 7
# Bureau of Transportation Statistics, page 101
gf_histogram(~ ontime_pct, data = Flights, binwidth = 2, center = 1) |>
  gf_labs(x = "Ontime %", y = "Number of Months")
gf_boxplot(~ ontime_pct, data = Flights)
gf_boxplot(ontime_pct ~ month, data = Flights)   # now they are in order!
```


#### Random Matters

We begin by reading in the data.

```{r}
#| message: false
# Figure 4.4, page 102
CarSpeeds <- read_csv("http://nhorton.people.amherst.edu/is5/data/Car_speeds.csv")
gf_boxplot(speed ~ direction, data = CarSpeeds)
```

### Section 4.3: Re-Expressing Data: A First Look

#### Re-Expressing to Improve Symmetry

We begin by reading in the data.

```{r}
#| message: false
CEOComp <- 
  read_csv("http://nhorton.people.amherst.edu/is5/data/CEO_Compensation_2014.csv") |>
  janitor::clean_names()
```

```{r}
# Figure 4.6, page 105
gf_histogram(~ ceo_compensation_m, data = CEOComp, binwidth = 2.5, center = 2.5 / 2) |>
  gf_labs(x = "Compensation (M$)", y = "Millions of $")
gf_boxplot(~ ceo_compensation_m, data = CEOComp) |>
  gf_labs(x = "Compensation (M$)", y = "Millions of $")

# Figure 4.7, page 106
gf_histogram(~ log(ceo_compensation_m), data = CEOComp, binwidth = 0.224, center = 0.112) |>
  gf_labs(x = "Log (compensation)", y = "# of CEOs")
```

Here we needed to pick magic numbers for the binwidth (e.g., 0.224) and centering of the histogram so that it matched the results from the book.

#### Re-Expression to Equalize Spread Across Groups

We begin by reading in the data.

```{r}
#| message: false
PassiveSmoke <- read_csv("http://nhorton.people.amherst.edu/is5/data/Passive_smoke.csv")
```

```{r}
# Figure 4.8, page 107
gf_boxplot(cotinine ~ smoke_exposure, data = PassiveSmoke) |>
  gf_labs(x = "Smoke Exposure", y = "Cotinine (ng/ml)")
# Figure 4.9
gf_boxplot(log(cotinine) ~ smoke_exposure, data = PassiveSmoke) |>
  gf_labs(x = "Smoke Exposure", y = "Log(cotinine)")
```