---
title: "IS5 in R: Understanding and Comparing Distributions (Chapter 4)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "January 10, 2021"
output: 
  pdf_document:
    fig_height: 3
    fig_width: 5
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---


```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads mosaic (which also attaches dplyr), readr, and janitor
library(mosaic)
library(readr)
library(janitor)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
  tidy = FALSE, # display code as typed
  size = "small" # slightly smaller font for code
)
```

## Introduction and background 

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock.
This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.
In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses.
A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

## Chapter 4: Understanding and Comparing Distributions

```{r message = FALSE}
library(mosaic)
library(readr)
library(janitor)
HopkinsForest <- read_csv("http://nhorton.people.amherst.edu/is5/data/Hopkins_Forest.csv") %>%
  janitor::clean_names()
names(HopkinsForest)
```

By default, `read_csv()` prints a message listing the column specification (the name and type of each variable).
We suppressed this message using the `message = FALSE` code chunk option to save space and improve readability.
Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). You can use the `names()` function to check the cleaned names.
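
As a minimal sketch of what `clean_names()` does, the chunk below builds a small data frame whose column names are made up for illustration (they are not the actual headers of the Hopkins Forest file) and compares the names before and after cleaning.

```{r message = FALSE}
library(janitor)

# Hypothetical column names containing spaces and parentheses
messy <- data.frame(
  `Avg Wind (mph)` = c(1.2, 3.4),
  `Max Temp (F)` = c(60, 72),
  check.names = FALSE # keep the names exactly as typed
)
names(messy)

# clean_names() rewrites the names in snake_case by default
cleaned <- clean_names(messy)
names(cleaned)
```

The cleaned names contain only lowercase letters, digits, and underscores, so they can be typed in formulas without backticks.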

```{r}
# Figure 4.1, page 96
gf_histogram(~avg_wind_mph,
  data = HopkinsForest,
  xlab = "Average Wind Speed (mph)", ylab = "# of Days", binwidth = 0.5, center = 0.25
)
df_stats(~avg_wind_mph, data = HopkinsForest)   # an improved version of "favstats()"
```
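
The `binwidth` and `center` arguments jointly determine where the histogram bins fall: `center` fixes the midpoint of one bin, and the other bins are laid out in steps of `binwidth` on either side. As a rough sketch using base R only (the variable names here are ours), the bin edges implied by `binwidth = 0.5` and `center = 0.25` can be computed as follows.

```{r}
binwidth <- 0.5
center <- 0.25

# The bin containing `center` runs from center - binwidth / 2 to
# center + binwidth / 2; the remaining edges follow in steps of `binwidth`.
first_edge <- center - binwidth / 2
edges <- first_edge + binwidth * 0:4
edges
```

With these settings the bins start at zero and break at multiples of 0.5 mph, so no bin straddles zero.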

### Section 4.1: Displays for Comparing Groups

#### Histograms

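We begin by creating a new variable, `catmonth`, that classifies each month as "Spring/Summer" (April through September) or "Fall/Winter" (all other months).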

```{r}
HopkinsForest <- HopkinsForest %>%
  mutate(catmonth = ifelse(month <= 9 & month >= 4, "Spring/Summer", "Fall/Winter"))
```
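
To see how the `ifelse()` condition assigns each month, here is a small self-contained check on the month numbers 1 through 12 (assuming, as in the code above, that `month` is stored as a number).

```{r}
month <- 1:12
catmonth <- ifelse(month <= 9 & month >= 4, "Spring/Summer", "Fall/Winter")

# Pair each month number with its category
data.frame(month, catmonth)
table(catmonth)
```

Months 4 through 9 are labeled Spring/Summer and the remaining six are labeled Fall/Winter.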

```{r}
# Figure 4.2, page 96
gf_histogram(~avg_wind_mph,
  data = HopkinsForest, binwidth = 0.5, center = 0.25,
  xlab = "Average Wind Speed (mph)", ylab = "# of Days"
) %>%
  gf_facet_wrap(~catmonth)
df_stats(avg_wind_mph ~ catmonth, data = HopkinsForest)
```

#### Example 4.1: Comparing Groups with Stem-And-Leaf

We begin by reading in the data.

```{r message = FALSE}
# Example 4.1, page 97
NestEgg <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nest_Egg_Index.csv") %>%
  janitor::clean_names()
with(NestEgg, stem(nest_egg_index))
```

#### Boxplots

As noted in the book, boxplots are most useful for comparing distributions across groups.
Below, we have replicated the single boxplot from page 98.

```{r}
# Step 4 on page 98
gf_boxplot(~avg_wind_mph, data = HopkinsForest) %>% # or gf_boxplot(X ~ 1)
  gf_labs(x = "Daily Average Wind Speed (mph)")
```

We don't recommend using a single boxplot on its own.
Instead, comparisons are easier when boxplots are placed side by side, as in the following code:

```{r}
# Figure 4.3, page 99
gf_boxplot(avg_wind_mph ~ as.factor(month), data = HopkinsForest) %>%
  gf_labs(x = "Month", y = "Average wind speed (mph)")
```

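We use the `as.factor()` function to convert `month` from a number into a factor, so that `gf_boxplot()` draws one box per month rather than treating month as a single continuous variable.

We also use `gf_labs()` to set the axis labels in a separate step, which keeps the first line of code short and improves readability.

Here we use the mosaic modeling language to specify the variables. As a general form, `GOAL(Y ~ X)` carries out a specific goal for Y as a function of X.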
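
The same `GOAL(Y ~ X)` template works for numerical summaries as well as plots. The sketch below uses the built-in `mtcars` data (not one of the book's datasets) purely for illustration: `mpg` plays the role of Y and `cyl`, a numeric variable with three distinct values, plays the role of X.

```{r message = FALSE}
library(mosaic)

# cyl is stored as a number, so convert it to a factor for grouping in plots
is.numeric(mtcars$cyl)
levels(as.factor(mtcars$cyl))

# Same formula, different goals
group_means <- mean(mpg ~ cyl, data = mtcars) # mean of mpg within each cyl group
group_means
df_stats(mpg ~ cyl, data = mtcars)
gf_boxplot(mpg ~ as.factor(cyl), data = mtcars) %>%
  gf_labs(x = "Cylinders", y = "Miles per gallon")
```

Only the goal changes from line to line; the formula and the `data` argument stay the same.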

#### Example 4.2: Comparing Groups with Boxplots

We begin by reading in the data.

```{r message = FALSE}
# Example 4.2, page 99
Coasters <- read_csv("http://nhorton.people.amherst.edu/is5/data/Coasters_2015.csv")
gf_boxplot(Speed ~ Track, data = Coasters)
```

#### Step-By-Step Example: Comparing Groups

We begin by reading in the data.

```{r message = FALSE}
Cups <- read_csv("http://nhorton.people.amherst.edu/is5/data/Cups.csv")
df_stats(Difference ~ Container, data = Cups)
# Mechanics, page 101
gf_boxplot(Difference ~ Container, data = Cups, ylab = "Temp Change in F")
```

#### Just Checking

We begin by reading in the data.

```{r, warning = FALSE, message = FALSE, fig.width = 7}
Flights <- read_csv("http://nhorton.people.amherst.edu/is5/data/Flights_on_time_2016.csv") %>%
  janitor::clean_names()
# Let's improve the ordering of the months (by default they are alphabetical!)
Flights <- Flights %>%
  mutate(month = forcats::fct_relevel(
    month,
    "January", "February", "March", "April",
    "May", "June", "July", "August",
    "September", "October", "November", "December"
    )
  )
# Bureau of Transportation Statistics, page 101
gf_histogram(~ontime_pct, data = Flights, binwidth = 2, center = 1) %>%
  gf_labs(x = "On-time %", y = "Number of Months")
gf_boxplot(~ontime_pct, data = Flights)
gf_boxplot(ontime_pct ~ month, data = Flights)   # now they are in order!
```
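
The reordering step matters because R sorts factor levels alphabetically unless told otherwise. A minimal sketch with a made-up three-month vector shows the effect of `fct_relevel()`; base R's built-in `month.name` constant offers an alternative way to get calendar order.

```{r}
library(forcats)

months <- factor(c("March", "January", "February"))
levels(months) # alphabetical by default

# Move the listed levels to the front, in the order given
months_ordered <- fct_relevel(months, "January", "February", "March")
levels(months_ordered)

# Alternative: supply all twelve month names as the levels
months_calendar <- factor(as.character(months), levels = month.name)
levels(months_calendar)[1:3]
```

Because plotting functions follow the level order, fixing the levels once is enough to put the side-by-side boxplots in calendar order.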

#### Random Matters

We begin by reading in the data.

```{r message = FALSE}
# Figure 4.4, page 102
CarSpeeds <- read_csv("http://nhorton.people.amherst.edu/is5/data/Car_speeds.csv")
gf_boxplot(speed ~ direction, data = CarSpeeds)
```

### Section 4.3: Re-Expressing Data: A First Look

#### Re-Expressing to Improve Symmetry

We begin by reading in the data.

```{r message = FALSE}
CEOComp <- read_csv("http://nhorton.people.amherst.edu/is5/data/CEO_Compensation_2014.csv") %>%
  janitor::clean_names()
```

```{r}
# Figure 4.6, page 105
gf_histogram(~ceo_compensation_m, data = CEOComp, binwidth = 2.5, center = 2.5 / 2) %>%
  gf_labs(x = "Compensation (M$)", y = "# of CEOs")
gf_boxplot(~ceo_compensation_m, data = CEOComp) %>%
  gf_labs(x = "Compensation (M$)", y = "")

# Figure 4.7, page 106
gf_histogram(~ log(ceo_compensation_m), data = CEOComp, binwidth = .224, center = .112) %>%
  gf_labs(x = "Log (compensation)", y = "# of CEOs")
```
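
To illustrate why a log re-expression can improve symmetry, here is a sketch using simulated right-skewed values (drawn from a log-normal distribution, not the CEO data). The helper `skew_gap()` is our own rough indicator, not a mosaic function: it measures how far the mean sits above the median, in standard deviation units.

```{r}
set.seed(2021)
comp <- rlnorm(500, meanlog = 2, sdlog = 1) # simulated, right-skewed values

# Rough skewness indicator: (mean - median) / sd; near 0 for symmetric data
skew_gap <- function(x) (mean(x) - median(x)) / sd(x)

skew_gap(comp)      # raw scale
skew_gap(log(comp)) # log scale
```

On the raw scale the long right tail pulls the mean well above the median; after taking logs the two are much closer.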

#### Re-Expression to Equalize Spread Across Groups

We begin by reading in the data.

```{r message = FALSE}
PassiveSmoke <- read_csv("http://nhorton.people.amherst.edu/is5/data/Passive_smoke.csv")
```

```{r}
# Figure 4.8, page 107
gf_boxplot(cotinine ~ smoke_exposure, data = PassiveSmoke) %>%
  gf_labs(x = "Smoke Exposure", y = "Cotinine (ng/ml)")
# Figure 4.9
gf_boxplot(log(cotinine) ~ smoke_exposure, data = PassiveSmoke) %>%
  gf_labs(x = "Smoke Exposure", y = "Log(cotinine)")
```
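
The idea behind this re-expression can be sketched with simulated data (two hypothetical groups, not the cotinine measurements): when spread grows with the typical level, taking logs tends to make the group spreads more comparable. Here we compare the interquartile range of each group before and after the transformation.

```{r}
set.seed(2021)
sim <- data.frame(
  group = rep(c("low", "high"), each = 200),
  y = c(
    rlnorm(200, meanlog = 0, sdlog = 1), # small values, small spread
    rlnorm(200, meanlog = 3, sdlog = 1)  # large values, large spread
  )
)

raw_iqr <- tapply(sim$y, sim$group, IQR)
log_iqr <- tapply(log(sim$y), sim$group, IQR)

# Ratio of the larger IQR to the smaller one, on each scale
max(raw_iqr) / min(raw_iqr)
max(log_iqr) / min(log_iqr)
```

A ratio near 1 on the log scale indicates that the two groups have similar spread there, which makes side-by-side boxplots easier to compare.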