---
title: "IS5 in R: Displaying and Describing Data (Chapter 2)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "2025-01-20"
date-format: iso
format: pdf
toc: true
editor: source
---

```{r}
#| label: setup
#| include: false
library(mosaic)   
library(tidyverse)
```


## Introduction and background 

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock.
This file as well as the associated Quarto reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.
In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses.
A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

We begin by loading packages that will be required for our analyses.

```{r}
library(mosaic)
library(tidyverse)
```

## Chapter 2: Displaying and Describing Data

### Section 2.1: Summarizing and Displaying a Categorical Variable

```{r}
#| message: false
options(digits = 3) # make decimal results easier to read
Titanic <- read_csv("http://nhorton.people.amherst.edu/is5/data/Titanic.csv")
```

By default, the `read_csv()` function (loaded with the `tidyverse` package) prints the variable names.
These messages were suppressed using the `message: false` code chunk option to save space and improve readability.  

```{r}
# Table 2.2, page 19
tally(~ Class, data = Titanic)

# Table 2.3
tally(~ Class, format = "percent", data = Titanic)

# Figure 2.2, page 19
gf_bar(~ Class, data = Titanic) |>
  gf_labs(x = "Ticket Class", y = "Number of People")
```

`GOAL(~ X)` is the general form of the modeling language for one variable in the `mosaic` package.
We use `gf_bar()` to make a bar graph using the `ggformula` system, which is automatically make accessible by loading the `mosaic` package.  

See the Mosaic Minimal Guide for more details:

[https://cran.r-project.org/web/packages/mosaic/vignettes/MinimalRgg.pdf](https://cran.r-project.org/web/packages/mosaic/vignettes/MinimalRgg.pdf)


### Section 2.2: Displaying a Quantitative Variable

#### Ages of Those Aboard the Titanic

```{r}
# Figure 2.7, page 24
gf_histogram(~ Age, data = Titanic, binwidth = 5, ylab = "# of People", center = 5 / 2)
```

The function generates a warning because three of the ages are missing; this output can (and should!) be suppressed by adding `warning: false` as an option in this code chunk.
Note: be sure to double check about a warning before suppressing it!

#### Earthquakes and Tsunamis

We begin by reading in the data.

```{r}
#| message: false
#| warning: false
# Example 2.3, page 25
Earthquakes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Tsunamis_2016.csv")
gf_histogram(~ Primary_Magnitude,
  data = Earthquakes, binwidth = 0.5,
  ylab = "# of Earthquakes", xlab = "Magnitude", center = 0.25
)
```

#### Stem-and-Leaf Displays

See page 26.

```{r}
#| message: false
# Figure 2.8, page 26
Pulse_rates <- read_csv("http://nhorton.people.amherst.edu/is5/data/Pulse_rates.csv")
gf_histogram(~ Pulse, data = Pulse_rates, binwidth = 5, center = 5 / 2)
with(Pulse_rates, stem(Pulse)) # base R function that doesn't have a `data =` option.
```

#### Dotplot

See Figure 2.9, page 27

```{r}
#| message: false
Derby <- readr::read_csv("http://nhorton.people.amherst.edu/is5/data/Kentucky_Derby_2016.csv")
gf_dotplot(~Time_Sec, data = Derby, binwidth = 1) |>
  gf_labs(x = "Winning Time (sec)", y = "# of Races")
```

#### Density Plots

There are two forms of density plots: not-shaded, and shaded.
The former will be useful when comparing multiple densities.

```{r}
#| warning: false
# Figure 2.10, page 27
gf_dens(~ Age, data = Titanic)
gf_density(~ Age, data = Titanic)
```

### Section 2.3: Shape

See displays on pages 28-29.

#### Consumer Price Index

First we need to load the data.

```{r}
#| message: false
CPI <- read_csv("http://nhorton.people.amherst.edu/is5/data/CPI_Worldwide.csv") |>
  janitor::clean_names()
```

The pipe operator (`|>`) takes the output of the first command and uses it as input for the next.  

The variable names coming out of spreadsheet files can sometimes be quite wonky. 
The results from `read_csv()` are passed to the `clean_names()` function from the `janitor` package to make them more consistent.

```{r}
names(CPI)
```

You can use the `names()` function to check the reformatted names.    
(We'll often use the `glimpse()` command to provide even more information about a dataset.)

```{r}
# Example 2.5, page 30
gf_histogram(~consumer_price_index,
  data = CPI, ylab = "# of Cities",
  xlab = "Consumer Price Index", binwidth = 5, center = 5 / 2
)
```

#### Credit Card Expenditures

First we load the data.

```{r}
#! message: false
CreditCardEx <- read_csv("http://nhorton.people.amherst.edu/is5/data/Credit_card_charges.csv") |>
  janitor::clean_names()
# Figure 2.6, page 30
gf_histogram(~charges,
  data = CreditCardEx, ylab = "# of Customers",
  xlab = "Average Monthly Expenditure ($)", binwidth = 400, center = 200
)
```

### Section 2.4: Center

#### Finding Median and Mean

First we need to load the data.

```{r}
#| warning: false
TitanicCrew <- filter(Titanic, Class == "Crew")
# Figure 2.15, page 32
TitanicCrew |>
  mutate(color = ifelse(Age <= median(Age), "Less", "Greater")) |>
  gf_histogram(~ Age, fill = ~color, binwidth = 5, center = 5 / 2, ylab = "# of Crew Members") |>
  gf_labs(fill = "Age compared to median") 
# Figure 2.16
gf_histogram(
  ~Age, 
  data = TitanicCrew, 
  ylab = "# of Crew Members", 
  binwidth = 5, 
  center = 5 / 2) |>
  gf_vline(xintercept = mean(~Age, data = TitanicCrew))
df_stats(~Age, data = TitanicCrew)
```

Another way to generate summary statistics is the `favstats()` command (we will stick to `df_stats()` because it is more flexible).

```{r}
favstats(~Age, data = TitanicCrew)
```

### Section 2.5: Spread

#### The Range

```{r}
range(~Age, data = TitanicCrew)
diff(range(~Age, data = TitanicCrew))
```

The `range()` function returns the maximum and minimum values, so we can use the `diff()`
function to find the difference between the two values.

#### The Interquartile Range

```{r}
df_stats(~ Age, data = TitanicCrew)
IQR(~ Age, data = TitanicCrew)
```

Using the `IQR()` function allows us to avoid having to manually find the 
IQR by subtracting Q1 from Q3 from the `df_stats()` output.  

#### Standard Deviation  

```{r}
sd(~Age, data = TitanicCrew)
var(~Age, data = TitanicCrew)
```

#### Summarizing a Distribution

First we need to load the data.

```{r}
#| message: false
Nissan <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nissan.csv")
# Step-by-Step Example, page 39
gf_histogram(~mpg,
  data = Nissan, binwidth = 1, xlab = "Fuel Efficiency (mpg)",
  ylab = "# of Fill-ups", center = 5 / 2
)
df_stats(~ mpg, data = Nissan)
```

#### Random Matters

First we need to load the data.

```{r}
#| message: false
Commute <- read_csv("http://nhorton.people.amherst.edu/is5/data/Population_Commute_Times.csv") |>
  janitor::clean_names()
# Figure 2.19, page 40
gf_histogram(~ commute_time,
  data = Commute, binwidth = 10, xlab = "Commute Time (min)",
  ylab = "# of Employees", center = 5
)
```