---
title: "IS5 in R: Displaying and Describing Data (Chapter 2)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "December 19, 2020"
output: 
  pdf_document:
    fig_height: 3
    fig_width: 5
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---


```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads mosaic (which also attaches dplyr), readr, and janitor
library(mosaic)
library(readr)
library(janitor)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
library(knitr)
opts_chunk$set(
  tidy = FALSE, # display code as typed
  size = "small" # slightly smaller font for code
)
```

## Introduction and background 

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock.
This file, along with the R Markdown source file used to create it, can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.
In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses.
A short summary of the R commands needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

## Chapter 2: Displaying and Describing Data

### Section 2.1: Summarizing and Displaying a Categorical Variable

```{r message = FALSE}
library(mosaic)
library(readr)
library(janitor) # for variable names
options(digits = 3)
Titanic <- read_csv("http://nhorton.people.amherst.edu/is5/data/Titanic.csv")
```

By default, `read_csv()` prints a message listing each column's name and parsed type. These messages were suppressed using the `message=FALSE` code chunk option to save space and improve readability.

```{r}
# Table 2.2, page 19
tally(~Class, data = Titanic)

# Table 2.3
tally(~Class, format = "percent", data = Titanic)

# Figure 2.2, page 19
gf_bar(~Class, data = Titanic) %>%
  gf_labs(x = "Ticket Class", y = "Number of People")
```

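`GOAL(~ X)` is the general form of the modeling language for one variable in the `mosaic` package: `GOAL` is the function to apply (for example `tally` or `gf_bar`) and `X` is the variable, with the data frame supplied through the `data` argument.
We use `gf_bar()` to make a bar graph using the `ggformula` system, which is automatically loaded with the `mosaic` package.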

See the Minimal Guide for more details: https://cran.r-project.org/web/packages/mosaic/vignettes/MinimalRgg.pdf
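
As a rough point of comparison, the counts and percentages that `tally()` reports can be sketched in base R with `table()` and `prop.table()`. The `ticket_class` vector below is a small made-up example, not the Titanic data.

```{r}
# Hypothetical ticket classes (illustration only, not the Titanic data)
ticket_class <- c("Crew", "Third", "First", "Crew", "Second", "Third", "Crew", "Third")

# Counts per category, comparable to tally(~Class, data = Titanic)
class_counts <- table(ticket_class)
class_counts

# Percentages per category, comparable to tally(..., format = "percent")
class_percents <- 100 * prop.table(class_counts)
class_percents
```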

### Section 2.2: Displaying a Quantitative Variable

#### Ages of Those Aboard the Titanic

```{r}
# Figure 2.7, page 24
gf_histogram(~Age, data = Titanic, binwidth = 5, ylab = "# of People", center = 5 / 2)
```

The function generates a warning because three of the ages are missing; this output can (and should!) be suppressed by adding `warning=FALSE` as an option in this code chunk.
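
Before suppressing the warning, it can be worth confirming how many values are missing. A minimal sketch, using a made-up `ages` vector rather than the Titanic data:

```{r}
# Hypothetical ages with missing values (illustration only)
ages <- c(22, 38, NA, 35, NA, 54, 2, NA)

# is.na() flags each missing value; sum() counts the TRUEs
n_missing <- sum(is.na(ages))
n_missing

# Number of values that would actually be plotted
n_observed <- sum(!is.na(ages))
n_observed
```

For the data in this section, the analogous call would be `sum(is.na(Titanic$Age))`.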

#### Earthquakes and Tsunamis

We begin by reading in the data.

```{r message = FALSE, warning = FALSE}
# Example 2.3, page 25
Earthquakes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Tsunamis_2016.csv")
gf_histogram(~Primary_Magnitude,
  data = Earthquakes, binwidth = 0.5,
  ylab = "# of Earthquakes", xlab = "Magnitude", center = 0.25
)
```
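
Both histograms above set `center` to half of `binwidth`, which places the bin edges at whole multiples of the bin width (0, 0.5, 1, and so on for the magnitudes). The sketch below reproduces that binning in base R with `cut()` on a handful of made-up magnitudes. It assumes the `ggplot2` default of bins that are closed on the right.

```{r}
# Hypothetical magnitudes (illustration only, not the tsunami data)
magnitude <- c(6.1, 6.5, 6.6, 7.0, 7.2, 7.9, 8.3)

bin_width <- 0.5
bin_center <- bin_width / 2 # same role as center = 0.25 above

# k-th edge = center - width / 2 + k * width; k = 12, ..., 17 covers 6 to 8.5
bin_edges <- (bin_center - bin_width / 2) + bin_width * (12:17)
bin_edges

# cut() uses right-closed intervals (a, b] by default
bin_counts <- table(cut(magnitude, breaks = bin_edges))
bin_counts
```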

#### Stem-and-Leaf Displays

See page 26.

```{r message=FALSE}
# Figure 2.8, page 26
Pulse_rates <- read_csv("http://nhorton.people.amherst.edu/is5/data/Pulse_rates.csv")
gf_histogram(~Pulse, data = Pulse_rates, binwidth = 5, center = 5 / 2)
with(Pulse_rates, stem(Pulse))
```

#### Dotplot

See Figure 2.9, page 27.

```{r message = FALSE}
Derby <- read_csv("http://nhorton.people.amherst.edu/is5/data/Kentucky_Derby_2016.csv")
gf_dotplot(~Time_Sec, data = Derby, binwidth = 1) %>%
  gf_labs(x = "Winning Time (sec)", y = "# of Races")
```

#### Density Plots

There are two forms of density plots: unshaded (`gf_dens()`) and shaded (`gf_density()`).
The unshaded form is useful when comparing multiple densities, because overlapping curves do not hide one another.

```{r, warning = FALSE}
# Figure 2.10, page 27
gf_dens(~Age, data = Titanic)
gf_density(~Age, data = Titanic)
```
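
To illustrate why the unshaded form helps with comparisons, here is a base R sketch that overlays two kernel density estimates from `density()`. The two groups are simulated, made-up data (not the Titanic ages), and base graphics stand in for `gf_dens()`.

```{r}
set.seed(2)

# Simulated ages for two hypothetical groups (illustration only)
group_a <- rnorm(100, mean = 30, sd = 8)
group_b <- rnorm(100, mean = 40, sd = 10)

# Kernel density estimate for each group
dens_a <- density(group_a)
dens_b <- density(group_b)

# Draw the first curve, leaving room for the second, then overlay it
plot(dens_a,
  xlim = range(c(dens_a$x, dens_b$x)),
  ylim = c(0, max(c(dens_a$y, dens_b$y))),
  main = "Two unshaded densities", xlab = "Age"
)
lines(dens_b, lty = 2)
```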

### Section 2.3: Shape

See displays on pages 28-29.

#### Consumer Price Index

First we need to load the data.

```{r message = FALSE}
CPI <- read_csv("http://nhorton.people.amherst.edu/is5/data/CPI_Worldwide.csv") %>%
  janitor::clean_names()
names(CPI)
# Example 2.5, page 30
gf_histogram(~consumer_price_index,
  data = CPI, ylab = "# of Cities",
  xlab = "Consumer Price Index", binwidth = 5, center = 5 / 2
)
```

We can use `clean_names()` from the `janitor` package to format the names of the columns when necessary. You can use the `names()` function to check the reformatted names.    
The pipe operator (`%>%`) passes the result of the expression on its left to the function on its right, as that function's first argument.
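
A minimal sketch of how a pipe rewrites a function call, using a made-up vector. To keep the example free of package dependencies it uses R's native pipe `|>` (assuming R 4.1 or later), which behaves the same way as `%>%` for simple calls like these.

```{r}
# Hypothetical values (illustration only)
prices <- c(3.14159, 2.71828, 1.41421)

# Nested call: read from the inside out
nested <- round(mean(prices), digits = 2)

# Piped call: the left-hand result becomes the first argument on the right
piped <- prices |>
  mean() |>
  round(digits = 2)

identical(nested, piped)
```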

#### Credit Card Expenditures

First we load the data.

```{r message = FALSE}
CreditCardEx <- read_csv("http://nhorton.people.amherst.edu/is5/data/Credit_card_charges.csv") %>%
  janitor::clean_names()
# Figure 2.6, page 30
gf_histogram(~charges,
  data = CreditCardEx, ylab = "# of Customers",
  xlab = "Average Monthly Expenditure ($)", binwidth = 400, center = 200
)
```

### Section 2.4: Center

#### Finding Median and Mean

First we subset the Titanic data to the crew members.

```{r warning = FALSE}
TitanicCrew <- filter(Titanic, Class == "Crew")
# Figure 2.15, page 32
TitanicCrew %>%
  mutate(color = ifelse(Age <= median(Age, na.rm = TRUE), "Less", "Greater")) %>%
  gf_histogram(~Age, fill = ~color, binwidth = 5, center = 5 / 2, ylab = "# of Crew Members") %>%
  gf_labs(fill = "Age compared to median") 
# Figure 2.16
gf_histogram(
  ~Age, 
  data = TitanicCrew, 
  ylab = "# of Crew Members", 
  binwidth = 5, 
  center = 5 / 2) %>%
  gf_vline(xintercept = mean(~Age, data = TitanicCrew))
df_stats(~Age, data = TitanicCrew)
```

Another way to generate summary statistics is the `favstats()` command (we will stick to `df_stats()` because it is more flexible).

```{r}
favstats(~Age, data = TitanicCrew)
```
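
For reference, the quantities reported by `favstats()` (`min`, `Q1`, `median`, `Q3`, `max`, `mean`, `sd`, `n`, `missing`) can be approximated in base R. This is only a sketch on a made-up vector, and the helper name `basic_stats()` is hypothetical, not part of `mosaic`.

```{r}
# Hypothetical helper (not part of mosaic): favstats-like summary in base R
basic_stats <- function(x) {
  observed <- x[!is.na(x)] # drop missing values before summarizing
  quartiles <- unname(quantile(observed, probs = c(0.25, 0.75)))
  c(
    min = min(observed), Q1 = quartiles[1], median = median(observed),
    Q3 = quartiles[2], max = max(observed), mean = mean(observed),
    sd = sd(observed), n = length(observed), missing = sum(is.na(x))
  )
}

# Made-up ages (illustration only), including one missing value
ages <- c(19, 24, 30, 41, 62, NA)
basic_stats(ages)
```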

### Section 2.5: Spread

#### The Range

```{r}
range(~Age, data = TitanicCrew)
diff(range(~Age, data = TitanicCrew))
```

The `range()` function returns the minimum and maximum values (in that order), so applying the `diff()`
function to the result gives the maximum minus the minimum.
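
A small worked example on made-up ages (not the crew data) shows the two steps:

```{r}
# Hypothetical ages (illustration only)
ages <- c(19, 24, 30, 41, 62)

age_range <- range(ages) # c(minimum, maximum)
age_range

# diff() subtracts the first element from the second: 62 - 19
diff(age_range)
```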

#### The Interquartile Range

```{r}
df_stats(~Age, data = TitanicCrew)
IQR(~Age, data = TitanicCrew)
```

The `IQR()` function saves us from computing the IQR by hand, that is, from
subtracting Q1 from Q3 in the `df_stats()` output.
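
The manual calculation can be sketched as follows on made-up ages (not the crew data). It assumes R's default quantile definition, which can differ slightly from quartiles found by hand with a textbook method.

```{r}
# Hypothetical ages (illustration only)
ages <- c(19, 24, 30, 41, 62)

# Q1 and Q3 using R's default quantile definition
quartiles <- quantile(ages, probs = c(0.25, 0.75))
quartiles

# Manual IQR: Q3 minus Q1
manual_iqr <- unname(quartiles[2] - quartiles[1])
manual_iqr

# IQR() gives the same value in one step
IQR(ages)
```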

#### Standard Deviation  

```{r}
sd(~Age, data = TitanicCrew)
var(~Age, data = TitanicCrew)
```
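
The two functions are directly related: the variance is the square of the standard deviation. The sketch below checks this on a made-up vector and also computes the variance from its definition (the sum of squared deviations from the mean, divided by n - 1).

```{r}
# Hypothetical values (illustration only)
values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Variance from its definition
deviations <- values - mean(values)
manual_var <- sum(deviations^2) / (length(values) - 1)
manual_var

# Built-in results: var() matches the definition, and sd() is its square root
var(values)
sd(values)
all.equal(sd(values)^2, var(values))
```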

#### Summarizing a Distribution

First we need to load the data.

```{r message = FALSE}
Nissan <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nissan.csv")
# Step-by-Step Example, page 39
gf_histogram(~mpg,
  data = Nissan, binwidth = 1, xlab = "Fuel Efficiency (mpg)",
  ylab = "# of Fill-ups", center = 5 / 2
)
df_stats(~mpg, data = Nissan)
```

#### Random Matters

First we need to load the data.

```{r message = FALSE}
Commute <- read_csv("http://nhorton.people.amherst.edu/is5/data/Population_Commute_Times.csv") %>%
  janitor::clean_names()
# Figure 2.19, page 40
gf_histogram(~commute_time,
  data = Commute, binwidth = 10, xlab = "Commute Time (min)",
  ylab = "# of Employees", center = 5
)
```