---
title: "IS5 in R: Relationships Between Categorical Variables--Contingency Tables (Chapter 3)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "December 19, 2020"
output: 
  pdf_document:
    fig_height: 3
    fig_width: 5
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---


```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads mosaic (and, through it, dplyr), plus readr and janitor
library(mosaic)
library(readr)
library(janitor)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
library(knitr)
opts_chunk$set(
  tidy = FALSE, # display code as typed
  size = "small" # slightly smaller font for code
)
```

## Introduction and background 

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock.
This file, as well as the R Markdown source file used to create it, can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.
In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses.
A short summary of the R commands needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

## Chapter 3: Relationships Between Categorical Variables--Contingency Tables

### Section 3.1: Contingency Tables

```{r message = FALSE}
library(mosaic)
library(readr)
library(janitor)
OKCupid <-
  read_csv("http://nhorton.people.amherst.edu/is5/data/OKCupid_CatsDogs.csv", skip = 1) %>%
  janitor::clean_names()
names(OKCupid)
```

By default, the `read_csv()` function prints a message describing the names and types of the columns it read.
That message was suppressed with the `message = FALSE` code chunk option to save space.
Here we use the `clean_names()` function from the `janitor` package to sanitize the column names (which would otherwise contain special characters or whitespace).
You can use the `names()` function to check the cleaned names.
We use `skip = 1` because the first line of the original data file is a row of generic variable labels (e.g., `Col1`, `Col2`) rather than the actual column names.
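
To see what `skip = 1` and name cleaning do, here is a minimal sketch in base R on a small made-up file. The values and the `raw` and `toy` names are hypothetical, not the OKCupid data, and the `gsub()` step only approximates `clean_names()`, whose actual rules are more thorough.

```{r}
# Hypothetical raw file: line 1 holds generic labels, line 2 the real header
raw <- "Col1,Col2
Cats Dogs Both,Gender
Has cats,F
Has dogs,M
Has both,F"

# skip = 1 drops the label line, so the second line is used as the header
toy <- read.csv(text = raw, skip = 1, check.names = FALSE)
names(toy)

# Rough approximation of name cleaning (an assumption, not janitor's exact rules):
# lower-case everything, then replace runs of other characters with underscores
names(toy) <- gsub("[^a-z0-9]+", "_", tolower(names(toy)))
names(toy)
```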

```{r}
# Table 3.1, page 65
tally(~ cats_dogs_both + gender, margins = TRUE, useNA = "no", data = OKCupid)
# Table 3.2
# One-sided formula: percentages of the overall total
tally(~ cats_dogs_both + gender,
  format = "percent", margins = TRUE, useNA = "no",
  data = OKCupid
)
# Two-sided formula: distribution of cats_dogs_both within each gender
tally(cats_dogs_both ~ gender,
  format = "percent", margins = TRUE, useNA = "no",
  data = OKCupid
)
# Table 3.3
tally(gender ~ cats_dogs_both, format = "percent", margins = TRUE, data = OKCupid)
```


#### Example 3.1: Exploring Marginal Distributions

We begin by reading and tallying the data.

```{r message = FALSE}
SuperBowl <-
  read_csv("http://nhorton.people.amherst.edu/is5/data/Watch_the_Super_bowl.csv",
    skip = 1
  )
tally(~ Plan + Sex, data = SuperBowl)
```

#### Example 3.2: Exploring Percentages: Children and First-Class Ticket Holders First?

We do the same for the Titanic data.

```{r message = FALSE}
Titanic <- read_csv("http://nhorton.people.amherst.edu/is5/data/Titanic.csv")
tally(~ Class + Survived, format = "percent", margins = TRUE, data = Titanic)
tally(Class ~ Survived, format = "percent", margins = TRUE, data = Titanic)
tally(Survived ~ Class, format = "percent", margins = TRUE, data = Titanic)
```

### Section 3.2: Conditional Distributions

See the displays on pages 68-69.

```{r fig.width=7}
OKdata <- tally(cats_dogs_both ~ gender,
  format = "percent", useNA = "no",
  data = OKCupid
) %>%
  data.frame()
# Figure 3.2, page 69
gf_col(Freq ~ gender, fill = ~cats_dogs_both, position = "dodge", data = OKdata) %>%
  gf_labs(x = "", y = "", fill = "")
```
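
The plotting call above relies on a `Freq` column. As a rough illustration in base R, converting a two-way table to a data frame yields one row per cell, with the cell value stored in `Freq`. The percentages and the `pct_table` and `long` names below are made up for this sketch.

```{r}
# Hypothetical column percentages (not the OKCupid results)
pct_table <- as.table(matrix(
  c(60, 40, 20, 80),
  nrow = 2,
  dimnames = list(pet = c("cats", "dogs"), gender = c("F", "M"))
))

# One row per (pet, gender) cell; the cell value lands in a column named Freq
long <- data.frame(pct_table)
long
```

This long layout is what lets a formula such as `Freq ~ gender` map one variable to the axis and another to the fill color.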

#### Example 3.3: Finding Conditional Distributions: Watching the Super Bowl

We can calculate conditional distributions from tables using `mosaic::tally()`: a formula such as `Plan ~ Sex` gives the distribution of `Plan` within each level of `Sex`.

```{r}
tally(~ Plan + Sex, margins = TRUE, data = SuperBowl)
tally(Plan ~ Sex, format = "percent", data = SuperBowl)
```
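
The arithmetic behind a conditional distribution can be sketched in base R with `prop.table()`. The counts below are hypothetical (they are not the Super Bowl survey results), and `counts` and `col_pct` are names chosen for this illustration.

```{r}
# Hypothetical counts: rows are Plan, columns are Sex
counts <- matrix(
  c(30, 20, 10, 40),
  nrow = 2,
  dimnames = list(Plan = c("Game", "Other"), Sex = c("Female", "Male"))
)

# Condition on Sex: divide each column by its own total
col_pct <- 100 * prop.table(counts, margin = 2)
col_pct

# Each conditional distribution sums to 100 percent
colSums(col_pct)
```

Because each column is scaled by its own total, the columns can be compared directly even when the groups have different sizes.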

#### Example 3.4: Looking for Associations Between Variables: Still Watching the Super Bowl

```{r}
Superdata <- tally(Plan ~ Sex, format = "percent", data = SuperBowl) %>%
  data.frame()
gf_col(Freq ~ Plan, fill = ~Sex, position = "dodge", data = Superdata)
```

#### Examining Contingency Tables

See the displays on page 72.

```{r message = FALSE}
FishDiet <- read_csv("http://nhorton.people.amherst.edu/is5/data/Fish_diet.csv", skip = 1) %>%
  janitor::clean_names()
tally(~ diet_counts + cancer_counts, margins = TRUE, data = FishDiet)
```

#### Random Matters

See the display on page 74.

```{r message = FALSE}
Nightmares <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nightmares.csv", skip = 1)
Nightmares <- Nightmares %>%
  mutate(Dream = ifelse(Dream == "N", "Nightmare", "SweetDreams"))
tally(~ Dream + Side, margins = TRUE, data = Nightmares)
```

### Section 3.3: Displaying Contingency Tables

```{r}
tally(~ Class + Survived, format = "count", data = Titanic)
tally(~ Class + Survived, format = "percent", data = Titanic)
# Figure 3.4, page 75
gf_percents(~Class, fill = ~Survived, position = position_dodge(), data = Titanic)
# Figure 3.5
gf_percents(~Survived, fill = ~Class, position = "fill", data = Titanic)
```

```{r fig.width=7, fig.height = 7}
# Figure 3.6, page 76
vcd::mosaic(tally(~ Survived + Class, data = Titanic),
  main = "Mosaic plot of Class by Survival",
  shade = TRUE
)
```

See the mosaic plots on page 77.  

### Section 3.4: Three Categorical Variables

```{r}
tally(~ gender + cats_dogs_both + drugs_y_n, format = "percent", data = OKCupid)
```

#### Example 3.7: Looking for Associations Among Three Variables at Once

We can repeat the mosaic plot with three variables.

```{r fig.height=5, fig.width=7}
vcd::mosaic(tally(~ Sex + Survived + Class, data = Titanic), shade = TRUE)
```
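
A three-way table can also be inspected numerically. The sketch below uses base R's `ftable()` on the `Titanic` table that ships with R's `datasets` package. This is an assumption made for illustration: the built-in table may not match the book's `Titanic.csv` exactly, and `three_way` is a name chosen here.

```{r}
# Built-in 4-way table of counts: Class x Sex x Age x Survived
# (assumption: it may differ from the book's Titanic.csv)
# Collapse over Age, keeping Sex, Survived, and Class
three_way <- margin.table(datasets::Titanic, margin = c(2, 4, 1))

# Flatten to a two-dimensional display: Class and Sex in rows, Survived in columns
ftable(three_way, row.vars = c("Class", "Sex"), col.vars = "Survived")
```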

#### Example 3.8: Simpson's Paradox: Gender Discrimination?

Here we demonstrate how to generate one of the tables on page 80.  

```{r}
# Create a dataframe from the counts
# http://mathemathinking.blogspot.com/2012/06/simpsons-paradox.html
Berk <- rbind(
  do(512) * data.frame(admit = TRUE, sex = "M", school = "A"),
  do(825 - 512) * data.frame(admit = FALSE, sex = "M", school = "A"),
  do(89) * data.frame(admit = TRUE, sex = "F", school = "A"),
  do(19) * data.frame(admit = FALSE, sex = "F", school = "A")
)
```

In this case, `do(n) * data.frame(...)` creates `n` copies of the one-row data frame with the specified values. The `rbind()` function then combines these pieces into a single data frame.
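
For comparison, the same expansion from counts to one row per applicant can be sketched in base R by repeating row indices with `rep()`. This is an alternative to the `do()` approach rather than what the chunk above does, and the `cell_counts` and `Berk2` names are hypothetical.

```{r}
# One row per cell of the table, with the cell count in n
cell_counts <- data.frame(
  admit = c(TRUE, FALSE, TRUE, FALSE),
  sex = c("M", "M", "F", "F"),
  school = "A",
  n = c(512, 825 - 512, 89, 19)
)

# Repeat each row index n times, then drop the count column
rows <- rep(seq_len(nrow(cell_counts)), times = cell_counts$n)
Berk2 <- cell_counts[rows, c("admit", "sex", "school")]
rownames(Berk2) <- NULL

nrow(Berk2)
table(Berk2$sex, Berk2$admit)
```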

```{r}
tally(~ sex + admit, data = Berk)
tally(admit ~ sex, format = "percent", data = Berk)
```