---
title: "IS5 in R: Relationships Between Categorical Variables--Contingency Tables (Chapter 3)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "2025-01-20"
date-format: iso
format: pdf
toc: true
editor: source
---

```{r}
#| label: setup
#| include: false
library(mosaic)   
library(tidyverse)
```


## Introduction and background 

This document describes how to carry out the analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock.
This file, along with the Quarto source file used to create it, can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.
In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses.
A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

We begin by loading packages that will be required for our analyses.

```{r}
library(mosaic)
library(tidyverse)
```


## Chapter 3: Relationships Between Categorical Variables--Contingency Tables

### Section 3.1: Contingency Tables

```{r}
#| message: false
library(janitor)
OKCupid <-
  read_csv(
    "http://nhorton.people.amherst.edu/is5/data/OKCupid_CatsDogs.csv", 
    skip = 1
  ) |>
  janitor::clean_names()
```

The `read_csv()` function lists the input variable names by default.
These were suppressed using the `message: false` code chunk option to save space.
Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace).
You can use the `names()` function to check the cleaned names.
We use `skip = 1` because the first line in the original data set is a set of variable labels (e.g., `Col1`, `Col2`).
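
To see what `skip = 1` does, here is a minimal sketch that uses base R's `read.csv()` on a small inline string rather than `read_csv()` on the real file. The contents are made up for illustration: the first line mimics a label row (`Col1`, `Col2`) and the second line holds hypothetical variable names.

```{r}
# Hypothetical file contents: a label row, then the real header, then data.
raw_text <- "Col1,Col2\ngender,pets\nmale,cats\nfemale,dogs"

# Without skipping, the label row is mistaken for the header.
with_labels <- read.csv(text = raw_text)
names(with_labels)

# Skipping one line lets the real header supply the variable names.
skipped <- read.csv(text = raw_text, skip = 1)
names(skipped)
```

In the unskipped version the true header also ends up as a row of data, which is why the label line needs to be dropped at read time.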

```{r}
names(OKCupid)
```

The `names()` function is an easy way to see what variables are included in a dataset.

```{r}
glimpse(OKCupid)
```

The `glimpse()` function provides more information: the number of rows and columns, the type of each variable, and its first few values.

```{r}
# Table 3.1, page 65
tally(~ cats_dogs_both + gender, margins = TRUE, useNA = "no", data = OKCupid)
# Table 3.2
# percentages of the grand total
tally(~ cats_dogs_both + gender,
  format = "percent", margins = TRUE, useNA = "no",
  data = OKCupid
)
# percentages within each gender
tally(cats_dogs_both ~ gender,
  format = "percent", margins = TRUE, useNA = "no",
  data = OKCupid
)
# Table 3.3
# percentages within each pet category
tally(gender ~ cats_dogs_both, format = "percent", margins = TRUE, data = OKCupid)
```

Note that the logical values `TRUE` and `FALSE` are written in all caps in R, whereas code chunk options are lower-case (e.g., `message: false`).
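
The percentage tables above differ only in their denominators. The following sketch makes that explicit with base R's `prop.table()` on a small table of made-up counts (the names `pet` and `gender` and the numbers are assumptions for illustration, not values from the OKCupid data). Note that `tally()` may arrange rows and columns differently; only the conditioning is meant to be analogous.

```{r}
# Hypothetical counts, chosen only to make the arithmetic easy to follow.
counts <- as.table(matrix(
  c(10, 30,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(pet = c("cats", "dogs"), gender = c("F", "M"))
))

# Percent of the grand total (like a one-sided formula, ~ pet + gender).
joint <- 100 * prop.table(counts)

# Percent within each gender (like pet ~ gender): columns sum to 100.
within_gender <- 100 * prop.table(counts, margin = 2)

# Percent within each pet category (like gender ~ pet): rows sum to 100.
within_pet <- 100 * prop.table(counts, margin = 1)

joint
within_gender
within_pet
```

The same cell (say, female cat owners) is 10% of everyone, about 33% of females, and 25% of cat owners, which is why it matters which table is being read.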


#### Example 3.1: Exploring Marginal Distributions

We begin by reading and tallying the data.

```{r}
#| message: false
SuperBowl <-
  read_csv(
    "http://nhorton.people.amherst.edu/is5/data/Watch_the_Super_bowl.csv",
    skip = 1
  )
tally(~ Plan + Sex, data = SuperBowl)
```

#### Example 3.2: Exploring Percentages: Children and First-Class Ticket Holders First?

We do the same for the Titanic data.

```{r}
#| message: false
Titanic <- read_csv("http://nhorton.people.amherst.edu/is5/data/Titanic.csv")
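# percentages of the grand total
tally(~ Class + Survived, format = "percent", margins = TRUE, data = Titanic)
# distribution of Class within each survival outcome
tally(Class ~ Survived, format = "percent", margins = TRUE, data = Titanic)
# distribution of survival within each Class
tally(Survived ~ Class, format = "percent", margins = TRUE, data = Titanic)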
```

### Section 3.2: Conditional Distributions

See the displays on pages 68-69.

```{r}
#| fig.width: 7
OKdata <- tally(
  cats_dogs_both ~ gender,
  format = "percent", useNA = "no",
  data = OKCupid
) |>
  data.frame()
# Figure 3.2, page 69
gf_col(Freq ~ gender, fill = ~ cats_dogs_both, position = "dodge", data = OKdata) |>
  gf_labs(x = "", y = "", fill = "")
```
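
The `Freq` column used in the plot above arises when a table is converted to a data frame, which yields one row per cell. Here is a minimal base R sketch of that conversion, using made-up counts and `prop.table()` in place of `tally()` (the variable names and numbers are assumptions for illustration).

```{r}
# Hypothetical counts; names and values are for illustration only.
pets <- as.table(matrix(
  c(10, 30,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(pet = c("cats", "dogs"), gender = c("F", "M"))
))

# Column percentages, then one row per (pet, gender) cell.
# The cell values land in a column named Freq.
pet_long <- as.data.frame(100 * prop.table(pets, margin = 2))
pet_long
```

This long layout, with one column per classifying variable plus `Freq`, is the shape that a formula such as `Freq ~ gender` expects.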

#### Example 3.3: Finding Conditional Distributions: Watching the Super Bowl

We can calculate conditional probabilities from tables using `mosaic::tally()`.

```{r}
tally(~ Plan + Sex, margins = TRUE, data = SuperBowl)
tally(Plan ~ Sex, format = "percent", data = SuperBowl)
```

#### Example 3.4: Looking for Associations Between Variables: Still Watching the Super Bowl

```{r}
Superdata <- tally(Plan ~ Sex, format = "percent", data = SuperBowl) |>
  data.frame()
gf_col(Freq ~ Plan, fill = ~Sex, position = "dodge", data = Superdata)
```

#### Examining Contingency Tables

See the displays on page 72.

```{r}
#| message: false
FishDiet <- read_csv("http://nhorton.people.amherst.edu/is5/data/Fish_diet.csv", skip = 1) |>
  janitor::clean_names()
tally(~ diet_counts + cancer_counts, margins = TRUE, data = FishDiet)
```

#### Random Matters

See the display on page 74.

```{r}
#| message: false
Nightmares <- 
  read_csv("http://nhorton.people.amherst.edu/is5/data/Nightmares.csv", skip = 1)
glimpse(Nightmares)
Nightmares <- Nightmares |>   # recode the `Dream` variable
  mutate(Dream = ifelse(Dream == "N", "Nightmare", "SweetDreams"))
glimpse(Nightmares)
```
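
The recoding step relies on `ifelse()`, which is applied elementwise to the whole column. The sketch below shows its behavior on a short made-up vector; the code `"S"` and the missing value are assumptions added for illustration and may not occur in the actual data.

```{r}
# Hypothetical codes; "S" stands in for whatever non-"N" value the data use.
dream_codes <- c("N", "S", "S", "N", NA)

# Elementwise recode: "N" becomes "Nightmare", any other value "SweetDreams".
# A missing code stays missing rather than falling into either group.
dream_labels <- ifelse(dream_codes == "N", "Nightmare", "SweetDreams")
dream_labels
```

Because every value other than `"N"` is mapped to `"SweetDreams"`, it is worth checking the raw codes (for example with `glimpse()`, as above) before recoding.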

Now we can calculate the contingency table.

```{r}
tally(~ Dream + Side, margins = TRUE, data = Nightmares)
```

### Section 3.3: Displaying Contingency Tables

```{r}
tally(~ Class + Survived, format = "count", data = Titanic)
tally(~ Class + Survived, format = "percent", data = Titanic)
# Figure 3.4, page 75
gf_percents(~ Class, fill = ~ Survived, position = position_dodge(), data = Titanic)
# Figure 3.5
gf_percents(~ Survived, fill = ~ Class, position = "fill", data = Titanic)
```

```{r}
#| fig.width: 7
#| fig.height: 7
# Figure 3.6, page 76
vcd::mosaic(tally(~ Survived + Class, data = Titanic),
  main = "Mosaic plot of Class by Survival",
  shade = TRUE
)
```

See the mosaic plots on page 77.  

### Section 3.4: Three Categorical Variables

```{r}
tally(~ gender + cats_dogs_both + drugs_y_n, format = "percent", data = OKCupid)
```

#### Example 3.7: Looking for Associations Among Three Variables at Once

We can repeat the mosaic plot with three variables.

```{r}
#| fig.height: 5
#| fig.width: 7
vcd::mosaic(tally(~ Sex + Survived + Class, data = Titanic), shade = TRUE)
```

#### Example 3.8: Simpson's Paradox: Gender Discrimination?

Here we demonstrate how to generate one of the tables on page 80.  

```{r}
# Create a dataframe from the counts
# http://mathemathinking.blogspot.com/2012/06/simpsons-paradox.html
Berk <- bind_rows(
  do(512) * data.frame(admit = TRUE, sex = "M", school = "A"),
  do(825 - 512) * data.frame(admit = FALSE, sex = "M", school = "A"),
  do(89) * data.frame(admit = TRUE, sex = "F", school = "A"),
  do(19) * data.frame(admit = FALSE, sex = "F", school = "A")
)
```

As noted previously, the logical values `TRUE` and `FALSE` are written in all caps in R, whereas code chunk options are lower-case (e.g., `message: false`).

Here, `do(n) * data.frame(...)` repeats the one-row data frame `n` times, creating `n` observations with the specified values.
The `bind_rows()` function then stacks the four resulting data frames into one.
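
For comparison, the same expansion from counts to one row per applicant can be sketched in base R by repeating row indices. This is an alternative to the `do()` approach, not a replacement for it; the object names `school_a`, `applicants`, and `admit_pct` are hypothetical, and the counts are those used in the chunk above.

```{r}
# Counts for school A, taken from the do() calls above.
school_a <- data.frame(
  admit = c(TRUE, FALSE, TRUE, FALSE),
  sex   = c("M", "M", "F", "F"),
  n     = c(512, 825 - 512, 89, 19)
)

# Repeat each row index n times to get one row per applicant.
row_index <- rep(seq_len(nrow(school_a)), times = school_a$n)
applicants <- school_a[row_index, c("admit", "sex")]

# Counts by sex and admission, then percent admitted within each sex.
table(applicants$sex, applicants$admit)
admit_pct <- 100 * tapply(applicants$admit, applicants$sex, mean)
admit_pct
```

The percentages are simply 89/108 for women and 512/825 for men, expressed out of 100.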

```{r}
tally(~ sex + admit, data = Berk)
tally(admit ~ sex, format = "percent", data = Berk)
```