--- title: "IS5 in R: Relationships Between Categorical Variables--Contingency Tables (Chapter 3)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "2025-01-20" date-format: iso format: pdf toc: true editor: source --- ```{r} #| label: setup #| include: false library(mosaic) library(tidyverse) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated Quarto reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. We begin by loading packages that will be required for our analyses. ```{r} library(mosaic) library(tidyverse) ``` ## Chapter 3: Relationships Between Categorical Variables--Contingency Tables ### Section 3.1: Contingency Tables ```{r} #| message: false library(janitor) OKCupid <- read_csv( "http://nhorton.people.amherst.edu/is5/data/OKCupid_CatsDogs.csv", skip = 1 ) |> janitor::clean_names() ``` The `read_csv()` function lists the input variable names by default. These were suppressed using the `message: false` code chunk option to save space. Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). You can use the `names()` function to check the cleaned names. We use `skip = 1` because the first line in the original data set is a set of variable labels (e.g., `Col1`, `Col2`). ```{r} names(OKCupid) ``` The `names()` function is an easy way to see what variables are included in a dataset. ```{r} glimpse(OKCupid) ``` The `glimpse()` function provides more information. ```{r} # Table 3.1, page 65 tally(~ cats_dogs_both + gender, margin = TRUE, useNA = "no", data = OKCupid) # Table 3.2 tally(~ cats_dogs_both + gender, format = "percent", margin = TRUE, useNA = "no", data = OKCupid ) tally(cats_dogs_both ~ gender, format = "percent", margin = TRUE, useNA = "no", data = OKCupid ) # Table 3.3 tally(gender ~ cats_dogs_both, format = "percent", margin = TRUE, data = OKCupid) ``` We note that the logical values `TRUE` and `FALSE` are all caps in R, but that code chunk options are all lower-case (e.g., `message: false`). #### Example 3.1: Exploring Marginal Distributions We begin by reading and tallying the data. ```{r} #| message: false SuperBowl <- read_csv( "http://nhorton.people.amherst.edu/is5/data/Watch_the_Super_bowl.csv", skip = 1 ) tally(~ Plan + Sex, data = SuperBowl) ``` #### Example 3.2: Exploring Percentages: Children and First-Class Ticket Holders First? We do the same for the Titanic data. ```{r} #| message: false Titanic <- read_csv("http://nhorton.people.amherst.edu/is5/data/Titanic.csv") tally(~ Class + Survived, format = "percent", margin = TRUE, data = Titanic) tally(Class ~ Survived, format = "percent", margin = TRUE, data = Titanic) tally(Survived ~ Class, format = "percent", margin = TRUE, data = Titanic) ``` ### Section 3.2: Conditional Distributions See displays on 68-69. ```{r} #| fig.width: 7 OKdata <- tally( cats_dogs_both ~ gender, format = "percent", useNA = "no", data = OKCupid ) |> data.frame() # Figure 3.2, page 69 gf_col(Freq ~ gender, fill = ~ cats_dogs_both, position = "dodge", data = OKdata) |> gf_labs(x = "", y = "", fill = "") ``` #### Example 3.3: Finding Conditional Distributions: Watching the Super Bowl We can calculate conditional probabilities from tables using `mosaic::tally()`. ```{r} tally(~ Plan + Sex, margin = TRUE, data = SuperBowl) tally(Plan ~ Sex, format = "percent", data = SuperBowl) ``` #### Example 3.4: Looking for Associations Between Variables: Still Watching the Super Bowl ```{r} Superdata <- tally(Plan ~ Sex, format = "percent", data = SuperBowl) |> data.frame() gf_col(Freq ~ Plan, fill = ~Sex, position = "dodge", data = Superdata) ``` #### Examining Contingency Tables See displays on page 72. ```{r} #| message: false FishDiet <- read_csv("http://nhorton.people.amherst.edu/is5/data/Fish_diet.csv", skip = 1) |> janitor::clean_names() tally(~ diet_counts + cancer_counts, margins = TRUE, data = FishDiet) ``` #### Random Matters See display on page 74. ```{r} #| message: false Nightmares <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nightmares.csv", skip = 1) glimpse(Nightmares) Nightmares <- Nightmares |> # recode the `Dream` variable mutate(Dream = ifelse(Dream == "N", "Nightmare", "SweetDreams")) glimpse(Nightmares) ``` Now we can calculate the contingency table. ```{r} tally(~ Dream + Side, margins = TRUE, data = Nightmares) ``` ### Section 3.3: Displaying Contingency Tables ```{r} tally(~ Class + Survived, format = "count", data = Titanic) tally(~ Class + Survived, format = "percent", data = Titanic) # Figure 3.4, page 75 gf_percents(~ Class, fill = ~ Survived, position = position_dodge(), data = Titanic) # Figure 3.5 gf_percents(~ Survived, fill = ~ Class, position = "fill", data = Titanic) ``` ```{r} #| fig.width: 7 #| fig.height: 7 # Figure 3.6, page 76 vcd::mosaic(tally(~ Survived + Class, data = Titanic), main = "Mosaic plot of Class by Survival", shade = TRUE ) ``` See the mosaic plots on page 77. ### Section 3.4: Three Categorical Variables ```{r} tally(~ gender + cats_dogs_both + drugs_y_n, format = "percent", data = OKCupid) ``` #### Example 3.7: Looking for Associations Among Three Variables at Once We can repeat the mosaic plot with three variables. ```{r} #| fig.height: 5 #| fig.width: 7 vcd::mosaic(tally(~ Sex + Survived + Class, data = Titanic), shade = TRUE) ``` #### Example 3.8: Simpson's Paradox: Gender Discrimination? Here we demonstrate how to generate one of the tables on page 80. ```{r} # Create a dataframe from the counts # http://mathemathinking.blogspot.com/2012/06/simpsons-paradox.html Berk <- bind_rows( do(512) * data.frame(admit = TRUE, sex = "M", school = "A"), do(825 - 512) * data.frame(admit = FALSE, sex = "M", school = "A"), do(89) * data.frame(admit = TRUE, sex = "F", school = "A"), do(19) * data.frame(admit = FALSE, sex = "F", school = "A") ) ``` As noted previously, the logical values `TRUE` and `FALSE` are all caps in R, but that code chunk options are all lower-case (e.g., `message: false`). Here, the `do(n)` function is used to create `n` observations with the specified values in `data.frame()`. The `bind_rows()` function can then be used to combine the data frames into one. ```{r} tally(~ sex + admit, data = Berk) tally(admit ~ sex, format = "percent", data = Berk) ```