---
title: "IS5 in R: Relationships Between Categorical Variables--Contingency Tables (Chapter 3)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "December 19, 2020"
output:
  pdf_document:
    fig_height: 3
    fig_width: 5
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---

```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
library(mosaic)
library(readr)
library(janitor)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
  tidy = FALSE,     # display code as typed
  size = "small"    # slightly smaller font for code
)
```

## Introduction and background

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic).

A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

## Chapter 3: Relationships Between Categorical Variables--Contingency Tables

### Section 3.1: Contingency Tables

```{r message = FALSE}
library(mosaic)
library(readr)
library(janitor)
OKCupid <- read_csv("http://nhorton.people.amherst.edu/is5/data/OKCupid_CatsDogs.csv", skip = 1) %>%
  janitor::clean_names()
names(OKCupid)
```

The `read_csv()` function lists the input variable names by default. These messages were suppressed using the `message = FALSE` code chunk option to save space.

Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). You can use the `names()` function to check the cleaned names.

We use `skip = 1` because the first line in the original data set is a set of variable labels (e.g., `Col1`, `Col2`).

```{r}
# Table 3.1, page 65
tally(~ cats_dogs_both + gender, margins = TRUE, useNA = "no", data = OKCupid)
# Table 3.2
tally(~ cats_dogs_both + gender,
  format = "percent", margins = TRUE, useNA = "no",
  data = OKCupid
)
tally(cats_dogs_both ~ gender,
  format = "percent", margins = TRUE, useNA = "no",
  data = OKCupid
)
# Table 3.3
tally(gender ~ cats_dogs_both, format = "percent", margins = TRUE, data = OKCupid)
```

#### Example 3.1: Exploring Marginal Distributions

We begin by reading and tallying the data.

```{r message = FALSE}
SuperBowl <- read_csv("http://nhorton.people.amherst.edu/is5/data/Watch_the_Super_bowl.csv",
  skip = 1
)
tally(~ Plan + Sex, data = SuperBowl)
```

#### Example 3.2: Exploring Percentages: Children and First-Class Ticket Holders First?

We do the same for the Titanic data.

```{r message = FALSE}
Titanic <- read_csv("http://nhorton.people.amherst.edu/is5/data/Titanic.csv")
tally(~ Class + Survived, format = "percent", margins = TRUE, data = Titanic)
tally(Class ~ Survived, format = "percent", margins = TRUE, data = Titanic)
tally(Survived ~ Class, format = "percent", margins = TRUE, data = Titanic)
```

### Section 3.2: Conditional Distributions

See the displays on pages 68-69.

```{r fig.width=7}
OKdata <- tally(cats_dogs_both ~ gender,
  format = "percent",
  useNA = "no", data = OKCupid
) %>%
  data.frame()
# Figure 3.2, page 69
gf_col(Freq ~ gender, fill = ~cats_dogs_both, position = "dodge", data = OKdata) %>%
  gf_labs(x = "", y = "", fill = "")
```

#### Example 3.3: Finding Conditional Distributions: Watching the Super Bowl

We can calculate conditional distributions from tables using `mosaic::tally()`. The formula `Plan ~ Sex` gives the distribution of `Plan` within each level of `Sex`.

```{r}
tally(~ Plan + Sex, margins = TRUE, data = SuperBowl)
tally(Plan ~ Sex, format = "percent", data = SuperBowl)
```

#### Example 3.4: Looking for Associations Between Variables: Still Watching the Super Bowl

```{r}
Superdata <- tally(Plan ~ Sex, format = "percent", data = SuperBowl) %>%
  data.frame()
gf_col(Freq ~ Plan, fill = ~Sex, position = "dodge", data = Superdata)
```

#### Examining Contingency Tables

See the displays on page 72.

```{r message = FALSE}
FishDiet <- read_csv("http://nhorton.people.amherst.edu/is5/data/Fish_diet.csv", skip = 1) %>%
  janitor::clean_names()
tally(~ diet_counts + cancer_counts, margins = TRUE, data = FishDiet)
```

#### Random Matters

See the display on page 74.

```{r message = FALSE}
Nightmares <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nightmares.csv", skip = 1)
Nightmares <- Nightmares %>%
  mutate(Dream = ifelse(Dream == "N", "Nightmare", "SweetDreams"))
tally(~ Dream + Side, margins = TRUE, data = Nightmares)
```

### Section 3.3: Displaying Contingency Tables

```{r}
tally(~ Class + Survived, format = "count", data = Titanic)
tally(~ Class + Survived, format = "percent", data = Titanic)
# Figure 3.4, page 75
gf_percents(~Class, fill = ~Survived, position = position_dodge(), data = Titanic)
# Figure 3.5
gf_percents(~Survived, fill = ~Class, position = "fill", data = Titanic)
```

```{r fig.width=7, fig.height = 7}
# Figure 3.6, page 76
vcd::mosaic(tally(~ Survived + Class, data = Titanic),
  main = "Mosaic plot of Class by Survival", shade = TRUE
)
```

See the mosaic plots on page 77.

### Section 3.4: Three Categorical Variables

```{r}
tally(~ gender + cats_dogs_both + drugs_y_n, format = "percent", data = OKCupid)
```

#### Example 3.7: Looking for Associations Among Three Variables at Once

We can repeat the mosaic plot with three variables.

```{r fig.height=5, fig.width=7}
vcd::mosaic(tally(~ Sex + Survived + Class, data = Titanic), shade = TRUE)
```

#### Example 3.8: Simpson's Paradox: Gender Discrimination?

Here we demonstrate how to generate one of the tables on page 80.

```{r}
# Create a dataframe from the counts
# http://mathemathinking.blogspot.com/2012/06/simpsons-paradox.html
Berk <- rbind(
  do(512) * data.frame(admit = TRUE, sex = "M", school = "A"),
  do(825 - 512) * data.frame(admit = FALSE, sex = "M", school = "A"),
  do(89) * data.frame(admit = TRUE, sex = "F", school = "A"),
  do(19) * data.frame(admit = FALSE, sex = "F", school = "A")
)
```

In this case, `do(n)` creates `n` observations with the specified values in `data.frame()`. The `rbind()` function can then be used to combine the data frames into one.

```{r}
tally(~ sex + admit, data = Berk)
tally(admit ~ sex, format = "percent", data = Berk)
```
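
As a cross-check on the construction above, here is a minimal base R sketch that expands the same four counts into one row per applicant without `do()`. The object names `counts`, `Berk2`, and `tab` are hypothetical and not part of the original analysis. The sketch assumes only the school A counts shown above.

```{r}
# Hypothetical summary table holding the same counts as above (school A only)
counts <- data.frame(
  admit = c(TRUE, FALSE, TRUE, FALSE),
  sex = c("M", "M", "F", "F"),
  school = "A",
  n = c(512, 825 - 512, 89, 19)
)

# Repeat each row index n times to get one row per applicant
Berk2 <- counts[rep(seq_len(nrow(counts)), times = counts$n), c("admit", "sex", "school")]
rownames(Berk2) <- NULL

# Cross-tabulate admission by sex, then convert to percentages within each sex
tab <- table(admit = Berk2$admit, sex = Berk2$sex)
tab
round(100 * prop.table(tab, margin = 2), 1)
```

The column percentages from `prop.table(tab, margin = 2)` should agree with the output of `tally(admit ~ sex, format = "percent", data = Berk)`, since both condition on `sex`.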