---
title: "IS5 in R: Comparing Counts (Chapter 19)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "2025-01-23"
date-format: iso
format: pdf
toc: true
editor: source
---

```{r}
#| label: setup
#| include: false
library(mosaic)   
library(tidyverse)
```


## Introduction and background 

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock.
This file as well as the associated Quarto reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.
In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses.
A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

We begin by loading packages that will be required for our analyses.

```{r}
library(mosaic)
library(tidyverse)
```

## Chapter 19: Comparing Counts

```{r}
Zodiac <- read_csv("http://nhorton.people.amherst.edu/is5/data/Zodiac.csv")
```

By default, `read_csv()` prints the variable names.
These messages can be suppressed using the `message: false` code chunk option to save space and improve readability.  

```{r}
Zodiac |>
  select(Month, Births)
```

### Section 19.1: Goodness-of-Fit Tests

#### Example 19.1: Finding Expected Counts

```{r}
#| message: false
# page 611
BaseballBirths <- read_csv("http://nhorton.people.amherst.edu/is5/data/Ballplayer_births.csv") |>
  janitor::clean_names() # doesn't contain national birth %
```

Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). 

```{r}
#| message: false
natbirth <- c(.08, .07, .08, .08, .08, .08, .09, .09, .09, .09, .08, .09)
BaseballBirths <- # adding a column for national birth %
  bind_cols(BaseballBirths, natbirth) 
totaln <- sum(~ ballplayer_count, data = BaseballBirths)
totaln
BaseballBirths <- BaseballBirths |>
  mutate(
    expected = totaln * natbirth,
    observed = ballplayer_count,
    contrib = (observed - expected)^2 / expected
  )
sum(~ contrib, data = BaseballBirths)
```

#### Assumptions and Conditions

#### Calculations

#### Chi-Square P-values

```{r}
# Examples of chisq p-values
qchisq(df = 2, p = .1, lower.tail = FALSE)
qchisq(df = 10, p = .05, lower.tail = FALSE)
```

#### Example 19.3: Doing a Goodness-of-Fit Test

```{r}
# page 614
df <- nrow(BaseballBirths) - 1
df
chisq <- sum(~contrib, data = BaseballBirths)
xpchisq(q = chisq, df = df, lower.tail = FALSE)
```

#### Step-By-Step Example: A Chi-Square Test for Goodness-of-Fit

```{r}
#| warning: false
expected <- mean(~ Births, data = Zodiac)
expected
gf_col(Births ~ Month, data = Zodiac) |>
  gf_hline(yintercept = expected) |>
  gf_labs(x = "Sign", y = "Count") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) # to adjust the angle of the x axis labels
```

```{r}
# Mechanics
df <- nrow(Zodiac) - 1
df
Zodiac <- Zodiac |>
  mutate(chisq = ((Births - Expected)^2) / Expected)
chisq <- sum(~chisq, data = Zodiac)
chisq
xpchisq(q = chisq, df = df, lower.tail = FALSE)
```

#### The Chi-Square Calculation

```{r}
Zodiac |>
  mutate(residsq = Residual^2) |>
  mutate(component = residsq / Expected)
```

#### The Trouble with Goodness-of-Fit Tests: What's the Alternative?

### Section 19.2: Chi-Square Test of Homogeneity

```{r}
# Create the data set
Postgrad <- bind_rows(
  do(209) * data.frame(activity = "Employed", school = "Agriculture"),
  do(198) * data.frame(activity = "Employed", school = "Arts & Sciences"),
  do(177) * data.frame(activity = "Employed", school = "Engineering"),
  do(101) * data.frame(activity = "Employed", school = "ILR"),
  do(104) * data.frame(activity = "Grad School", school = "Agriculture"),
  do(171) * data.frame(activity = "Grad School", school = "Arts & Sciences"),
  do(158) * data.frame(activity = "Grad School", school = "Engineering"),
  do(33) * data.frame(activity = "Grad School", school = "ILR"),
  do(135) * data.frame(activity = "Other", school = "Agriculture"),
  do(115) * data.frame(activity = "Other", school = "Arts & Sciences"),
  do(39) * data.frame(activity = "Other", school = "Engineering"),
  do(16) * data.frame(activity = "Other", school = "ILR")
)
```

```{r}
# Table 19.1, page 618
tally(activity ~ school, data = Postgrad, margins = TRUE)
# Table 19.2
tally(activity ~ school, format = "percent", data = Postgrad, margins = TRUE)
# Table 19.3
with(chisq.test(tally(activity ~ school, data = Postgrad, margins = TRUE)), expected)
```

#### Step-By-Step Example: A Chi-Square Test for Homogeneity

We can undertake a chi-square test for homogeneity. 
First let's display the data.

```{r, fig.width=7}
tally(activity ~ school, format = "percent", data = Postgrad) |>
  data.frame() |>
  gf_col(Freq ~ school, fill = ~activity, position = "dodge") |>
  gf_labs(x = "School", y = "Percent", fill = "")
```

```{r}
# Mechanics
tally(activity ~ school, data = Postgrad, margins = TRUE)
with(chisq.test(tally(activity ~ school, data = Postgrad, margins = TRUE)), expected)
with(chisq.test(tally(activity ~ school, data = Postgrad)), statistic)
xpchisq(q = 93.7, df = 6, lower.tail = FALSE)
```

### Section 19.3: Examining the Residuals

```{r}
# Table 19.4, page 622
with(chisq.test(tally(activity ~ school, data = Postgrad, margins = TRUE)), residuals)
```

#### Example 19.4: Looking at $\chi^2$, Residuals

```{r}
BaseballBirths |>
  mutate(residuals = (ballplayer_count - expected) / (expected^.5)) |>
  select(month, residuals)
```

### Section 19.4: Chi-Square Test of Independence

```{r}
#| message: false
Tattoos <- read_csv("http://nhorton.people.amherst.edu/is5/data/Tattoos.csv", skip = 1) |>
  janitor::clean_names() # skip = 1 because first row is "Col1", "Col2"
# Table 19.5, page 623
tally(location ~ has_hepatitis_c, data = Tattoos, margins = TRUE)
```

#### Assumptions and Conditions

#### Step-By-Step Example: A Chi-Square Test for Independence

We use the `mosaic::tally()` function to prepare the data for the graphical display.

```{r}
tally(has_hepatitis_c ~ location, format = "percent", data = Tattoos) |>
  data.frame() |>
  filter(has_hepatitis_c == "Yes") |>
  gf_col(Freq ~ location) |>
  gf_labs(x = "Tattoo Status", y = "Proportion Infected", title = "Tattoos and Hepatitis C")

# Observed
tally(location ~ has_hepatitis_c, data = Tattoos, margins = TRUE)
# Expected
with(chisq.test(tally(location ~ has_hepatitis_c, data = Tattoos, margins = TRUE)), expected)
```

We note the warning that several of the expected cell counts are less than 5, which raises concerns about the accuracy of the test.

```{r}
# Mechanics
with(chisq.test(tally(location ~ has_hepatitis_c, data = Tattoos)), statistic)
xpchisq(q = 57.9, df = 2, lower.tail = FALSE)
```

#### Examine the Residuals

```{r}
# Table 19.6, page 627
with(chisq.test(tally(location ~ has_hepatitis_c, data = Tattoos)), residuals)
```

```{r}
# Table 19.7, page 628
Tattoos <- Tattoos |>
  mutate(tattoo = ifelse(location == "No Tattoo", "None", "Tattoo"))
tally(tattoo ~ has_hepatitis_c, margins = TRUE, data = Tattoos)
```

#### Chi-Square and Causation