--- title: "IS5 in R: Comparing Counts (Chapter 19)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "2025-01-20" date-format: iso format: pdf toc: true editor: source --- ```{r} #| label: setup #| include: false library(mosaic) library(tidyverse) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated Quarto reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. We begin by loading packages that will be required for our analyses. ```{r} library(mosaic) library(tidyverse) ``` ## Chapter 19: Comparing Counts ```{r} Zodiac <- read_csv("http://nhorton.people.amherst.edu/is5/data/Zodiac.csv") ``` By default, `read_csv()` prints the variable names. These messages can be suppressed using the `message: false` code chunk option to save space and improve readability. ```{r} Zodiac |> select(Month, Births) ``` ### Section 19.1: Goodness-of-Fit Tests #### Example 19.1: Finding Expected Counts ```{r} #| message: false # page 611 BaseballBirths <- read_csv("http://nhorton.people.amherst.edu/is5/data/Ballplayer_births.csv") |> janitor::clean_names() # doesn't contain national birth % ``` Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). ```{r} natbirth <- c(.08, .07, .08, .08, .08, .08, .09, .09, .09, .09, .08, .09) BaseballBirths <- cbind(BaseballBirths, natbirth) # adding a column for national birth % totaln <- sum(~ballplayer_count, data = BaseballBirths) totaln BaseballBirths <- BaseballBirths |> mutate( expected = totaln * natbirth, observed = ballplayer_count, contrib = (observed - expected)^2 / expected ) sum(~contrib, data = BaseballBirths) ``` #### Assumptions and Conditions #### Calculations #### Chi-Square P-values ```{r} # Examples of chisq p-values qchisq(df = 2, p = .1, lower.tail = FALSE) qchisq(df = 10, p = .05, lower.tail = FALSE) ``` #### Example 19.3: Doing a Goodness-of-Fit Test ```{r} # page 614 df <- nrow(BaseballBirths) - 1 df chisq <- sum(~contrib, data = BaseballBirths) xpchisq(q = chisq, df = df, lower.tail = FALSE) ``` #### Step-By-Step Example: A Chi-Square Test for Goodness-of-Fit ```{r} #| warning: false expected <- mean(~ Births, data = Zodiac) expected gf_col(Births ~ Month, data = Zodiac) |> gf_hline(yintercept = expected) |> gf_labs(x = "Sign", y = "Count") + theme(axis.text.x = element_text(angle = 30, hjust = 1)) # to adjust the angle of the x axis labels ``` ```{r} # Mechanics df <- nrow(Zodiac) - 1 df Zodiac <- Zodiac |> mutate(chisq = ((Births - Expected)^2) / Expected) chisq <- sum(~chisq, data = Zodiac) chisq xpchisq(q = chisq, df = df, lower.tail = FALSE) ``` #### The Chi-Square Calculation ```{r} Zodiac |> mutate(residsq = Residual^2) |> mutate(component = residsq / Expected) ``` #### The Trouble with Goodness-of-Fit Tests: What's the Alternative? ### Section 19.2: Chi-Square Test of Homogeneity ```{r} # Create the data set Postgrad <- bind_rows( do(209) * data.frame(activity = "Employed", school = "Agriculture"), do(198) * data.frame(activity = "Employed", school = "Arts & Sciences"), do(177) * data.frame(activity = "Employed", school = "Engineering"), do(101) * data.frame(activity = "Employed", school = "ILR"), do(104) * data.frame(activity = "Grad School", school = "Agriculture"), do(171) * data.frame(activity = "Grad School", school = "Arts & Sciences"), do(158) * data.frame(activity = "Grad School", school = "Engineering"), do(33) * data.frame(activity = "Grad School", school = "ILR"), do(135) * data.frame(activity = "Other", school = "Agriculture"), do(115) * data.frame(activity = "Other", school = "Arts & Sciences"), do(39) * data.frame(activity = "Other", school = "Engineering"), do(16) * data.frame(activity = "Other", school = "ILR") ) ``` ```{r} # Table 19.1, page 618 tally(activity ~ school, data = Postgrad, margins = TRUE) # Table 19.2 tally(activity ~ school, format = "percent", data = Postgrad, margins = TRUE) # Table 19.3 with(chisq.test(tally(activity ~ school, data = Postgrad, margins = TRUE)), expected) ``` #### Step-By-Step Example: A Chi-Square Test for Homogeneity We can undertake a chi-square test for homogeneity. First let's display the data. ```{r, fig.width=7} tally(activity ~ school, format = "percent", data = Postgrad) |> data.frame() |> gf_col(Freq ~ school, fill = ~activity, position = "dodge") |> gf_labs(x = "School", y = "Percent", fill = "") ``` ```{r} # Mechanics tally(activity ~ school, data = Postgrad, margins = TRUE) with(chisq.test(tally(activity ~ school, data = Postgrad, margins = TRUE)), expected) with(chisq.test(tally(activity ~ school, data = Postgrad)), statistic) xpchisq(q = 93.7, df = 6, lower.tail = FALSE) ``` ### Section 19.3: Examining the Residuals ```{r} # Table 19.4, page 622 with(chisq.test(tally(activity ~ school, data = Postgrad, margins = TRUE)), residuals) ``` #### Example 19.4: Looking at $\chi^2$, Residuals ```{r} BaseballBirths |> mutate(residuals = (ballplayer_count - expected) / (expected^.5)) |> select(month, residuals) ``` ### Section 19.4: Chi-Square Test of Independence ```{r} #| message: false Tattoos <- read_csv("http://nhorton.people.amherst.edu/is5/data/Tattoos.csv", skip = 1) |> janitor::clean_names() # skip = 1 because first row is "Col1", "Col2" # Table 19.5, page 623 tally(location ~ has_hepatitis_c, data = Tattoos, margins = TRUE) ``` #### Assumptions and Conditions #### Step-By-Step Example: A Chi-Square Test for Independence We use the `mosaic::tally()` function to prepare the data for the graphical display. ```{r} tally(has_hepatitis_c ~ location, format = "percent", data = Tattoos) |> data.frame() |> filter(has_hepatitis_c == "Yes") |> gf_col(Freq ~ location) |> gf_labs(x = "Tattoo Status", y = "Proportion Infected", title = "Tattoos and Hepatitis C") # Observed tally(location ~ has_hepatitis_c, data = Tattoos, margins = TRUE) # Expected with(chisq.test(tally(location ~ has_hepatitis_c, data = Tattoos, margins = TRUE)), expected) ``` We note the warning that several of the expected cell counts are less than 5, which raises concerns about the accuracy of the test. ```{r} # Mechanics with(chisq.test(tally(location ~ has_hepatitis_c, data = Tattoos)), statistic) xpchisq(q = 57.9, df = 2, lower.tail = FALSE) ``` #### Examine the Residuals ```{r} # Table 19.6, page 627 with(chisq.test(tally(location ~ has_hepatitis_c, data = Tattoos)), residuals) ``` ```{r} # Table 19.7, page 628 Tattoos <- Tattoos |> mutate(tattoo = ifelse(location == "No Tattoo", "None", "Tattoo")) tally(tattoo ~ has_hepatitis_c, margins = TRUE, data = Tattoos) ``` #### Chi-Square and Causation