--- title: "IS5 in R: Comparing Counts (Chapter 19)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "December 17, 2020" output: pdf_document: fig_height: 3 fig_width: 5 html_document: fig_height: 3 fig_width: 5 word_document: fig_height: 4 fig_width: 6 --- ```{r, include = FALSE} # Don't delete this chunk if you are using the mosaic package # This loads the mosaic and dplyr packages library(mosaic) library(readr) library(janitor) ``` ```{r, include = FALSE} # knitr settings to control how R chunks work. require(knitr) opts_chunk$set( tidy = FALSE, # display code as typed size = "small" # slightly smaller font for code ) ``` ## Introduction and background ## Chapter 19: Comparing Counts ```{r} library(mosaic) library(readr) library(janitor) Zodiac <- read_csv("http://nhorton.people.amherst.edu/is5/data/Zodiac.csv") ``` By default, `read_csv()` prints the variable names. These messages can be suppressed using the `message=FALSE` code chunk option to save space and improve readability. ```{r} Zodiac %>% select(Month, Births) ``` ### Section 19.1: Goodness-of-Fit Tests #### Example 19.1: Finding Expected Counts ```{r message = FALSE} # page 611 BaseballBirths <- read_csv("http://nhorton.people.amherst.edu/is5/data/Ballplayer_births.csv") %>% janitor::clean_names() # doesn't contain national birth % ``` Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). ```{r} natbirth <- c(.08, .07, .08, .08, .08, .08, .09, .09, .09, .09, .08, .09) BaseballBirths <- cbind(BaseballBirths, natbirth) # adding a column for national birth % totaln <- sum(~ballplayer_count, data = BaseballBirths) totaln BaseballBirths <- BaseballBirths %>% mutate( expected = totaln * natbirth, observed = ballplayer_count, contrib = (observed - expected)^2 / expected ) sum(~contrib, data = BaseballBirths) ``` #### Assumptions and Conditions #### Calculations #### Chi-Square P-values ```{r} # Examples of chisq p-values qchisq(df = 2, p = .1, lower.tail = FALSE) qchisq(df = 10, p = .05, lower.tail = FALSE) ``` #### Example 19.3: Doing a Goodness-of-Fit Test ```{r} # page 614 df <- nrow(BaseballBirths) - 1 df chisq <- sum(~contrib, data = BaseballBirths) xpchisq(q = chisq, df = df, lower.tail = FALSE) ``` #### Step-By-Step Example: A Chi-Square Test for Goodness-of-Fit ```{r warning = FALSE} expected <- mean(~Births, data = Zodiac) expected gf_col(Births ~ Month, data = Zodiac) %>% gf_hline(yintercept = expected) %>% gf_labs(x = "Sign", y = "Count") + theme(axis.text.x = element_text(angle = 30, hjust = 1)) # to adjust the angle of the x axis labels ``` ```{r} # Mechanics df <- nrow(Zodiac) - 1 df Zodiac <- Zodiac %>% mutate(chisq = ((Births - Expected)^2) / Expected) chisq <- sum(~chisq, data = Zodiac) chisq xpchisq(q = chisq, df = df, lower.tail = FALSE) ``` #### The Chi-Square Calculation ```{r} Zodiac %>% mutate(residsq = Residual^2) %>% mutate(component = residsq / Expected) ``` #### The Trouble with Goodness-of-Fit Tests: What's the Alternative? ### Section 19.2: Chi-Square Test of Homogeneity ```{r} # Create the data set Postgrad <- rbind( do(209) * data.frame(activity = "Employed", school = "Agriculture"), do(198) * data.frame(activity = "Employed", school = "Arts & Sciences"), do(177) * data.frame(activity = "Employed", school = "Engineering"), do(101) * data.frame(activity = "Employed", school = "ILR"), do(104) * data.frame(activity = "Grad School", school = "Agriculture"), do(171) * data.frame(activity = "Grad School", school = "Arts & Sciences"), do(158) * data.frame(activity = "Grad School", school = "Engineering"), do(33) * data.frame(activity = "Grad School", school = "ILR"), do(135) * data.frame(activity = "Other", school = "Agriculture"), do(115) * data.frame(activity = "Other", school = "Arts & Sciences"), do(39) * data.frame(activity = "Other", school = "Engineering"), do(16) * data.frame(activity = "Other", school = "ILR") ) ``` ```{r} # Table 19.1, page 618 tally(activity ~ school, data = Postgrad, margins = TRUE) # Table 19.2 tally(activity ~ school, format = "percent", data = Postgrad, margins = TRUE) # Table 19.3 with(chisq.test(tally(activity ~ school, data = Postgrad, margins = TRUE)), expected) ``` #### Step-By-Step Example: A Chi-Square Test for Homogeneity We can undertake a chi-square test for homogeneity. First let's display the data. ```{r, fig.width=7} tally(activity ~ school, format = "percent", data = Postgrad) %>% data.frame() %>% gf_col(Freq ~ school, fill = ~activity, position = "dodge") %>% gf_labs(x = "School", y = "Percent", fill = "") ``` ```{r} # Mechanics tally(activity ~ school, data = Postgrad, margins = TRUE) with(chisq.test(tally(activity ~ school, data = Postgrad, margins = TRUE)), expected) with(chisq.test(tally(activity ~ school, data = Postgrad)), statistic) xpchisq(q = 93.7, df = 6, lower.tail = FALSE) ``` ### Section 19.3: Examining the Residuals ```{r} # Table 19.4, page 622 with(chisq.test(tally(activity ~ school, data = Postgrad, margins = TRUE)), residuals) ``` #### Example 19.4: Looking at $\chi^2$, Residuals ```{r} BaseballBirths %>% mutate(residuals = (ballplayer_count - expected) / (expected^.5)) %>% select(month, residuals) ``` ### Section 19.4: Chi-Square Test of Independence ```{r message = FALSE} Tattoos <- read_csv("http://nhorton.people.amherst.edu/is5/data/Tattoos.csv", skip = 1) %>% janitor::clean_names() # skip = 1 because first row is "Col1", "Col2" # Table 19.5, page 623 tally(location ~ has_hepatitis_c, data = Tattoos, margins = TRUE) ``` #### Assumptions and Conditions #### Step-By-Step Example: A Chi-Square Test for Independence We use the `mosaic::tally()` function to prepare the data for the graphical display. ```{r} tally(has_hepatitis_c ~ location, format = "percent", data = Tattoos) %>% data.frame() %>% filter(has_hepatitis_c == "Yes") %>% gf_col(Freq ~ location) %>% gf_labs(x = "Tattoo Status", y = "Proportion Infected", title = "Tattoos and Hepatitis C") # Observed tally(location ~ has_hepatitis_c, data = Tattoos, margins = TRUE) # Expected with(chisq.test(tally(location ~ has_hepatitis_c, data = Tattoos, margins = TRUE)), expected) ``` We note the warning that several of the expected cell counts are less than 5, which raises concerns about the accuracy of the test. ```{r} # Mechanics with(chisq.test(tally(location ~ has_hepatitis_c, data = Tattoos)), statistic) xpchisq(q = 57.9, df = 2, lower.tail = FALSE) ``` #### Examine the Residuals ```{r} # Table 19.6, page 627 with(chisq.test(tally(location ~ has_hepatitis_c, data = Tattoos)), residuals) ``` ```{r} # Table 19.7, page 628 Tattoos <- Tattoos %>% mutate(tattoo = ifelse(location == "No Tattoo", "None", "Tattoo")) tally(tattoo ~ has_hepatitis_c, margins = TRUE, data = Tattoos) ``` #### Chi-Square and Causation