---
title: "Comparing Counts (Chapter 22)"
author: "Patrick Frenett, Vickie Ip, Sarah McDonald, and Nicholas Horton (nhorton@amherst.edu)"
date: "June 29, 2018"
output: 
  pdf_document:
    fig_height: 4
    fig_width: 6
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---
  
  
```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
require(mosaic)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
  tidy = FALSE,     # display code as typed
  size = "small",    # slightly smaller font for code
  fig.align = "center"
)
```

## Introduction and background 

This document is intended to help describe how to undertake analyses introduced 
as examples in the Fourth Edition of *Intro Stats* (2013) by De Veaux, Velleman, and Bock.
More information about the book can be found at http://wps.aw.com/aw_deveaux_stats_series.  This
file as well as the associated R Markdown reproducible analysis source file used to create it can be found at https://nhorton.people.amherst.edu/is4.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

Note that some of the figures in this document may differ slightly from those in the IS4 book due to small differences in datasets. However in all cases the analysis and techniques in R are accurate.

## Chapter 22: Comparing Counts

### Section 22.1: Goodness-of-fit tests

Here we verify the calculations of expected counts for
ballplayers by month (page 611).
```{r}
ballplayer <- c(137, 121, 116, 121, 126, 114, 
                102, 165, 134, 115, 105, 122)
national <- c(0.08, 0.07, 0.08, 0.08, 0.08, 0.08,
              0.09, 0.09, 0.09, 0.09, 0.08, 0.09)
n <- sum(~ ballplayer)
n
sum(~ national)
expect <- n * national
cbind(ballplayer, expect)
```

The chi-square quantile values in the table on the bottom of page 658 can be verified using the `xqt()` function.
```{r}
xqchisq(c(.90, .95, .975, .99, .995), df = 1)
```
These results match the first row: other values can be calculated by changing the
`df` argument.

The goodness of fit test on page 614 can be verified by calculating the
chi-square statistic.

```{r}
chisq <- sum((ballplayer - expect)^2/expect)
chisq
1 - xpchisq(chisq, df = 11, col = "slateblue2")
```

### Section 22.2: Chi-square test of homogeneity

Data from one university regarding
the association between postgraduation activity and area of study
is displayed in Table 22.1 (page 618).  

```{r}
schooldata <- rbind(
  do(209) * data.frame(activity = "agriculture",  area = "Employed"),
  do(198) * data.frame(activity = "arts/science", area = "Employed"),
  do(177) * data.frame(activity = "engineering",  area = "Employed"), 
  do(101) * data.frame(activity = "ILR",          area = "Employed"),
  do(104) * data.frame(activity = "agriculture",  area = "Grad school"),
  do(171) * data.frame(activity = "arts/science", area = "Grad school"),
  do(158) * data.frame(activity = "engineering",  area = "Grad school"), 
  do(33)  * data.frame(activity = "ILR",          area = "Grad school"),
  do(135) * data.frame(activity = "agriculture",  area = "Other"),
  do(115) * data.frame(activity = "arts/science", area = "Other"),
  do(39)  * data.frame(activity = "engineering",  area = "Other"), 
  do(16)  * data.frame(activity = "ILR",          area = "Other")
)

tally(~ activity + area, margins = TRUE, data = schooldata)
```

```{r fig.height = 5}
vcd::mosaic(tally(~ activity + area, data = schooldata), main = "Mosaicplot of Activity by area",
  shade = TRUE, ylab = "Area", xlab = "Activity")
xchisq.test(tally(~ activity + area, data = schooldata))
```


### Section 22.3: Examining the residuals

Note that the `xchisq.test()` function displays the standardized
residuals as the last item in each cell of the table (and these
match the results in Table 22.4 (page 623).