---
title: "IS5 in R: Comparing Groups (Chapter 17)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "December 19, 2020"
output: 
  pdf_document:
    fig_height: 3
    fig_width: 5
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---


```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
library(mosaic)
library(readr)
library(janitor)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
  tidy = FALSE, # display code as typed
  size = "small" # slightly smaller font for code
)
```

## Introduction and background 

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock.
This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.
In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses.
A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

## Chapter 17: Comparing Groups

```{r}
library(mosaic)
library(readr)
library(janitor)
```

### Section 17.1: A Confidence Interval for the Difference Between Two Proportions

```{r}
# Creating a data frame for Seatbelts
Seatbelts <- rbind(
  do(2777) * data.frame(passenger = "F", belted = TRUE),
  do(4208 - 2777) * data.frame(passenger = "F", belted = FALSE),
  do(1363) * data.frame(passenger = "M", belted = TRUE),
  do(2763 - 1363) * data.frame(passenger = "M", belted = FALSE)
) %>%
  select(passenger, belted)
```

The `mosaic::do()` function constructs the correct number of rows for the data frame from provided cell counts.

```{r belted, cache=TRUE}
set.seed(234)
numsim <- 10000

# What does do() do?
abs(diffmean(belted ~ passenger, data = resample(Seatbelts)))
# Difference of proportions from one random resample
abs(diffmean(belted ~ passenger, data = resample(Seatbelts)))
# Difference of proportions from another random resample

do(2) * abs(diffmean(belted ~ passenger, data = resample(Seatbelts))) # Calculates two differences

# We want to calculate numsim resampled differences of proportions
seatbeltresamples <- do(numsim) * abs(diffmean(belted ~ passenger, data = resample(Seatbelts)))
```

For more information about `resample()`, refer to the resampling vignette: https://cran.r-project.org/web/packages/mosaic/vignettes/Resampling.pdf.  

```{r}
# Figure 17.1, page 542
gf_histogram(~diffmean, data = seatbeltresamples) %>%
  gf_labs(x = "Difference of Proportions", y = "# of Resamples")
```

#### Example 17.1: Finding the Standard Error of a Difference in Proportions

We begin with some wrangling to create the dataset.

```{r}
# Creating the data set for online profiles
OnlineProf <- rbind(
  do(141) * data.frame(gender = "M", profile = TRUE), # 248 * .57 rounds to 141
  do(107) * data.frame(gender = "M", profile = FALSE), # 248 - 141
  do(179) * data.frame(gender = "F", profile = TRUE),
  do(77) * data.frame(gender = "F", profile = FALSE)
)
tally(~gender, data = OnlineProf)
```

```{r}
OnlineProfM <- OnlineProf %>%
  filter(gender == "M") # Make a data set for male observations
nM <- nrow(OnlineProfM)
nM # n for males
propMyes <- mean(~profile, data = OnlineProfM)
propMyes # p for males
sepboys <- ((propMyes * (1 - propMyes)) / nM)^.5
sepboys # SE for males
```

```{r}
OnlineProfF <- OnlineProf %>%
  filter(gender == "F") # Make a data set for male observations
nF <- nrow(OnlineProfF)
nF # n for females
propFyes <- mean(~profile, data = OnlineProfF)
propFyes # p for females
sepgirls <- ((propFyes * (1 - propFyes)) / nF)^.5
sepgirls # SE for females
```

```{r}
sep <- (sepboys^2 + sepgirls^2)^.5
sep # overall SE
```

#### Example 17.2: Finding a Two-Proportion *z*-Interval

We can calculate the desired Z interval.

```{r}
zstats <- qnorm(p = c(.025, .975))
(propFyes - propMyes) + zstats * sep

# Or, you can use:
prop.test(x = c(179, 141), n = c(nF, nM), correct = FALSE)
```

The `prop.test()` function can be used to find confidence intervals and p-values of both one or two proportion z-tests.  

### Section 17.2: Assumptions and Conditions for Comparing Proportions

### Section 17.3: The Two-Sample *z*-Test: Testing for the Difference Between Proportions

#### Step-By-Step Example: A Two-Proportion *z*-Test

Again, we need to create the data table of counts.

```{r}
# Create the data set
SleepHabits <- rbind(
  do(205) * data.frame(gen = "GenY", internet = TRUE),
  do(293 - 205) * data.frame(gen = "GenY", internet = FALSE),
  do(235) * data.frame(gen = "GenX", internet = TRUE),
  do(469 - 235) * data.frame(gen = "GenX", internet = FALSE)
)
```

```{r}
# Mechanics
ngeny <- nrow(filter(SleepHabits, gen == "GenY"))
ngeny # n for GenY
ygeny <- nrow(filter(SleepHabits, gen == "GenY" & internet == TRUE))
ygeny # y for GenY
pgeny <- mean(~internet, data = filter(SleepHabits, gen == "GenY"))
pgeny # proportion for GenY

ngenx <- nrow(filter(SleepHabits, gen == "GenX"))
ngenx # n for GenX
ygenx <- nrow(filter(SleepHabits, gen == "GenX" & internet == TRUE))
ygenx # y for GenX
pgenx <- mean(~internet, data = filter(SleepHabits, gen == "GenX"))
pgenx # proportion for GenX

sepgen <- ((pgeny * (1 - pgeny)) / ngeny + (pgenx * (1 - pgenx)) / ngenx)^.5
sepgen # overall SE
pdiff <- pgeny - pgenx
pdiff # difference between proportions
z <- (pdiff - 0) / sepgen
z
2 * pnorm(q = z, lower.tail = FALSE)
```

### Section 17.4: A Confidence Interval for the Difference Between Two Means

The `t.test()` function can be used to generate a confidence interval for the difference between two means.  The `conf.level` option can be used to create different intervals.

#### Example 17.7: Finding a Confidence Interval for the Difference in Sample Means

We can calculate the confidence interval using summary statistics.

```{r}
# page 555
nord <- 27 # n for ordinary bowls
nref <- 27 # n for refilling bowls
yord <- 8.5 # y for ordinary bowls
yref <- 14.7 # y for refilling bowls
sord <- 6.1 # standard deviation for ordinary bowls
sref <- 8.4 # standard deviation for refilling bowls

seys <- 2.0 # overall SE
diffy <- yref - yord # difference between y's is 6.2
tstats <- qt(p = c(.025, .975), df = 47.46)
tstats
me <- tstats * seys
me # margin of error
diffy + me # confidence interval
```

### Section 17.5: The Two-Sample *t*-Test: Testing for the Difference Between Two Means

#### Step-By-Step Example: A Two-Sample *t*-Test for the Difference Between the Two Means

We begin by reading the data.

```{r message = FALSE}
# page 556
BuyingCam <- read_csv("http://nhorton.people.amherst.edu/is5/data/Buy_from_a_friend.csv")
```


```{r message = FALSE, warning = FALSE}
library(tidyr) # for gather() function
BuyingCam <- BuyingCam %>%
  gather(key = buying_type, value = amount_offered, Friend, Stranger)
# Model
gf_boxplot(amount_offered ~ buying_type, fill = ~buying_type, data = BuyingCam) %>%
  gf_labs(x = "Buying From", y = "Amount Offered ($)", fill = "")

BuyingCam %>%
  filter(buying_type == "Friend") %>%
  gf_histogram(~amount_offered, binwidth = 25, center = 12) %>%
  gf_labs(x = "Buy from Friend", y = "")
BuyingCam %>%
  filter(buying_type == "Stranger") %>%
  gf_histogram(~amount_offered, binwidth = 50, center = 20) %>%
  gf_labs(x = "Buy from Stranger", y = "")
```

We can replicate the analyses on pages 557-558.
```{r}
df_stats(amount_offered ~ buying_type, data = BuyingCam)
t.test(amount_offered ~ buying_type, data = BuyingCam)
```

### Section 17.6: Randomization Tests and Confidence Intervals for Two Means

We begin by reading in the Cars dataset.

```{r message = FALSE}
Cars <- read_csv("http://nhorton.people.amherst.edu/is5/data/Car_speeds.csv")
# Figure 17.2 (page 560) is the same as Figure 4.4 (page 102)
df_stats(speed ~ direction, data = Cars)
```

```{r Carsims, cache=TRUE, warning = FALSE}
set.seed(23456)
numsim <- 10000
CarSims <- do(numsim) * diffmean(speed ~ shuffle(direction), data = Cars)
# Figure 17.3, page 560
gf_histogram(~diffmean, data = CarSims, binwidth = .1, center = .05) %>%
  gf_vline(xintercept = 2.53) %>%
  gf_labs(x = "Differences in Means from 10,000 Trials", y = "# of Trials")
```

```{r Carboots, cache=TRUE}
set.seed(32453)
numsim <- 10000

CarBoots <- do(numsim) * diffmean(speed ~ direction, data = resample(Cars))
qdata(~diffmean, p = c(.025, .975), data = CarBoots)
CarBoots <- CarBoots %>%
  mutate(interval = ifelse(diffmean > 1.88 & diffmean < 3.19, "Within 95% Confidence",
    "Outside 95% Confidence"
  ))
```

```{r warning = FALSE}
# Figure 17.4
gf_histogram(~diffmean,
  fill = ~interval, data = CarBoots, binwidth = .2,
  center = .1
) %>%
  gf_vline(xintercept = 1.88) %>%
  gf_vline(xintercept = 3.19) %>%
  gf_labs(x = "Difference in Means", y = "# of Trials") +
  guides(fill = FALSE) # to remove the legend
```

### Section 17.7: Pooling

The pooled variance t.test can be generated by using the option `var.equal = TRUE`.

```{r}
t.test(amount_offered ~ buying_type, var.equal = TRUE, data = BuyingCam)
```


### Section 17.8: The Standard Deviation of a Difference