---
title: "IS5 in R: Testing Hypotheses (Chapter 15)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "December 13, 2020"
output: 
  pdf_document:
    fig_height: 3
    fig_width: 5
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---


```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
library(mosaic)
library(readr)
library(janitor)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
  tidy = FALSE, # display code as typed
  size = "small" # slightly smaller font for code
)
```

## Introduction and background 

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock.
This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.
In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses.
A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

## Chapter 15: Testing Hypotheses

```{r}
library(mosaic)
library(readr)
library(janitor)
```   

### Section 15.1: Hypotheses

### Section 15.2: P-Values

### Section 15.3: The Reasoning of Hypothesis Testing

#### Example 15.5: Finding A P-Value

It is straightforward to find p-values using summary statistics. 

```{r}
n <- 90
x <- 61
p <- .8
phat <- x / n
sdphat <- ((p * (1 - p)) / n)^.5
z <- (phat - p) / sdphat
pnorm(z)
# Or, without calculating the z-score:
pnorm(q = phat, mean = p, sd = sdphat)
```

### Section 15.4: A Hypothesis Test for the Mean

We begin by reading the data.

```{r message = FALSE}
GestationTime <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nashville.csv")
```

By default, `read_csv()` prints the variable names. These messages can be suppressed using the `message=FALSE` code chunk option to save space and improve readability.  

```{r warning = FALSE}
# 2. Model (page 482)
gf_histogram(~Gestation, data = GestationTime, binwidth = 7.5, center = 3.75) %>%
  gf_labs(x = "Gestation Time (days)", y = "# of Births")
# 3. Mechanics
gf_dist(dist = "t", df = 69) %>%
  gf_vline(xintercept = -3.118) %>%
  gf_vline(xintercept = 3.118) %>%
  gf_labs(x = "", y = "") +
  xlim(-3.347, 3.347)
```

#### Step-By-Step Example: A One-Sample *t*-Test for the Mean

We begin by reading in the data.

```{r message = FALSE, warning = FALSE}
# page 485
Sleep <- read_csv("http://nhorton.people.amherst.edu/is5/data/Sleep.csv")
# Plan
df_stats(~Sleep, data = Sleep)
gf_histogram(~Sleep, data = Sleep, binwidth = 1) %>%
  gf_labs(x = "Hours of Sleep", y = "")
gf_dist(dist = "t", df = 24) %>%
  gf_vline(xintercept = -1.67) %>%
  gf_labs(x = "", y = "") +
  xlim(-3, 3)
# Mechanics
n <- 25
mean <- 7.0
df <- 24
y <- 6.64
s <- 1.075
sey <- s / (n^.5)
t <- (y - mean) / sey # t-statistic
pt(q = t, df = df) # p-value
```

### Section 15.5: Intervals and Tests

It is straightforward to calculate confidence intervals and carry out hypothesis tests.

```{r message = FALSE}
# page 487
Temperatures <- read_csv("http://nhorton.people.amherst.edu/is5/data/Normal_temperature.csv")
df_stats(~Temp, data = Temperatures)
gf_histogram(~Temp, data = Temperatures, binwidth = .2)
# Confidence interval
y <- mean(~Temp, data = Temperatures)
y
s <- sd(~Temp, data = Temperatures)
s
n <- nrow(Temperatures)
n
tstats <- qt(df = n - 1, p = c(.005, .995))
tstats
y + (tstats * (s / (n^.5)))
# Hypothesis test
mu <- 98.6
t <- (y - mu) / (s / (n^.5))
t
2 * pt(q = t, df = n - 1) # two sided test
```

#### Random Matters: Bootstrap Hypothesis Tests and Intervals

The boostrap is a flexible alternative approach to inference.

```{r}
numsamp <- 10000

# What does do() do?
mean(~Temp, data = resample(Temperatures)) # Mean of one random resample
mean(~Temp, data = resample(Temperatures)) # Mean of another random resample

do(2) * mean(~Temp, data = resample(Temperatures)) # Calculates means of two resamples

# We will use do() a numsamp number of times
resampletemps <- do(numsamp) * mean(~Temp, data = resample(Temperatures))
```

For more information about `resample()`, refer to the [resample vignette in mosaic](https://cran.r-project.org/web/packages/mosaic/vignettes/Resampling.pdf).  

```{r warning = FALSE}
gf_histogram(~mean, data = resampletemps) %>%
  gf_labs(x = "Mean Temperature", y = "# of Samples")
qdata(~mean, p = c(.005, .995), data = resampletemps) # reject null hypothesis

# Making a model-centric distribution
Temperatures2 <- Temperatures %>%
  mutate(Temp = Temp + .315)
resampletemps2 <- do(numsamp) * mean(~Temp, data = resample(Temperatures2))
gf_histogram(~mean, data = resampletemps2) %>%
  gf_vline(xintercept = mean(~Temp, data = Temperatures)) %>%
  gf_labs(x = "Mean Temperature Centered at 98.6", y = "# of Samples")
```

#### Step-By-Step Example: Tests and Intervals

We begin by creating the dataset.

```{r}
# Creating the data set
Baseball <- rbind(
  do(1308) * (winner <- "HOME"),
  do(2431 - 1308) * (winner <- "AWAY")
) %>%
  rename(winner = result)
# Mechanics (page 490)
n <- nrow(Baseball)
p <- .5
phat <- Baseball %>%
  filter(winner == "HOME") %>%
  nrow() / n
phat
sdphat <- ((p * (1 - p)) / n)^.5
sdphat
z <- (phat - p) / sdphat # z-value
z
1 - pnorm(z) # p-value
# Or, without calculating the z-score:
1 - pnorm(q = phat, mean = p, sd = sdphat)
# Mechanics (page 491)
sep <- ((phat * (1 - phat)) / n)^.5
sep
me <- 1.96 * sep
phat - me # lower bound of 95% confidence
phat + me # upper bound of 95% confidence
```

### Section 15.6: P-Values and Decisions: What to Tell About a Hypothesis Test