---
title: "IPS9 in R: Inference for means (Chapter 7)"
author: "Bonnie Lin and Nicholas Horton (nhorton@amherst.edu)"
date: "July 22, 2018"
output: 
  pdf_document:
    fig_height: 3
    fig_width: 5
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---


```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
require(mosaic)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
knitr::opts_chunk$set(
  tidy = FALSE,     # display code as typed
  size = "small"    # slightly smaller font for code
)
```

## Introduction and background 

These documents are intended to help describe how to undertake analyses introduced 
as examples in the Ninth Edition of *Introduction to the Practice of Statistics* (2017) by Moore, McCabe, and Craig.

More information about the book can be found [here](https://macmillanlearning.com/Catalog/product/introductiontothepracticeofstatistics-ninthedition-moore).
The data used in these documents can be found under Data Sets in the [Student Site](https://www.macmillanlearning.com/catalog/studentresources/ips9e?_ga=2.29224888.526668012.1531487989-1209447309.1529940008#). This
file as well as the associated R Markdown reproducible analysis source file used to create it can be found at https://nhorton.people.amherst.edu/ips9/.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic).  A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.
  
## Chapter 7: Inference for means
This file replicates the analyses from Chapter 7: Inference for means.

First, load the packages that will be needed for this document: 
```{r load-packages}
library(mosaic)
library(readr)
```

### Section 7.1: Inference for the mean of a population 
First, we read in the data on average time spent watching TV and keep the first 8 rows 
of the `Time` column, which serve as the simple random sample (SRS) of size 8 for this problem.
We then use the following functions to find the mean, standard deviation, and 95% confidence 
interval shown on pages 411-412. We also check the Normality condition for the 
one-sample *t* procedures by looking at a Normal quantile plot. 
```{r eg7-1, message=FALSE}
TVTime <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter07/EG07-01TVTIME.csv")
TVTime <- TVTime %>% select(Time) %>% head(., 8)
favstats(~ Time, data = TVTime)
t.test(~ Time, data = TVTime) 
# Figure 7.2
gf_qq(~ Time, data = TVTime) %>%
  gf_labs(x = "Normal score", y = "Time (hours per week)") 
```
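
The interval reported by `t.test()` is the one-sample *t* interval, $\bar{x} \pm t^* s/\sqrt{n}$, with $n - 1$ degrees of freedom. As a minimal sketch, the calculation can be reproduced by hand. The vector `x_demo` below holds made-up values for illustration and is not the textbook sample.

```{r ci-by-hand-sketch}
# Hypothetical values for illustration only (not the TVTime data)
x_demo <- c(4, 12, 9, 30, 6, 25, 1, 8)
n <- length(x_demo)
se <- sd(x_demo) / sqrt(n)        # standard error of the mean
tstar <- qt(0.975, df = n - 1)    # critical value for 95% confidence
ci_by_hand <- mean(x_demo) + c(-1, 1) * tstar * se
ci_by_hand
# The same interval as computed by t.test()
as.numeric(t.test(x_demo)$conf.int)
```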

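Then, we can conduct a significance test of the null hypothesis that the population mean 
equals the overall U.S. average, as demonstrated on page 414. Note that `t.test()` 
tests against `mu = 0` unless the hypothesized mean is supplied through the `mu` argument: 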

```{r eg7-3}
t.test(~ Time, data = TVTime, alternative = "less")
```
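
For reference, the one-sample *t* statistic is $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$, and for the alternative `"less"` the P-value is the lower-tail probability under a *t* distribution with $n - 1$ degrees of freedom. The sketch below checks this against `t.test()`. Both the data `x_demo` and the null value `mu0_demo` are hypothetical placeholders, not the values used in the textbook.

```{r one-sided-sketch}
# Hypothetical data and null value, for illustration only
x_demo <- c(4, 12, 9, 30, 6, 25, 1, 8)
mu0_demo <- 15
n <- length(x_demo)
# t statistic and lower-tail P-value computed by hand
t_by_hand <- (mean(x_demo) - mu0_demo) / (sd(x_demo) / sqrt(n))
p_by_hand <- pt(t_by_hand, df = n - 1)
# The same test, with the null value passed through the mu argument
res_less <- t.test(x_demo, mu = mu0_demo, alternative = "less")
c(t_by_hand = t_by_hand, t_test = unname(res_less$statistic))
c(p_by_hand = p_by_hand, p_test = res_less$p.value)
```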

### Section 7.2: Comparing two means
By testing whether the mean return of an investor's stock portfolio differs from the 
S&P 500 return (pages 415-418), we can assess the quality of a broker's management of 
this portfolio.
```{r eg7-4, message=FALSE}
STOCK <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter07/EG07-04STOCK.csv")
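favstats(~ Return, data = STOCK)
# Two-sided one-sample t test; the result also carries the 95% confidence interval
sigtest_STOCK <- t.test(~ Return, data = STOCK, alternative = "two.sided")
# Pull the confidence interval out of the test object
confint_STOCK <- with(sigtest_STOCK, conf.int)
# Shift both endpoints by 0.95
confint_STOCK - 0.95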
```
We use the `with()` function to extract the confidence interval from the 
`t.test()` output. Subtracting 0.95 from both endpoints then gives the corresponding 
interval for the portfolio's underperformance, which estimates the amount that the 
investor should be compensated. 
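
Shifting a confidence interval by a constant is equivalent to computing the interval for the shifted data. A small sketch of that equivalence follows, using made-up monthly returns in `returns_demo` and treating 0.95 as the benchmark constant from the chunk above.

```{r shift-ci-sketch}
# Hypothetical monthly returns, for illustration only (not the STOCK data)
returns_demo <- c(-2.3, 1.1, -0.4, -3.0, 0.8, -1.7, 0.2, -2.6)
benchmark <- 0.95
# Option 1: compute the interval, then subtract the constant
ci_shifted <- as.numeric(t.test(returns_demo)$conf.int) - benchmark
# Option 2: subtract the constant first, then compute the interval
ci_direct <- as.numeric(t.test(returns_demo - benchmark)$conf.int)
rbind(ci_shifted, ci_direct)
```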

```{r eg7-7, message=FALSE}
GEPARTS <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter07/EG07-07GEPARTS.csv") 
gf_dhistogram(~ Diff, data = GEPARTS, binwidth = 1/3, center = 1/6)
# The rows pair OptionOn with OptionOff, so use a matched pairs t procedure
with(GEPARTS, t.test(OptionOn, OptionOff, paired = TRUE, conf.level = 0.90))
```
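
When two columns record measurements on the same units, a matched pairs *t* procedure is the same as a one-sample *t* procedure applied to the differences. The sketch below illustrates this with invented measurements (`on_demo` and `off_demo` are hypothetical, not the GEPARTS values).

```{r paired-sketch}
# Hypothetical paired measurements, for illustration only
on_demo  <- c(10.2, 9.8, 11.1, 10.5, 9.9, 10.7)
off_demo <- c(10.0, 9.9, 10.8, 10.6, 9.7, 10.4)
# Matched pairs analysis with a 90% confidence interval
paired_res <- t.test(on_demo, off_demo, paired = TRUE, conf.level = 0.90)
# One-sample analysis of the differences
diff_res <- t.test(on_demo - off_demo, conf.level = 0.90)
rbind(paired = as.numeric(paired_res$conf.int),
      differences = as.numeric(diff_res$conf.int))
```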

```{r eg7-11, message=FALSE}
DRP <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter07/EG07-11DRP.csv")
# Figure 7.12, page 438
gf_qq(~ drp, color = ~ group, data = DRP) %>%
  gf_qqline(color = "black", linetype = "solid") %>%
  gf_labs(x = "Normal score", y = "Treatment/Control group DRP scores")
# Summary statistics, page 439
favstats(drp ~ group, data = DRP)

# 95% confidence interval for difference between treatment and control groups
t.test(drp ~ group, data = DRP)
```
Note that the textbook reports the difference as the mean of the treatment group minus the mean of the control group, 
while the `t.test()` function here reports the difference in the opposite order. 
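
The order of subtraction follows the order of the levels of the grouping variable, which is alphabetical by default for character data. One way to match the textbook's direction is to set the level order explicitly. The sketch below uses a small invented data frame `drp_demo` with the same column names; the scores are hypothetical, and the labels `"Control"` and `"Treat"` are assumed for illustration.

```{r level-order-sketch}
# Hypothetical scores and group labels, for illustration only
drp_demo <- data.frame(
  group = rep(c("Control", "Treat"), each = 5),
  drp = c(42, 37, 55, 46, 40, 57, 49, 61, 53, 44)
)
# Default (alphabetical) order: Control minus Treat
default_order <- t.test(drp ~ group, data = drp_demo)
# Explicit level order: Treat minus Control
drp_demo$group_tc <- factor(drp_demo$group, levels = c("Treat", "Control"))
treat_first <- t.test(drp ~ group_tc, data = drp_demo)
c(default = unname(default_order$statistic),
  treat_first = unname(treat_first$statistic))
```

The two statistics have the same magnitude and opposite signs, and the endpoints of the confidence interval are negated and swapped.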

```{r eg7-13}
# The difference here is control minus treatment, so a higher treatment mean corresponds to "less"
t.test(drp ~ group, data = DRP, alternative = "less")
```
Again, the sign of the *t* statistic is the opposite of the textbook's because of the order of subtraction noted above. 
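
With a formula, `alternative` refers to the first group level minus the second, so the direction has to be chosen to match that order. A minimal sketch with invented scores (the data and group labels in `drp_demo` are hypothetical) shows how the two one-sided P-values relate:

```{r alternative-direction-sketch}
# Hypothetical scores and group labels, for illustration only
drp_demo <- data.frame(
  group = rep(c("Control", "Treat"), each = 5),
  drp = c(42, 37, 55, 46, 40, 57, 49, 61, 53, 44)
)
# Both tests use Control minus Treat; only the direction differs
p_less <- t.test(drp ~ group, data = drp_demo, alternative = "less")$p.value
p_greater <- t.test(drp ~ group, data = drp_demo, alternative = "greater")$p.value
c(less = p_less, greater = p_greater, sum = p_less + p_greater)
```

The two one-sided P-values sum to 1, so choosing the wrong direction turns a small P-value into a large one.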

```{r eg7-16,message=FALSE}
# Example 7.16, page 444
EATER <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter07/EG07-16EATER.csv") %>% 
  na.omit()
favstats(WTLOSS ~ Group, data = EATER)
diffmean(WTLOSS ~ Group, data = EATER)
# Note that R calculates the difference of the early-eater mean from the later-eater mean

# 95% confidence intervals, page 476
# Equal variances assumed (pooled)
t.test(WTLOSS ~ Group, data = EATER, var.equal = TRUE) 
# Equal variances not assumed (var.equal is FALSE by default)
t.test(WTLOSS ~ Group, data = EATER) 
```
Since the last row of the dataset had missing values, we piped the data into `na.omit()` to 
remove the missing values (`NA`s) from our analysis. 
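
As a quick illustration of what `na.omit()` does, the sketch below builds a tiny invented data frame whose last row is missing. The name `eater_demo`, its values, and the group labels are hypothetical.

```{r na-omit-sketch}
# Hypothetical data with a missing final row, for illustration only
eater_demo <- data.frame(
  Group = c("Early", "Late", "Early", "Late", NA),
  WTLOSS = c(9.5, 7.2, 11.0, 6.8, NA)
)
nrow(eater_demo)
# na.omit() drops every row that contains at least one NA
eater_complete <- na.omit(eater_demo)
nrow(eater_complete)
anyNA(eater_complete)
```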

Another way to think about the `var.equal` argument in the `t.test()` function above is 
in terms of pooled variances. To use the pooled two-sample *t* procedure, we set 
`var.equal = TRUE`, as demonstrated in the following example: 
```{r eg7-18, message=FALSE}
# Example 7.19, pages 451-452
BP_CA <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter07/EG07-18BP_CA.csv")
## XX possibly wrong datapoint? 
favstats(dec ~ group, data = BP_CA)
t.test(dec ~ group, data = BP_CA, var.equal = TRUE, conf.level = 0.90)
```
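
The pooled procedure combines the two sample variances into $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ and uses $t = (\bar{x}_1 - \bar{x}_2) / \left(s_p \sqrt{1/n_1 + 1/n_2}\right)$ with $n_1 + n_2 - 2$ degrees of freedom. The sketch below reproduces this by hand on two invented samples (`g1` and `g2` are hypothetical, not the BP_CA values).

```{r pooled-sketch}
# Hypothetical samples, for illustration only
g1 <- c(7, -4, 18, 17, -3, -5, 1, 10)
g2 <- c(-1, 12, -1, -3, 3, -5, 5, 2)
n1 <- length(g1)
n2 <- length(g2)
# Pooled variance and the standard error of the difference in means
sp2 <- ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)
se_pooled <- sqrt(sp2 * (1 / n1 + 1 / n2))
t_by_hand <- (mean(g1) - mean(g2)) / se_pooled
# The same test through t.test()
pooled_res <- t.test(g1, g2, var.equal = TRUE)
c(t_by_hand = t_by_hand,
  t_test = unname(pooled_res$statistic),
  df = unname(pooled_res$parameter))
```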

```{r eg7-25, message=FALSE}
### Example 7.25, page 470
SONGS <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter07/EG07-25SONGS.csv")
## Checking the Normality condition
gf_qq(~ total_secs, data = SONGS) %>% 
  gf_qqline(linetype = "dashed", color = "red") %>% 
  gf_labs(x = "Normal score", y = "Time (seconds)")
## Check the condition after *transforming* the variable
gf_qq(~ log(total_secs), data = SONGS) %>% 
  gf_qqline(linetype = "dashed", color = "red") %>% 
  gf_labs(x = "Normal score", y = "Log of time (log seconds)")

log_total_secs_SONGS <- SONGS %>% mutate(log_total_secs = log(total_secs))

## Comparing the 95% confidence intervals
### With transformation
t.test(~ log_total_secs, data = log_total_secs_SONGS)
### Without transformation
t.test(~ total_secs, data = SONGS)
```

Since the logarithmic transformation made the Normal quantile plot look approximately 
linear, indicating that the transformed data are approximately Normal, we created a dataset 
called `log_total_secs_SONGS` that adds the transformed variable `log_total_secs`. 
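
An interval computed on the log scale can be mapped back to seconds with `exp()`. The back-transformed interval describes the geometric mean rather than the arithmetic mean, so it need not match the untransformed interval. A sketch with invented song lengths (`secs_demo` is hypothetical, not the SONGS data):

```{r back-transform-sketch}
# Hypothetical song lengths in seconds, for illustration only
secs_demo <- c(180, 210, 245, 198, 620, 230, 275, 190, 305, 222)
# 95% confidence interval for the mean of the logged values
ci_log <- as.numeric(t.test(log(secs_demo))$conf.int)
# Back-transform the endpoints to the original scale (seconds)
ci_back <- exp(ci_log)
# The geometric mean, which the back-transformed interval is built around
geo_mean <- exp(mean(log(secs_demo)))
ci_back
geo_mean
```
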

### Section 7.3: Additional topics on inference