--- title: "IPS9 in R: Bootstrap Methods and Permutation Tests (Chapter 16)" author: "Bonnie Lin and Nicholas Horton (nhorton@amherst.edu)" date: "July 22, 2018" output: pdf_document: fig_height: 3 fig_width: 5 html_document: fig_height: 3 fig_width: 5 word_document: fig_height: 4 fig_width: 6 --- ```{r, include = FALSE} # Don't delete this chunk if you are using the mosaic package # This loads the mosaic and dplyr packages require(mosaic) library(readr) ``` ```{r, include = FALSE, message = FALSE} # knitr settings to control how R chunks work. require(knitr) opts_chunk$set( tidy = FALSE, # display code as typed size = "small", # slightly smaller font for code message = FALSE ) ``` ## Introduction and background These documents are intended to help describe how to undertake analyses introduced as examples in the Ninth Edition of \emph{Introduction to the Practice of Statistics} (2017) by Moore, McCabe, and Craig. More information about the book can be found [here](https://macmillanlearning.com/Catalog/product/introductiontothepracticeofstatistics-ninthedition-moore). The data used in these documents can be found under Data Sets in the [Student Site](https://www.macmillanlearning.com/catalog/studentresources/ips9e?_ga=2.29224888.526668012.1531487989-1209447309.1529940008#). This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at https://nhorton.people.amherst.edu/ips9/. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. For those less familiar with resampling, the mosaic resampling vignette may be a useful additional reference (https://cran.r-project.org/web/packages/mosaic/vignettes/Resampling.pdf). ## Chapter 1: Looking at data -- distributions This file replicates the analyses from the (online-only) Chapter 16: Bootstrap Methods and Permutation Tests. First, load the packages that will be needed for this document: ```{r load-packages} library(mosaic) library(readr) ``` ### Section 16.1: The bootstrap idea With the following code, we create many resamples by repeatedly sampling with replacement from the sample of 21 Facebook profiles. Then, we create the bootstrap distribution by finding the mean for the 3,000 resamples. The histogram and qqplots reproduced below can be found on page 5. ```{r eg16-2} FACE4 <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter16/EG16-01FACE4.csv") set.seed(2018) numsim <- 3000 Avg_Viewtime <- do(numsim) * mean(~ Time, data = resample(FACE4, 21)) #Figure 16.3(a) favstats(~ mean, data = Avg_Viewtime) gf_histogram(~ mean, data = Avg_Viewtime) %>% gf_labs(x = "Mean time of resamples (in minutes)", y = "Count") #Figure 16.3(b) gf_qq(~ mean, data = Avg_Viewtime) %>% gf_qqline(color = "red", linetype = "solid") %>% gf_labs(x = "Normal score", y = "Mean time of resamples (in minutes)") ``` The pseudorandom number used to set the seed enables the reproduction of results. The `do()` function in the `mosaic` package simplifies the iteration and resampling processes by allowing us to set the number of repetitions and specify what to repeat. ### Section 16.2: First steps in using the bootstrap The strongly right skewed distribution is shown on the histogram and Normal quantile plot for the original 150 grade points (page 14): ```{r eg16-4-sample} College_GPA <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter16/EG16-04GPA.csv") ## Figure 16.8(a) gf_dhistogram(~ GPA, data = College_GPA) %>% gf_labs(x = "GPA", y = "Percent") ## Figure 16.8(b) gf_qq(~ GPA, data = College_GPA) %>% gf_lims(y = c(0, 4)) %>% gf_labs(x = "Normal score", y = "GPA") ``` We will now use the 25% trimmed mean as the parameter and use the sample of 150 students to find the statistic that estimates this parameter. The process is similar to the bootstrapping done above, except that we now specify the "trim" argument in the `mean()` function: ```{r eg 16-4-trimmed, warning=FALSE} mean_GPA <- mean(~ GPA, data = College_GPA, trim = 0.25) numsim <- 3000 GPA_sims <- do(numsim) * mean(~ GPA, data = resample(College_GPA, 150), trim = 0.25) ## Figure 16.9(a) gf_dhistogram(~ mean, data = GPA_sims, binwidth = 0.05, bins = 12, center = 0.05/2) %>% gf_lims(x = c(2.7, 3.3)) %>% gf_fitdistr(dist = "dnorm") %>% gf_labs(x = "Means of resamples") %>% gf_vline(xintercept = as.numeric(mean_GPA)) ## XX Graph doesn't look like the textbook, missing the upstep next to the population trimmed mean ## Figure 16.9(b) gf_qq(~ mean, data = GPA_sims) %>% gf_labs(x = "Normal score", y = "Means of resamples") %>% gf_qqline(linetype = "solid", color = "red") ## XX Graph doesn't look like the textbook one, missing outliers ``` Since the seed has been previously set, we do not need to set the seed in this code chunk again. To find the 95% confidence interval for the population trimmed mean as the book does on page 16, we can type: ```{r eg16-5} confint(GPA_sims, level = 0.95) ## XX The textbook reports 2.794, 3.106 and the standard error is 0.078. ## Now wondering if the GPA_sims is the right data ``` Below we have replicated the Density curves and Normal quantile plots of the distributions of GPA for males and females as well as the summary statistics and the bootstrap standard error for the difference in sample means (page 17-18): ```{r eg16-6} College_GPA <- College_GPA %>% mutate(sex = dplyr::recode(sex, "1" = "Male", "2" = "Female")) ## Figure 16.10(a) gf_dens(~ GPA, data = College_GPA, color = ~ sex) ## Figure 16.10(b) gf_qq(~ GPA, data = College_GPA, color = ~ sex) Summary_GPA <- cbind( tally(~ sex, data = College_GPA), mean(GPA ~ sex, data = College_GPA) %>% round(digits = 3), sd(GPA ~ sex, data = College_GPA) %>% round(digits = 3)) colnames(Summary_GPA) = c("n", "Mean", "Standard Error") Summary_GPA ## Bootstrapping numsim <- 3000 Male <- College_GPA %>% filter(sex == "Male") Female <- College_GPA %>% filter(sex == "Female") diffMeanGPA <- function(Male, Female){ Sample_Male <- resample(Male) Sample_Female<- resample(Female) diff(c(mean(~ GPA, data = Sample_Male), mean(~ GPA, data = Sample_Female))) } GPA_sex_sims <- do(numsim) * diffMeanGPA(Male, Female) # Figure 16.11(a) gf_dhistogram(~ diffMeanGPA, data = GPA_sex_sims, binwidth = 0.1, center = 0.05) %>% gf_fitdistr(dist = "norm") %>% gf_labs(x = "Difference in means of resamples") # Figure 16.11(b) gf_qq(~ diffMeanGPA, data = GPA_sex_sims) %>% gf_labs(x = "Normal score", y = "Difference in means of resamples") %>% gf_qqline(linetype = "solid", color = "red") ``` We resample separately from the two samples, so that each of our 3000 resamples consists of two group resamples, one of size 91 drawn with replacement from the male data and one of size 59 drawn with replacement from the female data. XX No data for Example 16.7 (page 20) "Do all daily numbers have an equal payoff" ### Section 16.3: How accurate is a bootstrap distribution? ### Section 16.4: Bootstrap confidence intervals Up to this point, we have demonstrated the use of the bootstrap to find confidence intervals of the sample mean, trimmed mean, and the difference between two means. Now we will bootstrap the confidence intervals for the correlation coefficient, as introduced on page 35. ```{r eg16-10} LAUND24 <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter16/EG16-10LAUND24.csv") # Figure 16.19 gf_point(Rating ~ PricePerLoad, data = LAUND24) # Figure 16.20(a) numsim <- 5000 cor_LAUND24 <- cor(Rating ~ PricePerLoad, data = LAUND24) LAUND_sims <- do(numsim) * cor(Rating ~ PricePerLoad, data = resample(LAUND24)) gf_dhistogram( ~ cor, data = LAUND_sims, binwidth = 0.05, center = 0.025) %>% gf_fitdistr(dist = "dnorm") %>% gf_labs(x = "Correlation coefficient of resamples") %>% gf_vline(xintercept = cor_LAUND24) # Figure 16.20(b) gf_qq(~ cor, data = LAUND_sims) %>% gf_labs(x = "Normal score", y = "Correlation coefficient of resamples") %>% gf_qqline(linetype = "solid", color = "red") # Computing the 95% bootstrap percentile interval confint(LAUND_sims, method=c("se", "perc")) ``` ### Section 16.5: Significance testing using permutation tests Permutation tests use resampling to generate a null distribution for a test statistic, with each step randomly resampling without replacement and recalculating the test statistic in each permutation. In the example provided in the textbook, the permutation test randomizes the assignment of students to either the treatment or control group to test for the effect of the treatment, "directed reading activities" in this case (page 43): ```{r eg16-12} DRP <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter16/EG16-11DRP.csv") numsim <- 1000 DRP_sim <- do(numsim) * diffmean(drp ~ shuffle(group), data = DRP) gf_dhistogram(~ diffmean, data = DRP_sim, binwidth = 1, center = 0.35) %>% gf_fitdistr(dist = "norm") %>% gf_labs(x = "Difference between the treatment mean and the control mean") ```