--- title: "SDM4 in R: Sampling Distribution Models (Chapter 17)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "August 12, 2017" output: pdf_document: fig_height: 2.8 fig_width: 6 html_document: fig_height: 3 fig_width: 5 word_document: fig_height: 4 fig_width: 6 --- ```{r, include=FALSE} # Don't delete this chunk if you are using the mosaic package # This loads the mosaic and dplyr packages require(mosaic) ``` ```{r, include=FALSE} # Some customization. You can alter or delete as desired (if you know what you are doing). # This changes the default colors in lattice plots. trellis.par.set(theme=theme.mosaic()) # knitr settings to control how R chunks work. require(knitr) opts_chunk$set( tidy=FALSE, # display code as typed size="small" # slightly smaller font for code ) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fourth Edition of \emph{Stats: Data and Models} (2014) by De Veaux, Velleman, and Bock. More information about the book can be found at http://wps.aw.com/aw_deveaux_stats_series. This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/sdm4. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. ## Chapter 17: Sampling Distribution Models ### Section 17.1: Sampling distribution of a proportion Let's regenerate Figure 17.1 (page 444). ```{r} library(mosaic); options(digits=3) numsim <- 2000 n <- 1022 p <- 0.57 samples <- rbinom(numsim, size=n, prob=p)/n histogram(~ samples, xlab="Distribution of sample proportions", width=0.01, center=0.01/2, type="count") plotDist("norm", params=list(p, sqrt(p*(1-p)/n)), xlab="p", ylab="f(p)") ``` ### Section 17.2: When does the normal model work? We can replicate the example from page 449: ```{r} p <- 0.22; n <- 200 pnorm(.155, mean=p, sd=sqrt(p*(1-p)/n)) # normal approximation pbinom(31, size=n, prob=p) # exact value ``` ### Section 17.3: The sampling distribution of other statistics Let's replicate the display on page 451: ```{r} BodyFat <- read.csv("http://nhorton.people.amherst.edu/sdm4/data/Body_fat_complete.csv") medians <- do(2000)*median(~ Weight, data=sample(BodyFat, 10, replace=FALSE)) histogram(~ median, xlab="Sampling distribution of medians", width=5, center=5/2, data=medians) variances <- do(2000)*var(~ Weight, data=sample(BodyFat, 10, replace=FALSE)) histogram(~ var, xlab="Sampling distribution of variances", width=100, center=100/2, data=variances) minimums <- do(2000)*min(~ Weight, data=sample(BodyFat, 10, replace=FALSE)) histogram(~ min, xlab="Sampling distribution of minimums", width=2, center=2/2, data=minimums) ``` Neither of the sampling distributions of the variance or the minimums are normally distributed. ### Section 17.4: Central Limit Theorem Let's replicate the displays on pages 453-454: ```{r message=FALSE} require(readr) CEO <- read_delim("http://nhorton.people.amherst.edu/sdm4/data/CEO_Salary_2012.txt", delim="\t") CEO <- mutate(CEO, Pay = One_Year_Pay*1000) favstats(~ Pay, data=CEO) ``` Note that Figure 17.11 seems to be off by a factor of 10! ```{r} histogram(~ Pay, xlab="CEO Compensation in $1000", width=10000, center=10000/2-.01, data=CEO) ``` ```{r} samp10 <- do(2000)*mean(~ Pay, data=sample(CEO, 10)) histogram(~ mean, xlab="means of CEO Compensation (samples of size n=10)", width=2000, center=2000/2, data=samp10) ``` ```{r} samp50 <- do(2000)*mean(~ Pay, data=sample(CEO, 50)) histogram(~ mean, xlab="means of CEO Compensation (samples of size n=50)", width=1000, center=1000/2, data=samp50) ``` ```{r} samp100 <- do(2000)*mean(~ Pay, data=sample(CEO, 100)) histogram(~ mean, xlab="means of CEO Compensation (samples of size n=100)", width=500, center=500/2, data=samp100) ``` ```{r} samp200<- do(2000)*mean(~ Pay, data=sample(CEO, 200)) histogram(~ mean, xlab="means of CEO Compensation (samples of size n=200)", width=500, center=500/2, data=samp200) ``` Note how the axis limits get narrower as the sample size increases (since the means are less variable when the sample size increases: ```{r} mysd <- sd(~ Pay, data=CEO); mysd sd(~ mean, data=samp10) # what we observed mysd/sqrt(10) # what we would expect ``` We can repeat this comparison for each of the sets of samples. ```{r} sd(~ mean, data=samp50) sd(~ mean, data=samp100) sd(~ mean, data=samp200) ```