\documentclass[11pt]{article}
\usepackage[margin=1in,bottom=.5in,includehead,includefoot]{geometry}
\usepackage{hyperref}
\usepackage{language}
\usepackage{alltt}
\usepackage{fancyhdr}
\pagestyle{fancy}
\fancyhf{}
%% Now begin customising things. See the fancyhdr docs for more info.
\chead{}
\lhead[\sf \thepage]{\sf \leftmark}
\rhead[\sf \leftmark]{\sf \thepage}
\lfoot{}
\cfoot{Introduction to the Practice of Statistics using R: Chapter 16}
\rfoot{}
\newcounter{myenumi}
\newcommand{\saveenumi}{\setcounter{myenumi}{\value{enumi}}}
\newcommand{\reuseenumi}{\setcounter{enumi}{\value{myenumi}}}
\pagestyle{fancy}
\def\R{{\sf R}}
\def\Rstudio{{\sf RStudio}}
\def\RStudio{{\sf RStudio}}
\def\term#1{\textbf{#1}}
\def\tab#1{{\sf #1}}
\usepackage{relsize}
\newlength{\tempfmlength}
\newsavebox{\fmbox}
\newenvironment{fmpage}[1]
{
\medskip
\setlength{\tempfmlength}{#1}
\begin{lrbox}{\fmbox}
\begin{minipage}{#1}
\vspace*{.02\tempfmlength}
\hfill
\begin{minipage}{.95 \tempfmlength}}
{\end{minipage}\hfill
\vspace*{.015\tempfmlength}
\end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}}
\medskip
}
\newenvironment{boxedText}[1][.98\textwidth]%
{%
\begin{center}
\begin{fmpage}{#1}
}%
{%
\end{fmpage}
\end{center}
}
\newenvironment{boxedTable}[2][tbp]%
{%
\begin{table}[#1]
\refstepcounter{table}
\begin{center}
\begin{fmpage}{.98\textwidth}
\begin{center}
\sf \large Box~\expandafter\thetable. #2
\end{center}
\medskip
}%
{%
\end{fmpage}
\end{center}
\end{table} % need to do something about exercises that follow boxedTable
}
\newcommand{\cran}{\href{http://www.R-project.org/}{CRAN}}
\title{Introduction to the Practice of Statistics using R: \\
Chapter 16}
\author{
Ben Baumer \and Nicholas J. Horton\thanks{Department of Mathematics, Amherst College, nhorton@amherst.edu}
}
\date{\today}
\begin{document}
\maketitle
\tableofcontents
%\parindent=0pt
<>=
opts_chunk$set(
dev="pdf",
tidy=FALSE,
fig.path="figures/",
fig.height=4,
fig.width=5,
out.width=".57\\textwidth",
fig.keep="high",
fig.show="hold",
fig.align="center",
prompt=TRUE, # show the prompts; but perhaps we should not do this
comment=NA
)
options(continue=" ")
@
<>=
print.pval = function(pval) {
threshold = 0.0001
return(ifelse(pval < threshold, paste("p<", sprintf("%.4f", threshold), sep=""),
ifelse(pval > 0.1, paste("p=",round(pval, 2), sep=""),
paste("p=", round(pval, 3), sep=""))))
}
@
<>=
require(mosaic)
trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice
set.seed(123)
# this allows for code formatting inline. Use \Sexpr{'function(x,y)'}, for exmaple.
knit_hooks$set(inline = function(x) {
if (is.numeric(x)) return(knitr:::format_sci(x, 'latex'))
x = as.character(x)
h = knitr:::hilight_source(x, 'latex', list(prompt=FALSE, size='normalsize'))
h = gsub("([_#$%&])", "\\\\\\1", h)
h = gsub('(["\'])', '\\1{}', h)
gsub('^\\\\begin\\{alltt\\}\\s*|\\\\end\\{alltt\\}\\s*$', '', h)
})
showOriginal=FALSE
showNew=TRUE
@
\section*{Introduction}
This document is intended to help describe how to undertake analyses introduced as examples in the Sixth Edition of \emph{Introduction to the
Practice of Statistics} (2002) by David Moore, George McCabe and Bruce
Craig.
More information about the book can be found at \url{http://bcs.whfreeman.com/ips6e/}. This
file as well as the associated \pkg{knitr} reproducible analysis source file can be found at
\url{http://www.math.smith.edu/~nhorton/ips6e}.
This work leverages initiatives undertaken by Project MOSAIC (\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the
\pkg{mosaic} package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignette (\url{http://cran.r-project.org/web/packages/mosaic/vignettes/MinimalR.pdf}).
To use a package within R, it must be installed (one time), and loaded (each session). The package can be installed using the following command:
<>=
install.packages('mosaic') # note the quotation marks
@
The {\tt \#} character is a comment in R, and all text after that on the
current line is ignored.
Once the package is installed (one time only), it can be loaded by running the command:
<>=
require(mosaic)
@
This
needs to be done once per session.
We also set some options to improve legibility of graphs and output.
<>=
trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice
options(digits=3)
@
The specific goal of this document is to demonstrate how to replicate the
analysis described in Chapter 16: Bootstrap Methods and Permutation Tests.
\section{The Bootstrap Idea}
The bootstrap is a fundamental concept in statistical computing, and the requisite calculations are very easy to perform in {\tt R}.
The repair time data from Verizon shown in Figure 16.1 (page 16-4) can be plotted thusly:
<>=
verizon = read.csv("http://www.math.smith.edu/ips6eR/ch16/eg16_001.csv")
histogram(~time, data=verizon, nint=100)
with(verizon, qqnorm(time, ylab="Repair times (in hours)"))
@
A command to facilitate resampling within the {\tt mosaic} package is {\tt resample()}. We get our first example on page 16-5, which considers a subset of size $n=6$ from the Verizon dataset.
<<>>=
data = c(3.12, 0, 1.57, 19.67, 0.22, 2.2)
mean(data)
s1 = resample(data)
s1
mean(s1)
s2 = resample(data)
s2
mean(s2)
s3 = resample(data)
s3
mean(s3)
@
Note that the results shown here do not match the book, due to the random nature of resampling.
In Figure 16.3 (page 16-6) we visualize a bootstrap distribution. To construct such a thing, we use the {\tt do()} command, which simply repeats some operation many times, and collects the results in a data frame.
<>=
mean(~time, data=verizon)
mean(~time, data=resample(verizon))
mean(~time, data=resample(verizon))
mean(~time, data=resample(verizon))
bootstrap = do(1000) * mean(time, data=resample(verizon))
favstats(~result, data=bootstrap)
# Theoretical standard error
14.69 / sqrt(1664)
@
Note how the theoretical standard error (i.e. standard deviation of the sampling distribution of the mean) compares to the standard deviation from the bootstrap sample.
<>=
histogram(~result, data=bootstrap, fit="normal")
with(bootstrap, qqnorm(result, ylab="Mean repair times of resamples (in hours)"))
@
\section{First Steps in Using the Bootstrap}
Table 16.1 and Figure 16.6 (page 16-14) display residential and commercial real estate prices in Seattle.
<>=
seattle = read.csv("http://www.math.smith.edu/ips6eR/ch16/ta16_001.csv")
names(seattle) = c("price")
histogram(~price, data=seattle)
with(seattle, qqnorm(price, ylab="Selling Price (in $1000)"))
@
In this example we are working with the 25\% trimmed mean. To find the 25\% trimmed mean, we grab only the middle 50\% of the data, and compute the mean on this subset. This can be achieved using the {\tt trim} argument to {\tt mean()}.
<<>>=
mean(~price, trim=0.25, data=seattle)
bootstrap = do(1000) * mean(~price, trim=0.25, data=resample(seattle))
favstats(~result, data=bootstrap)
histogram(~result, data=bootstrap, fit="normal")
with(bootstrap, qqnorm(result, ylab="Means of resamples (in $1000)"))
@
We compute the bias as the difference between the average of the bootstrapped means and the trimmed mean from the original sample.
<<>>=
# bias
mean(~result, data=bootstrap) - mean(~price, trim=0.25, data=seattle)
@
The computation of the confidence interval in Example 16.5 (page 16-16) makes use of the $t$-distribution.
<<>>=
se.boot = sd(~result, bootstrap)
t.star = qt(0.975, df=49)
t.star
moe = t.star * se.boot
mean(~price, trim=0.25, data=seattle) + c(-moe, moe)
@
In Example 16.6, we compare the means of two groups of service providers.
<<>>=
CLEC = read.csv("http://www.math.smith.edu/ips6eR/ch16/eg16_006.csv")
mean(Time ~ Group, data=CLEC)
densityplot(~Time, groups=Group, data=CLEC)
@
We then construct a bootstrap distribution for the difference in means among the two groups.
<<>>=
bstrap = do(1000) * diff(mean(Time ~ Group, data=resample(CLEC)))
favstats(~ILEC, data=bstrap)
@
Note that the resulting distribution is not quite so normal. Thus, we can use the quantile method to produce a bootstrap percentile confidence interval for the mean.
<<>>=
histogram(~ILEC, fit="normal", data=bstrap)
qdata(c(0.025, 0.975), vals=ILEC, data=bstrap)
@
\section{How Accurate is a Bootstrap Distribution?}
\section{Bootstrap Confidence Intervals}
We return to the construction of a confidence interval for the mean price of real estate in Seattle explored in Example 16-5. To the $t$-based confidence interval we constructed previously, we can add the percentile-based confidence interval
<<>>=
mean(~price, trim=0.25, data=seattle) + c(-moe, moe)
qdata(c(0.025, 0.975), vals=result, data=bootstrap)
@
Note that the bootstrapped confidence interval is not quite symmetric with respect to the sample mean of 244.
\subsection{Confidence intervals for the correlation}
In Example 16.10 (page 16-35), we explore the correlation between batting average and player salary in Major League Baseball. The value of the correlation coefficient among the 50 players in Table 16.2 (page 16-36) is relatively small.
<<>>=
MLB = read.csv("http://www.math.smith.edu/ips6eR/ch16/ta16_002.csv")
names(MLB)[2] = "Salary"
xyplot(Salary ~ Average, data=MLB, xlab="Batting Average"
, ylab="Salary (in millions of dollars)")
with(MLB, cor(Salary, Average))
@
To construct a bootstrap distribution for the correlation between batting average and salary, we resample the players and compute the correlation coefficient.
<<>>=
cor.boot = do(1000) * with(resample(MLB), cor(Salary, Average))
histogram(~result, data=cor.boot, fit="normal")
with(cor.boot, qqnorm(result, ylab="Correlation Coefficient"))
@
In this case, the $t$-based confidence interval for the correlation coefficient
<<>>==
se.boot = sd(~result, cor.boot)
t.star = qt(0.975, df=(nrow(MLB) - 1))
t.star
moe = t.star * se.boot
with(MLB, cor(Salary, Average)) + c(-moe, moe)
@
is in reasonable agreement with the percentile-based method.
<<>>=
qdata(c(0.025, 0.975), vals=result, data=cor.boot)
@
\end{document}