\documentclass[11pt]{article}
\usepackage[margin=1in,bottom=.5in,includehead,includefoot]{geometry}
\usepackage{hyperref}
\usepackage{language}
\usepackage{alltt}
\usepackage{fancyhdr}
\pagestyle{fancy}
\fancyhf{}
%% Now begin customising things. See the fancyhdr docs for more info.
\chead{}
\lhead[\sf \thepage]{\sf \leftmark}
\rhead[\sf \leftmark]{\sf \thepage}
\lfoot{}
\cfoot{Introduction to the Practice of Statistics using R: Chapter 3}
\rfoot{}
\newcounter{myenumi}
\newcommand{\saveenumi}{\setcounter{myenumi}{\value{enumi}}}
\newcommand{\reuseenumi}{\setcounter{enumi}{\value{myenumi}}}
\pagestyle{fancy}
\def\R{{\sf R}}
\def\Rstudio{{\sf RStudio}}
\def\RStudio{{\sf RStudio}}
\def\term#1{\textbf{#1}}
\def\tab#1{{\sf #1}}
\usepackage{relsize}
\newlength{\tempfmlength}
\newsavebox{\fmbox}
\newenvironment{fmpage}[1]
{
\medskip
\setlength{\tempfmlength}{#1}
\begin{lrbox}{\fmbox}
\begin{minipage}{#1}
\vspace*{.02\tempfmlength}
\hfill
\begin{minipage}{.95 \tempfmlength}}
{\end{minipage}\hfill
\vspace*{.015\tempfmlength}
\end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}}
\medskip
}
\newenvironment{boxedText}[1][.98\textwidth]%
{%
\begin{center}
\begin{fmpage}{#1}
}%
{%
\end{fmpage}
\end{center}
}
\newenvironment{boxedTable}[2][tbp]%
{%
\begin{table}[#1]
\refstepcounter{table}
\begin{center}
\begin{fmpage}{.98\textwidth}
\begin{center}
\sf \large Box~\expandafter\thetable. #2
\end{center}
\medskip
}%
{%
\end{fmpage}
\end{center}
\end{table} % need to do something about exercises that follow boxedTable
}
\newcommand{\cran}{\href{http://www.R-project.org/}{CRAN}}
\title{Introduction to the Practice of Statistics using R: \\
Chapter 3}
\author{
Nicholas J. Horton\thanks{Department of Mathematics, Amherst College, nhorton@amherst.edu} \and Ben Baumer
}
\date{\today}
\begin{document}
\maketitle
\tableofcontents
%\parindent=0pt
<>=
opts_chunk$set(
dev="pdf",
tidy=FALSE,
fig.path="figures/",
fig.height=4,
fig.width=5,
out.width=".57\\textwidth",
fig.keep="high",
fig.show="hold",
fig.align="center",
prompt=TRUE, # show the prompts; but perhaps we should not do this
comment=NA
)
options(continue=" ")
@
<>=
print.pval = function(pval) {
threshold = 0.0001
return(ifelse(pval < threshold, paste("p<", sprintf("%.4f", threshold), sep=""),
ifelse(pval > 0.1, paste("p=",round(pval, 2), sep=""),
paste("p=", round(pval, 3), sep=""))))
}
@
<>=
require(mosaic)
trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice
set.seed(123)
# this allows for code formatting inline. Use \Sexpr{'function(x,y)'}, for exmaple.
knit_hooks$set(inline = function(x) {
if (is.numeric(x)) return(knitr:::format_sci(x, 'latex'))
x = as.character(x)
h = knitr:::hilight_source(x, 'latex', list(prompt=FALSE, size='normalsize'))
h = gsub("([_#$%&])", "\\\\\\1", h)
h = gsub('(["\'])', '\\1{}', h)
gsub('^\\\\begin\\{alltt\\}\\s*|\\\\end\\{alltt\\}\\s*$', '', h)
})
showOriginal=FALSE
showNew=TRUE
@
\section*{Introduction}
This document is intended to help describe how to undertake analyses introduced as examples in the Sixth Edition of \emph{Introduction to the
Practice of Statistics} (2009) by David Moore, George McCabe and Bruce
Craig.
More information about the book can be found at \url{http://bcs.whfreeman.com/ips6e/}. This
file as well as the associated \pkg{knitr} reproducible analysis source file can be found at
\url{http://www.math.smith.edu/~nhorton/ips6e}.
This work leverages initiatives undertaken by Project MOSAIC (\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the
\pkg{mosaic} package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignette (\url{http://cran.r-project.org/web/packages/mosaic/vignettes/MinimalR.pdf}).
To use a package within R, it must be installed (one time), and loaded (each session). The package can be installed using the following command:
<>=
install.packages('mosaic') # note the quotation marks
@
The {\tt \#} character is a comment in R, and all text after that on the
current line is ignored.
Once the package is installed (one time only), it can be loaded by running the command:
<>=
require(mosaic)
@
This
needs to be done once per session.
We also set some options to improve legibility of graphs and output.
<>=
trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice
options(digits=3)
@
The specific goal of this document is to demonstrate how to replicate the
analysis described in Chapter 3: Producing Data.
\section{Design of experiments}
\subsection{Randomizing subjects}
It's straightforward to randomly divide 40 students into two groups of 20 students
each (as described in Example 3.11 on page 185).
<<>>=
students = 1:40 # equivalent to seq(from=1, to=40, by=1)
group1 = sample(students, size=20)
sort(group1)
group2 = students[-group1] # all but those values are included
sort(group2)
@
\section{Sampling design}
\subsection{Simple random samples}
We reproduce a random sampling of resorts (from Figure 3.8, page 202).
<<>>=
resorts = c("Aloha Kai", "Captiva", "Palm Tree", "Sea Shell", "Anchor Down",
"Casa del Mar", "Radisson", "Silver Beach")
# generate a SRS of size 3
sampled = sample(resorts, size=3)
sampled
@
\section{Toward statistical inference}
\subsection{Simulate a random sample}
It's straightforward to use R to generate simple random samples.
Example 3.32 (page 214) describes how this is done by using a table
of random digits. It's more generalizable to do this with a set of possible
options each with specified probabilities (probability frustrated=0.6, probability
not-frustrated=0.4):
<<>>=
srs1 = sample(c("Frustrating", "Not-frustrating"), size=100, prob=c(0.6, 0.4),
replace=TRUE)
tally(srs1)
@
We can repeat the process, which will (generally) give different answers.
<<>>=
n = 100
n
tally(sample(c("Frustrating", "Not-frustrating"), size=n, prob=c(0.6, 0.4),
replace=TRUE))
tally(sample(c("Frustrating", "Not-frustrating"), size=n, prob=c(0.6, 0.4),
replace=TRUE))
tally(sample(c("Frustrating", "Not-frustrating"), size=n, prob=c(0.6, 0.4),
replace=TRUE))
@
We can repeat the process many times using the {\tt do()} function, which
saves the results.
<<>>=
res = do(1000) * tally(sample(c("Frustrating", "Not-frustrating"), size=n,
prob=c(0.6, 0.4), replace=TRUE))
histogram(~ Frustrating, xlab="Number reporting shopping frustrating", data=res)
@
We see that the sampling distribution for the number reporting \emph{Frustrating}
in m=1000 simple random samples each of size n=100 is centered at the value of around 60, which we
would expect since the true probability of being Frustrating is in fact $p=0.60$.
The results are equivalent if rescaled as a proportion (by dividing by the sample size).
<<>>=
sd(~ Frustrating/n, data=res)
histogram(~ Frustrating/n, xlab="Proportion reporting shopping frustrating", data=res)
@
What happens if we take samples of size n=2500 (as displayed in Example 3.33, on
pages 214--215).
<<>>=
n=2500
n
res = do(1000) * tally(sample(c("Frustrating", "Not-frustrating"), size=n,
prob=c(0.6, 0.4), replace=TRUE))
sd(~ Frustrating/n, data=res)
histogram(~ Frustrating/n, xlab="Proportion reporting shopping frustrating",
data=res)
@
The sampling distribution is much narrower, given the much larger sample size.
\subsection{Capture-recapture sampling}
R can be used as a calculator, as for the calculations in Example 3.34 (page 220).
<<>>=
200*120/12
@
\end{document}