\documentclass[11pt]{article} \usepackage[margin=1in,bottom=.5in,includehead,includefoot]{geometry} \usepackage{hyperref} \usepackage{language} \usepackage{alltt} \usepackage{fancyhdr} \pagestyle{fancy} \fancyhf{} %% Now begin customising things. See the fancyhdr docs for more info. \chead{} \lhead[\sf \thepage]{\sf \leftmark} \rhead[\sf \leftmark]{\sf \thepage} \lfoot{} \cfoot{Introduction to the Practice of Statistics using R: Chapter 6} \rfoot{} \newcounter{myenumi} \newcommand{\saveenumi}{\setcounter{myenumi}{\value{enumi}}} \newcommand{\reuseenumi}{\setcounter{enumi}{\value{myenumi}}} \pagestyle{fancy} \def\R{{\sf R}} \def\Rstudio{{\sf RStudio}} \def\RStudio{{\sf RStudio}} \def\term#1{\textbf{#1}} \def\tab#1{{\sf #1}} \usepackage{relsize} \newlength{\tempfmlength} \newsavebox{\fmbox} \newenvironment{fmpage}[1] { \medskip \setlength{\tempfmlength}{#1} \begin{lrbox}{\fmbox} \begin{minipage}{#1} \vspace*{.02\tempfmlength} \hfill \begin{minipage}{.95 \tempfmlength}} {\end{minipage}\hfill \vspace*{.015\tempfmlength} \end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}} \medskip } \newenvironment{boxedText}[1][.98\textwidth]% {% \begin{center} \begin{fmpage}{#1} }% {% \end{fmpage} \end{center} } \newenvironment{boxedTable}[2][tbp]% {% \begin{table}[#1] \refstepcounter{table} \begin{center} \begin{fmpage}{.98\textwidth} \begin{center} \sf \large Box~\expandafter\thetable. #2 \end{center} \medskip }% {% \end{fmpage} \end{center} \end{table} % need to do something about exercises that follow boxedTable } \newcommand{\cran}{\href{http://www.R-project.org/}{CRAN}} \title{Introduction to the Practice of Statistics using R: \\ Chapter 6} \author{ Ben Baumer \and Nicholas J. Horton\thanks{Department of Mathematics, Amherst College, nhorton@amherst.edu} } \date{\today} \begin{document} \maketitle \tableofcontents %\parindent=0pt <>= opts_chunk$set( dev="pdf", tidy=FALSE, fig.path="figures/", fig.height=4, fig.width=5, out.width=".57\\textwidth", fig.keep="high", fig.show="hold", fig.align="center", prompt=TRUE, # show the prompts; but perhaps we should not do this comment=NA ) options(continue=" ") @ <>= print.pval = function(pval) { threshold = 0.0001 return(ifelse(pval < threshold, paste("p<", sprintf("%.4f", threshold), sep=""), ifelse(pval > 0.1, paste("p=",round(pval, 2), sep=""), paste("p=", round(pval, 3), sep="")))) } @ <>= require(mosaic) trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice set.seed(43) # this allows for code formatting inline. Use \Sexpr{'function(x,y)'}, for exmaple. knit_hooks$set(inline = function(x) { if (is.numeric(x)) return(knitr:::format_sci(x, 'latex')) x = as.character(x) h = knitr:::hilight_source(x, 'latex', list(prompt=FALSE, size='normalsize')) h = gsub("([_#$%&])", "\\\\\\1", h) h = gsub('(["\'])', '\\1{}', h) gsub('^\\\\begin\\{alltt\\}\\s*|\\\\end\\{alltt\\}\\s*$', '', h) }) showOriginal=FALSE showNew=TRUE @ \section*{Introduction} This document is intended to help describe how to undertake analyses introduced as examples in the Sixth Edition of \emph{Introduction to the Practice of Statistics} (2009) by David Moore, George McCabe and Bruce Craig. More information about the book can be found at \url{http://bcs.whfreeman.com/ips6e/}. This file as well as the associated \pkg{knitr} reproducible analysis source file can be found at \url{http://www.math.smith.edu/~nhorton/ips6e}. This work leverages initiatives undertaken by Project MOSAIC (\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the \pkg{mosaic} package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignette (\url{http://cran.r-project.org/web/packages/mosaic/vignettes/MinimalR.pdf}). To use a package within R, it must be installed (one time), and loaded (each session). The package can be installed using the following command: <>= install.packages('mosaic') # note the quotation marks @ The {\tt \#} character is a comment in R, and all text after that on the current line is ignored. Once the package is installed (one time only), it can be loaded by running the command: <>= require(mosaic) @ This needs to be done once per session. We also set some options to improve legibility of graphs and output. <>= trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice options(digits=3) @ The specific goal of this document is to demonstrate how to replicate the analysis described in Chapter 6: Introduction to Inference. \section{Estimating with Confidence} First, let's generate a random sample of 500 SAT scores drawn from a normal distribution with mean 500 and standard deviation 100. To do this we use the {\tt rnorm()} function, which draws from a normal distribution. <<>>= mu = 500 sigma = 100 x = rnorm(500, mean=mu, sd=sigma) favstats(x) @ To compute a confidence interval for the mean, we'll use a simple function that finds a confidence interval for the mean of any vector of data $x$, given a specified significance level and the true (assumed known) population standard deviation. Note that 95\% is the default level of confidence. <<>>= meanconfint = function (x, sigma, level = 0.95, ...) { se = sigma / sqrt(length(x)) mu = mean(x) z = qnorm(1 - (1 - level)/2) out = c(mu, mu - z * se, mu + z * se) names(out) = c("mean", "lower", "upper") return(out) } meanconfint(x, sigma = sigma) @ At the bottom of page 358, many such confidence intervals are calculated. We can simulate this using our function. The {\tt do()} function will repeat any operation a specified number of times, and return a data frame of the results. The {\tt apply()} family of functions provide a powerful way to apply an operation to the rows or columns of a data frame. Here it lets us repeat an operation for each of the 50 sets of 500 random numbers. <<>>= randomx = do(50) * rnorm(500, mean=mu, sd=sigma) ci = data.frame(t(apply(randomx, 1, meanconfint, sigma=sigma))) head(ci, 3) @ Let's try to visualize these intervals in a manner analogous to the plot on the bottom of page 358. <>= xyplot(1:nrow(ci) ~ mean, data=ci, xlim=range(ci), xlab="SAT score", ylab="Index") ladd(panel.abline(v=500, col="lightgray", lty=2)) ladd(with(ci, panel.arrows(x0 = lower, y0=1:nrow(ci), y1=1:nrow(ci), cex=0.5, x1=upper, code=3))) @ We see that sometimes (e.g. simulation 41) the confidence interval does \emph{not} cover the true population mean (500 points). Note that we can consider confidence levels other than $0.95$ by specifying the {\tt level} argument. Here's how we compute a 90\% confidence interval. <<>>= head(t(apply(randomx, 1, meanconfint, sigma=sigma, level=0.9)), 3) @ The 90\% confidence intervals are narrower than the 95\% confidence intervals, since we sacrifice some accuracy in exchange for increased confidence that the interval will contain the true mean. In Example 6.4 (page 361), we are asked to compute a 95\% confidence interval for a sample mean of \$18,900 in undergraduate debt, computed from a sample of 1280 borrowers. The standard deviation of the population is known to be \$49,000. Since we want a 95\% confidence interval, we need to find the $z$-score that corresponds to $0.025$ (or equivalently $0.0975$), since 95\% of the standard normal distribution lies between these two values. <<>>= z.star = qnorm(0.975) z.star @ Then we compute the margin or error and the confidence interval by: <<>>= se = z.star * (49000) / sqrt(1280) se 18900 + c(-se, se) @ In Example 6.6 (page 364), we change the confidence level to 99\%. Thus, we need to compute a different value of $z^*$. <<>>= z.star2 = qnorm(0.995) z.star2 se2 = z.star2 * (49000) / sqrt(1280) se2 18900 + c(-se2, se2) @ \subsection{Beyond the Basics} We'll discuss the bootstrap in much greater detail in Chapter 16. Here, we can use the {\tt resample()} function from {\tt mosaic} to quickly compute a bootstrap sample. <<>>= time = c(190.5, 109, 95.5, 137) resample(time) bootstrap = do(1000) * mean(resample(time)) densityplot(~result, data=bootstrap) @ \section{Tests of Significance} In Example 6.12 (page 378), we compute a $p$-value for the observed difference of \$4,100. Note that we need to multiply the cumulative probability in the right-hand tail by 2 for a two-sided test. <<6.12>>= z = (4100 - 0) / 3000 z 2 * (1 - pnorm(z)) @ The $z$-test for a population mean on page 383 can be computed using the {\tt pbinom()}. <<>>= # one-sided test for right tail probability pnorm(2, lower.tail=FALSE) # one-sided test for left tail probability pnorm(-2) # two-sided test 2 * pnorm(2, lower.tail=FALSE) @ In Example 6.16 (page 385), we find the right-hand tail probability. <<>>= pnorm(461, mean=450, sd=100 / sqrt(500), lower.tail=FALSE) xpnorm(2.46) @ \section{Use and Abuse of Tests} In Example 6.84 (page 396), we test for significance. Note the use of a one-sided test. <<>>= z1 = (541.4 - 525) / (100 / sqrt(100)) pnorm(z1, lower.tail=FALSE) z2 = (541.5 - 525) / (100 / sqrt(100)) pnorm(z2, lower.tail=FALSE) @ \section{Power and Inference as a Decision} Example 6.29 (page 402) considers the power for a study with n=25 subjects, where a one-sided alternative is tested at an $\alpha$ level of 0.05 and the population standard deviation is assumed known and equals $\sigma=2$. <<>>= xqnorm(.95, mean=0, sd=2/sqrt(25)) @ We can now compare this to the distribution when the alternative is true ($\mu=1$). <<>>= xpnorm(0.658, mean=1, sd=2/sqrt(25)) @ We see that the power is 0.80. \end{document}