\documentclass[11pt]{article} \usepackage[margin=1in,bottom=.5in,includehead,includefoot]{geometry} \usepackage{hyperref} \usepackage{language} \usepackage{alltt} \usepackage{fancyhdr} \pagestyle{fancy} \fancyhf{} %% Now begin customising things. See the fancyhdr docs for more info. \chead{} \lhead[\sf \thepage]{\sf \leftmark} \rhead[\sf \leftmark]{\sf \thepage} \lfoot{} \cfoot{Introduction to the Practice of Statistics using R: Chapter 12} \rfoot{} \newcounter{myenumi} \newcommand{\saveenumi}{\setcounter{myenumi}{\value{enumi}}} \newcommand{\reuseenumi}{\setcounter{enumi}{\value{myenumi}}} \pagestyle{fancy} \def\R{{\sf R}} \def\Rstudio{{\sf RStudio}} \def\RStudio{{\sf RStudio}} \def\term#1{\textbf{#1}} \def\tab#1{{\sf #1}} \usepackage{relsize} \newlength{\tempfmlength} \newsavebox{\fmbox} \newenvironment{fmpage}[1] { \medskip \setlength{\tempfmlength}{#1} \begin{lrbox}{\fmbox} \begin{minipage}{#1} \vspace*{.02\tempfmlength} \hfill \begin{minipage}{.95 \tempfmlength}} {\end{minipage}\hfill \vspace*{.015\tempfmlength} \end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}} \medskip } \newenvironment{boxedText}[1][.98\textwidth]% {% \begin{center} \begin{fmpage}{#1} }% {% \end{fmpage} \end{center} } \newenvironment{boxedTable}[2][tbp]% {% \begin{table}[#1] \refstepcounter{table} \begin{center} \begin{fmpage}{.98\textwidth} \begin{center} \sf \large Box~\expandafter\thetable. #2 \end{center} \medskip }% {% \end{fmpage} \end{center} \end{table} % need to do something about exercises that follow boxedTable } \newcommand{\cran}{\href{http://www.R-project.org/}{CRAN}} \title{Introduction to the Practice of Statistics using R: \\ Chapter 12} \author{ Nicholas J. Horton\thanks{Department of Mathematics, Amherst College, nhorton@amherst.edu} \and Ben Baumer } \date{\today} \begin{document} \maketitle \tableofcontents %\parindent=0pt <>= opts_chunk$set( dev="pdf", tidy=FALSE, fig.path="figures/", fig.height=4, fig.width=5, out.width=".57\\textwidth", fig.keep="high", fig.show="hold", fig.align="center", prompt=TRUE, # show the prompts; but perhaps we should not do this comment=NA ) options(continue=" ") @ <>= print.pval = function(pval) { threshold = 0.0001 return(ifelse(pval < threshold, paste("p<", sprintf("%.4f", threshold), sep=""), ifelse(pval > 0.1, paste("p=",round(pval, 2), sep=""), paste("p=", round(pval, 3), sep="")))) } @ <>= require(mosaic) trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice set.seed(123) # this allows for code formatting inline. Use \Sexpr{'function(x,y)'}, for exmaple. knit_hooks$set(inline = function(x) { if (is.numeric(x)) return(knitr:::format_sci(x, 'latex')) x = as.character(x) h = knitr:::hilight_source(x, 'latex', list(prompt=FALSE, size='normalsize')) h = gsub("([_#$%&])", "\\\\\\1", h) h = gsub('(["\'])', '\\1{}', h) gsub('^\\\\begin\\{alltt\\}\\s*|\\\\end\\{alltt\\}\\s*$', '', h) }) showOriginal=FALSE showNew=TRUE @ \section*{Introduction} This document is intended to help describe how to undertake analyses introduced as examples in the Sixth Edition of \emph{Introduction to the Practice of Statistics} (2002) by David Moore, George McCabe and Bruce Craig. More information about the book can be found at \url{http://bcs.whfreeman.com/ips6e/}. This file as well as the associated \pkg{knitr} reproducible analysis source file can be found at \url{http://www.math.smith.edu/~nhorton/ips6e}. This work leverages initiatives undertaken by Project MOSAIC (\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the \pkg{mosaic} package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignette (\url{http://cran.r-project.org/web/packages/mosaic/vignettes/MinimalR.pdf}). Additional examples of fitting multiple regression models can be found in the companion site which implements the examples within \emph{The Statistical Sleuth} in R (\url{http://www.math.smith.edu/~nhorton/sleuth}). To use a package within R, it must be installed (one time), and loaded (each session). The packages can be installed using the following command: <>= install.packages('mosaic') # note the quotation marks install.packages('gmodels') # note the quotation marks @ The {\tt \#} character is a comment in R, and all text after that on the current line is ignored. Once the package is installed (one time only), it can be loaded by running the command: <>= require(mosaic) require(gmodels) @ This needs to be done once per session. We also set some options to improve legibility of graphs and output. <>= trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice options(digits=3) @ The specific goal of this document is to demonstrate how to replicate the analysis described in Chapter 12: One-Way Analysis of Variance. \section{Inference for One-way ANOVA} \subsection{Exploratory analysis} We consider the case study on workplace safety introduced on page 641 (Example 12.3). <<>>= ds = read.csv("http://www.math.smith.edu/ips6e/Ch12/ex12_003.csv") favstats(SCI ~ jobcat, data=ds) @ Variants of the graphical displays (from Figure 12.3, page 642) are reproduced below. Note that the histograms (with overlaid normal curve) can be generated using separate stacked figures, or a single display can be created using overlapping density plots. <<>>= histogram(~ SCI| jobcat, fit="normal", layout=c(1, 3), data=ds) @ <<>>= densityplot(~ SCI, groups=jobcat, auto.key=TRUE, data=ds) @ <<>>= bwplot(SCI ~ jobcat, data=ds) @ \subsection{Pooled standard deviation} The pooled standard deviation can be easily calculated through the {\tt lm()} command: <<>>= ex12.5 = lm(SCI ~ jobcat, data=ds) summary(ex12.5) @ The value of 18.07 matches the results in Example 12.5 (page 647). \subsection{ANOVA table} The ANOVA table (Figure 12.8, page 649) can be generated from this linear model object. <<>>= anova(ex12.5) @ \subsection{Decomposition} As always, the total variability (SST) can be decomposed into part explained by the model (SSM or SSG, as described in Example 12.9, page 651) and part unexplained (SSE, or sums of squares for error). <<>>= meanval = mean(~ SCI, data=ds) SST = with(ds, sum((SCI - meanval)^2)) SST SSM = sum((fitted(ex12.5) - meanval)^2) SSM SSE = sum((residuals(ex12.5)^2)) SSE SSM + SSE @ We can use these results to verify the value of $s_p$ (the pooled estimate of the parameter $\sigma$) from our model. <<>>= MSE = SSE / (nrow(ds) - 2 - 1); MSE sqrt(MSE) @ This matches the value on page 652 (Example 12.11). \subsection{The F test} The F distribution is used to test the overall hypotheses (and other multiple degree of freedom tests). The p-value is the probability that a random variable having the $F(I-1, N-I)$ distribution is greater or equal to the calculated value of the F statistic. The values from Example 12.12 (page 653) can be found using the {\tt qf()} function: <<>>= qf(c(.90, .95, .975, .99, .999), df1 = 2, df2 = 587) @ (Note that the values in the book are only available for denominator degrees of freedom equal to 200, so the results are conservative). \subsection{Coefficient of determination} As usual, the $R^2$ (or coefficient of determination) can be calculated in multiple ways: <<>>= r.squared(ex12.5) SSM/SST @ \section{Comparing the means} \subsection{Contrasts} Contrasts can be used to calculate specific one degree of freedom tests of hypotheses. Recall the means from the worker data: <<>>= mean(SCI ~ jobcat, data=ds) @ We can also calculate these in terms of the regression parameter estimates: <<>>= mycoef = coef(ex12.5); mycoef mycoef[1] mycoef[1] + mycoef[2] mycoef[1] + mycoef[3] @ Contrasts can be fit using the {\tt fit.contrast()} function within the {\tt gmodels} package. <<>>= require(gmodels) fit.contrast(ex12.5, "jobcat", c(-1/2, 1, -1/2)) @ This matches the results for the first contrast (Example 12.18, pages 658--659). A similar process is used to test the second contrast (Example 12.20, page 659): <<>>= fit.contrast(ex12.5, "jobcat", c(-1, 0, 1)) @ These values can be used to calculate a 95\% confidence interval for the difference in means: <<>>= -0.79 + c(-1, 1)*qt(.975, df=587) * 2.08 @ \subsection{Multiple comparisons} A number of packages support the comparison of multiple tests using R (see for example the {\tt multcomp} package). \subsection{Power} The {\tt power.anova.test()} function can be used to calculate power and sample size for a one-way ANOVA. For the power of a reading comprehension study (Example 12.27, pages 668-669), this yields power of approximately 35\%. <<>>= power.anova.test(groups=3, n=10, within.var=7^2, between.var=var(c(41, 47, 44))) @ \end{document}