\documentclass[11pt]{article}
\usepackage[margin=1in,bottom=.5in,includehead,includefoot]{geometry}
\usepackage{hyperref}
\usepackage{language}
\usepackage{alltt}
\usepackage{fancyhdr}
\pagestyle{fancy}
\fancyhf{}
%% Now begin customising things. See the fancyhdr docs for more info.
\chead{}
\lhead[\sf \thepage]{\sf \leftmark}
\rhead[\sf \leftmark]{\sf \thepage}
\lfoot{}
\cfoot{Statistical Sleuth in R: Chapter 4}
\rfoot{}

\newcounter{myenumi}
\newcommand{\saveenumi}{\setcounter{myenumi}{\value{enumi}}}
\newcommand{\reuseenumi}{\setcounter{enumi}{\value{myenumi}}}

\pagestyle{fancy}

\def\R{{\sf R}}
\def\Rstudio{{\sf RStudio}}
\def\RStudio{{\sf RStudio}}
\def\term#1{\textbf{#1}}
\def\tab#1{{\sf #1}}

\usepackage{relsize}

\newlength{\tempfmlength}
\newsavebox{\fmbox}
\newenvironment{fmpage}[1]
{
  \medskip
  \setlength{\tempfmlength}{#1}
  \begin{lrbox}{\fmbox}
    \begin{minipage}{#1}
      \vspace*{.02\tempfmlength}
      \hfill
      \begin{minipage}{.95 \tempfmlength}}
{\end{minipage}\hfill
  \vspace*{.015\tempfmlength}
  \end{minipage}
  \end{lrbox}
  \fbox{\usebox{\fmbox}}
  \medskip
}

\newenvironment{boxedText}[1][.98\textwidth]%
{%
  \begin{center}
  \begin{fmpage}{#1}
}%
{%
  \end{fmpage}
  \end{center}
}

\newenvironment{boxedTable}[2][tbp]%
{%
  \begin{table}[#1]
    \refstepcounter{table}
    \begin{center}
      \begin{fmpage}{.98\textwidth}
        \begin{center}
          \sf \large Box~\expandafter\thetable. #2
        \end{center}
        \medskip
}%
{%
      \end{fmpage}
    \end{center}
  \end{table}
  % need to do something about exercises that follow boxedTable
}

\newcommand{\cran}{\href{http://www.R-project.org/}{CRAN}}

\title{The Statistical Sleuth in R: \\ Chapter 4}

\author{
Ruobing Zhang \and
Kate Aloisio \and
Nicholas J. Horton\thanks{Department of Mathematics, Amherst College, nhorton@amherst.edu}
}

\date{\today}

\begin{document}
\maketitle
\tableofcontents
%\parindent=0pt

<<include=FALSE>>=
print.pval = function(pval) {
  threshold = 0.0001
  return(ifelse(pval < threshold,
                paste("p<", sprintf("%.4f", threshold), sep=""),
         ifelse(pval > 0.1,
                paste("p=", round(pval, 2), sep=""),
                paste("p=", round(pval, 3), sep=""))))
}
@

<<include=FALSE>>=
opts_chunk$set(
  dev="pdf",
  fig.path="figures/",
  fig.height=3,
  fig.width=4,
  out.width=".47\\textwidth",
  fig.keep="high",
  fig.show="hold",
  fig.align="center",
  prompt=TRUE,   # show the prompts; but perhaps we should not do this
  comment=NA     # turn off commenting of output (but perhaps we should not do this either)
)
@

<<include=FALSE>>=
require(Sleuth2)
require(mosaic)
trellis.par.set(theme=col.mosaic())  # get a better color scheme
set.seed(123)
# this allows for code formatting inline. Use \Sexpr{'function(x,y)'}, for example.
knit_hooks$set(inline = function(x) {
  if (is.numeric(x)) return(knitr:::format_sci(x, 'latex'))
  x = as.character(x)
  h = knitr:::hilight_source(x, 'latex', list(prompt=FALSE, size='normalsize'))
  h = gsub("([_#$%&])", "\\\\\\1", h)
  h = gsub('(["\'])', '\\1{}', h)
  gsub('^\\\\begin\\{alltt\\}\\s*|\\\\end\\{alltt\\}\\s*$', '', h)
})
showOriginal=FALSE
showNew=TRUE
@

\section{Introduction}

This document describes how to undertake analyses introduced as examples in the Second Edition of the \emph{Statistical Sleuth} (2002) by Fred Ramsey and Dan Schafer. More information about the book can be found at \url{http://www.proaxis.com/~panorama/home.htm}. This file, as well as the associated \pkg{knitr} reproducible analysis source file, can be found at \url{http://www.amherst.edu/~nhorton/sleuth}.
This work leverages initiatives undertaken by Project MOSAIC (\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the \pkg{mosaic} package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R commands needed to teach introductory statistics can be found in the mosaic package vignette (\url{http://cran.r-project.org/web/packages/mosaic/vignettes/MinimalR.pdf}).

To use a package within R, it must be installed (one time) and loaded (each session). The package can be installed using the following command:
<<eval=FALSE>>=
install.packages('mosaic')   # note the quotation marks
@

Once this is installed, it can be loaded by running the command:
<<eval=FALSE>>=
require(mosaic)
@

This needs to be done once per session. In addition, the data files for the \emph{Sleuth} case studies can be accessed by installing the \pkg{Sleuth2} package.
<<eval=FALSE>>=
install.packages('Sleuth2')   # note the quotation marks
@

<<eval=FALSE>>=
require(Sleuth2)
@

We also set some options to improve legibility of graphs and output.
<<>>=
trellis.par.set(theme=col.mosaic())  # get a better color scheme for lattice
options(digits=3, show.signif.stars=FALSE)
@

The specific goal of this document is to demonstrate how to calculate the quantities described in Chapter 4 (The Rank-Sum Test) using R.

\section{Space Shuttle O-Ring Failures}

Does launch temperature affect the number of O-ring incidents? This is the question addressed by case study 4.1 in the \emph{Sleuth}.

\subsection{Summary statistics and graphical display}

We begin by summarizing the data.
<<>>=
summary(case0401)
favstats(Incidents ~ Launch, data=case0401)
@

A total of \Sexpr{nrow(case0401)} launches are included in the data: \Sexpr{nrow(subset(case0401, Launch=="Cool"))} at cool temperatures and \Sexpr{nrow(subset(case0401, Launch=="Warm"))} at warm temperatures (Display 4.1, page 86).

<<>>=
histogram(~ Incidents | Launch, data=case0401)
@

\subsection{Permutation test on t-statistics}

To replicate the permutation test performed on page 96, we use the following code, which first calculates the $t$-statistic for the observed grouping.
<<>>=
t.test(Incidents ~ Launch, var.equal=TRUE, data=case0401)
@

We observe a test statistic of \Sexpr{t.test(Incidents ~ Launch, var.equal=TRUE, data=case0401)$statistic}.

Next we calculate the total number of regroupings, $C_{24,4}$ (the number of ways to choose 4 of the 24 launches to form the cool-temperature group).
<<>>=
C244 = factorial(24)/(factorial(4)*factorial(24-4)); C244   # equivalently, choose(24, 4)
@

There are a total of \Sexpr{C244} regroupings, with 16 possible (non-equiprobable) outcomes for the cool-temperature group: (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 2), (0, 0, 0, 3), (0, 0, 1, 1), (0, 0, 1, 2), (0, 0, 1, 3), (0, 0, 2, 3), (0, 1, 1, 1), (0, 1, 1, 2), (0, 1, 1, 3), (0, 1, 2, 3), (1, 1, 1, 1), (1, 1, 1, 2), (1, 1, 1, 3) and (1, 1, 2, 3).
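As a quick check of this enumeration (a sketch of ours, not a calculation from the book), we can generate all $C_{24,4}$ possible cool-temperature groups in R and list the distinct outcomes directly:
<<>>=
# not from the book: enumerate all possible sets of 4 incident counts
# that could be assigned to the cool launches
coolgroups = combn(case0401$Incidents, 4)
# sort each candidate group and keep only the distinct outcomes
outcomes = unique(t(apply(coolgroups, 2, sort)))
outcomes
nrow(outcomes)   # 16 distinct outcomes
@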
Because the observed cool-temperature outcome was (1, 1, 1, 3), we only need to examine that grouping and the two more extreme ones, (1, 1, 2, 3) and (0, 1, 2, 3).

<<>>=
# t.test for (1, 1, 2, 3): observations 1, 2, 4 and 24
case0401$Incidents[c(1,2,4,24)]
with(case0401, t.test(Incidents[c(1,2,4,24)], Incidents[-c(1,2,4,24)], var.equal=TRUE))
# t.test for (0, 1, 2, 3): observations 1, 4, 5 and 24
case0401$Incidents[c(1,4,5,24)]
with(case0401, t.test(Incidents[c(1,4,5,24)], Incidents[-c(1,4,5,24)], var.equal=TRUE))
@

The test statistic for (1, 1, 2, 3) is \Sexpr{t.test(case0401[c(1,2,4,24),]$Incidents, case0401[-c(1,2,4,24),]$Incidents, var.equal=TRUE)$statistic} and the test statistic for (0, 1, 2, 3) is \Sexpr{t.test(case0401[c(1,4,5,24),]$Incidents, case0401[-c(1,4,5,24),]$Incidents, var.equal=TRUE)$statistic}.

We already know that the total number of regroupings is $C_{24,4}$=\Sexpr{C244}. To calculate the $p$-value, we need to count the regroupings that yield each of (1, 1, 1, 3), (1, 1, 2, 3) and (0, 1, 2, 3). Among the 24 launches there are 17 zeros, 5 ones, 1 two and 1 three.

For the outcome (1, 1, 1, 3), the number of regroupings is $C_{5,3} \cdot C_{1,1}$ (choose 3 of the 5 ones and the single three):
<<>>=
C1113 = factorial(5)/(factorial(3)*factorial(5-3))*1; C1113
@

For the outcome (1, 1, 2, 3), it is $C_{5,2} \cdot C_{1,1} \cdot C_{1,1}$:
<<>>=
C1123 = factorial(5)/(factorial(2)*factorial(5-2))*1*1; C1123
@

For the outcome (0, 1, 2, 3), it is $C_{17,1} \cdot C_{5,1} \cdot C_{1,1} \cdot C_{1,1}$:
<<>>=
C0123 = 17*5*1*1; C0123
@

Now we can calculate the one-sided $p$-value as the proportion of regroupings that are as extreme as or more extreme than the observed one:
<<>>=
onep = (C1113+C1123+C0123)/C244; onep
@

The one-sided $p$-value from the permutation test on the $t$-statistic is \Sexpr{round(onep, 2)}.

Alternatively, we can approximate the $p$-value by repeatedly shuffling the group labels and recalculating the $t$-statistic (note that the book enumerates all of the possible regroupings to get an exact result).
<<>>=
result = t.test(Incidents ~ Launch, var.equal=TRUE, data=case0401)$statistic; result
nulldist = do(10000)*t.test(Incidents ~ shuffle(Launch), var.equal=TRUE, data=case0401)$statistic
histogram(~ t, groups=t >= result, v=result, data=nulldist)
tally(~ t >= result, format="proportion", data=nulldist)
@

This simulation resulted in a $p$-value of \Sexpr{tally(~ t >= result, format="proportion", data=nulldist)["TRUE"]}.

\section{Cognitive Load Theory in Teaching}

Does the use of modified instructional materials lead to quicker problem solving? That's the question addressed by case study 4.2 in the \emph{Sleuth}.

\subsection{Summary statistics and graphical display}

We begin by summarizing the data.
<<>>=
summary(case0402)
favstats(Time ~ Treatmt, data=case0402)
@

A total of \Sexpr{nrow(case0402)} subjects are included in the data: \Sexpr{nrow(subset(case0402, Treatmt=="Modified"))} students were assigned to modified instructional materials and \Sexpr{nrow(subset(case0402, Treatmt=="Conventional"))} students to conventional materials.

<<>>=
bwplot(Treatmt ~ Time, data=case0402)
@

<<>>=
densityplot(~ Time, groups=Treatmt, auto.key=TRUE, data=case0402)
@

\subsection{Rank-sum test}

We can calculate the one-sided $p$-value by following the rank-sum procedure. First, we find the rank-sum statistic $T$ (Display 4.5, page 91):
<<>>=
obsrank = rank(case0402$Time, ties.method="average"); obsrank
mt = sum(obsrank[1:14]); mt   # sum of the ranks for the first (modified) group
@

Next we calculate the $p$-value using a normal approximation (Display 4.7, page 93).
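In the notation used below (our summary of the procedure; the book's presentation is in Display 4.7), let $\bar{R}$ and $s_R$ denote the average and the standard deviation of all $N = n_1 + n_2$ ranks. The next chunk computes
\[
\mbox{Mean}(T) = n_1 \bar{R}, \qquad
\mbox{SD}(T) = s_R \sqrt{\frac{n_1 n_2}{N}}, \qquad
Z = \frac{T - \mbox{Mean}(T)}{\mbox{SD}(T)},
\]
and a one-sided $p$-value from the normal distribution. Here the two groups have equal size, so $n_1 = n_2 = n$ and $n_1 n_2/N = n^2/(2n)$; no continuity correction is applied.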
<<>>=
average = mean(obsrank); average   # average of all the ranks
sd = sd(obsrank); sd               # standard deviation of all the ranks
n = nrow(subset(case0402, Treatmt=="Modified")); n
MEANT = n * average; MEANT         # Mean(T)
SDT = sd * sqrt((n^2)/(2*n)); SDT  # SD(T)
z = (mt-MEANT)/SDT; z
p = pnorm(-abs(z)); p              # one-sided p-value
@

The one-sided $p$-value is \Sexpr{p}.

Alternatively, we can use the following code to carry out the Wilcoxon rank-sum test:
<<>>=
wilcox.test(Time ~ Treatmt, conf.int=TRUE, exact=TRUE, data=case0402)
@

So the one-sided $p$-value is \Sexpr{round(wilcox.test(Time ~ Treatmt, conf.int=TRUE, data=case0402)$p.value/2, 3)}. The 95\% confidence interval is (\Sexpr{round(wilcox.test(Time ~ Treatmt, conf.int=TRUE, data=case0402)$conf.int[1], 1)}, \Sexpr{round(wilcox.test(Time ~ Treatmt, conf.int=TRUE, data=case0402)$conf.int[2], 1)}). The book reports a 95\% confidence interval of (58, 159), which is slightly narrower than the interval calculated above; their procedure is described on page 94.

\end{document}