\documentclass[11pt]{article}
\usepackage[margin=1in,bottom=.5in,includehead,includefoot]{geometry}
\usepackage{hyperref}
\usepackage{language}
\usepackage{alltt}
\usepackage{fancyhdr}

\pagestyle{fancy}
\fancyhf{}
%% Now begin customising things. See the fancyhdr docs for more info.
\chead{}
\lhead[\sf \thepage]{\sf \leftmark}
\rhead[\sf \leftmark]{\sf \thepage}
\lfoot{}
\cfoot{Introduction to the Practice of Statistics using R: Chapter 11}
\rfoot{}

\newcounter{myenumi}
\newcommand{\saveenumi}{\setcounter{myenumi}{\value{enumi}}}
\newcommand{\reuseenumi}{\setcounter{enumi}{\value{myenumi}}}

\pagestyle{fancy}

\def\R{{\sf R}}
\def\Rstudio{{\sf RStudio}}
\def\RStudio{{\sf RStudio}}
\def\term#1{\textbf{#1}}
\def\tab#1{{\sf #1}}

\usepackage{relsize}

\newlength{\tempfmlength}
\newsavebox{\fmbox}
\newenvironment{fmpage}[1]
{
  \medskip
  \setlength{\tempfmlength}{#1}
  \begin{lrbox}{\fmbox}
    \begin{minipage}{#1}
      \vspace*{.02\tempfmlength}
      \hfill
      \begin{minipage}{.95 \tempfmlength}}
{\end{minipage}\hfill
      \vspace*{.015\tempfmlength}
    \end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}}
  \medskip
}

\newenvironment{boxedText}[1][.98\textwidth]%
{%
  \begin{center}
    \begin{fmpage}{#1}
}%
{%
    \end{fmpage}
  \end{center}
}

\newenvironment{boxedTable}[2][tbp]%
{%
  \begin{table}[#1]
    \refstepcounter{table}
    \begin{center}
      \begin{fmpage}{.98\textwidth}
        \begin{center}
          \sf \large Box~\expandafter\thetable. #2
        \end{center}
        \medskip
}%
{%
      \end{fmpage}
    \end{center}
  \end{table}
  % need to do something about exercises that follow boxedTable
}

\newcommand{\cran}{\href{http://www.R-project.org/}{CRAN}}

\title{Introduction to the Practice of Statistics using R: \\
Chapter 11}

\author{
Nicholas J. Horton\thanks{Department of Mathematics, Amherst College, nhorton@amherst.edu}
\and
Ben Baumer
}

\date{\today}

\begin{document}
\maketitle
\tableofcontents

%\parindent=0pt

<<include=FALSE>>=
opts_chunk$set(
  dev="pdf",
  tidy=FALSE,
  fig.path="figures/",
  fig.height=4,
  fig.width=5,
  out.width=".57\\textwidth",
  fig.keep="high",
  fig.show="hold",
  fig.align="center",
  prompt=TRUE,    # show the prompts; but perhaps we should not do this
  comment=NA
)
options(continue=" ")
@

<<include=FALSE>>=
print.pval = function(pval) {
  threshold = 0.0001
  return(ifelse(pval < threshold,
                paste("p<", sprintf("%.4f", threshold), sep=""),
                ifelse(pval > 0.1,
                       paste("p=", round(pval, 2), sep=""),
                       paste("p=", round(pval, 3), sep=""))))
}
@

<<include=FALSE>>=
require(mosaic)
trellis.par.set(theme=col.mosaic())  # get a better color scheme for lattice
set.seed(123)
# this allows for code formatting inline. Use \Sexpr{'function(x,y)'}, for example.
knit_hooks$set(inline = function(x) {
  if (is.numeric(x)) return(knitr:::format_sci(x, 'latex'))
  x = as.character(x)
  h = knitr:::hilight_source(x, 'latex', list(prompt=FALSE, size='normalsize'))
  h = gsub("([_#$%&])", "\\\\\\1", h)
  h = gsub('(["\'])', '\\1{}', h)
  gsub('^\\\\begin\\{alltt\\}\\s*|\\\\end\\{alltt\\}\\s*$', '', h)
})
showOriginal=FALSE
showNew=TRUE
@

\section*{Introduction}

This document is intended to describe how to undertake analyses introduced as
examples in the Sixth Edition of \emph{Introduction to the Practice of
Statistics} (2009) by David Moore, George McCabe and Bruce Craig.  More
information about the book can be found at \url{http://bcs.whfreeman.com/ips6e/}.
This file, as well as the associated \pkg{knitr} reproducible analysis source
file, can be found at \url{http://www.math.smith.edu/~nhorton/ips6e}.

This work leverages initiatives undertaken by Project MOSAIC
(\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching
of statistics, calculus, science and computing in the undergraduate curriculum.
In particular, we utilize the \pkg{mosaic} package, which was written to
simplify the use of R for introductory statistics courses.  A short summary of
the R needed to teach introductory statistics can be found in the \pkg{mosaic}
package vignette
(\url{http://cran.r-project.org/web/packages/mosaic/vignettes/MinimalR.pdf}).
Additional examples of fitting multiple regression models can be found on the
companion site, which implements the examples from \emph{The Statistical
Sleuth} in R (\url{http://www.math.smith.edu/~nhorton/sleuth}).

To use a package within R, it must be installed (one time) and loaded (each
session).  The package can be installed using the following command:
<<eval=FALSE>>=
install.packages('mosaic')   # note the quotation marks
@
The {\tt \#} character is a comment in R; all text after it on the current line
is ignored.

Once the package is installed (one time only), it can be loaded by running the
command:
<<>>=
require(mosaic)
@
This needs to be done once per session.  We also set some options to improve
the legibility of graphs and output.
<<>>=
trellis.par.set(theme=col.mosaic())  # get a better color scheme for lattice
options(digits=3)
@

The specific goal of this document is to demonstrate how to replicate the
analysis described in Chapter 11: Multiple Regression.

\section{Case study: GPA for computer science majors}

\subsection{Univariate analyses}

As always, we begin with a description of the predictor variables and outcome
(as displayed on pages 615 and 616).
<<>>=
ds = read.csv("http://www.math.smith.edu/ips6e/appendix/csdata.csv")
names(ds)
@

<<>>=
favstats(~ gpa, data=ds)
@

<<>>=
favstats(~ satm, data=ds)
@

<<>>=
favstats(~ satv, data=ds)
@

<<>>=
favstats(~ hsm, data=ds)
@

<<>>=
favstats(~ hss, data=ds)
@

<<>>=
favstats(~ hse, data=ds)
@

<<>>=
tally(~ sex, data=ds)
@

<<>>=
tally(~ sex, format="percent", data=ds)
@

<<>>=
tally(~ hsm, data=ds)
@

<<>>=
tally(~ hss, data=ds)
@

<<>>=
tally(~ hse, data=ds)
@

<<>>=
densityplot(~ gpa, data=ds)
@

<<>>=
densityplot(~ satm, data=ds)
@

<<>>=
densityplot(~ satv, data=ds)
@

\subsection{Bivariate comparisons}

We can replicate the correlation matrix in Figure 11.3 (page 617).
<<>>=
smallds = subset(ds, select=c("gpa", "satm", "satv", "hsm", "hss", "hse"))
cor(smallds)
@

A graphical display may also be helpful:
<<>>=
pairs(smallds, pch=".")
@

Note: jittering the categorical high school scores may improve readability:
<<>>=
ds = transform(ds, jhsm = jitter(hsm))
ds = transform(ds, jhss = jitter(hss))
ds = transform(ds, jhse = jitter(hse))
smallds = subset(ds, select=c("gpa", "satm", "satv", "jhsm", "jhss", "jhse"))
pairs(smallds, pch=".")
@

\subsection{Multiple regression model}

The output in Figure 11.4 (page 618) can be reproduced after fitting the model,
which will be saved in the object called {\tt lm1}.
<<>>=
lm1 = lm(gpa ~ hsm + hss + hse, data=ds)
coef(lm1)
rsquared(lm1)
summary(lm1)
anova(lm1)
@

Predicted values can be calculated from this model.
<<>>=
lm1fun = makeFun(lm1)
lm1fun(hsm=9, hss=8, hse=7)
lm1fun(hsm=9, hss=8, hse=7:9)
@

In Figure 11.6 (page 621), the HSS predictor is dropped from the model.
<<>>=
lm2 = lm(gpa ~ hsm + hse, data=ds)
summary(lm2)
anova(lm2)
@
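The text does not show a formal comparison of this reduced model with the
three-predictor model, but it is straightforward to carry one out: applying
{\tt anova()} to the two nested fits performs the $F$ test for the term that
was dropped.  (This is a small addition to the book's example rather than a
replication of it.)
<<>>=
anova(lm2, lm1)   # F test for dropping HSS from the three-predictor model
@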
In Figure 11.7 (page 622), the model is fit using SAT scores as explanatory
variables.
<<>>=
lm3 = lm(gpa ~ satm + satv, data=ds)
summary(lm3)
anova(lm3)
@

Finally, in Figure 11.8 (page 624), all possible explanatory variables are used.
<<>>=
lm.full = lm(gpa ~ satm + satv + hsm + hss + hse, data=ds)
summary(lm.full)
anova(lm.full)
@

\subsection{Regression diagnostics}

As always, we want to assess the fit of the model and the assumptions needed
for it.  We begin by considering the distribution of the residuals.
Figure 11.5 (page 620) displays the normal quantile plot, which can be
generated using the {\tt qqnorm()} function.
<<>>=
qqnorm(residuals(lm1))
@

This can also be generated using a built-in plot option:
<<>>=
plot(lm1, which=2)
@

Both displays indicate that the distribution of the residuals is approximately
normal (with some evidence for a slightly heavy left tail).  We could also
generate a histogram with an overlaid normal density (mean 0 and standard
deviation equal to the root MSE from the model).
<<>>=
histogram(~ residuals(lm1), fit="normal")
@

Next we want to consider the distribution of the residuals as a function of the
fitted (predicted) values, as we don't want to see a systematic pattern in the
relationship between these quantities.  As is often the case, there are
multiple ways to generate these plots within R.
<<>>=
plot(lm1, which=1)
@

This can also be created in other ways:
<<>>=
ds = transform(ds, fit=fitted(lm1))
ds = transform(ds, resid=residuals(lm1))
xyplot(resid ~ fit, type=c("p", "r", "smooth"), data=ds)
@
Here the {\tt "r"} option for {\tt type} adds the least-squares regression line
(a straight line), and the {\tt "smooth"} option adds a lowess (smoothed)
curve.  There is some indication of non-linearity, particularly in the tails
(but that is exactly where the lowess curve is least trustworthy, since there
are few observations).  Subject 188 may bear some additional scrutiny, as it
has a very large negative residual:
<<>>=
ds[188,]
@

We also want to display the residuals against each of the candidate explanatory
variables (including the SAT scores, which were not included in {\tt lm1}).
<<>>=
xyplot(resid ~ satm, type=c("p", "r", "smooth"), data=ds)
@

<<>>=
xyplot(resid ~ satv, type=c("p", "r", "smooth"), data=ds)
@

<<>>=
xyplot(resid ~ hsm, type=c("p", "r", "smooth"), data=ds)
@

The warnings are due to the discrete nature of the high school math variable,
which takes on relatively few values.  Jittering will help here:
<<>>=
xyplot(resid ~ jhsm, type=c("p", "r", "smooth"),
  xlab="Jittered High School Math grades", data=ds)
@

<<>>=
xyplot(resid ~ hss, type=c("p", "r", "smooth"), data=ds)
@

<<>>=
xyplot(resid ~ hse, type=c("p", "r", "smooth"), data=ds)
@

Overall, we see reasonable linearity in the relationships, though for some
subjects with low high school math or science grades, the regression model
tends to systematically underpredict.
<<>>=
subset(ds, hsm < 4 | hss < 4)
@

\subsection{More advanced residual analysis and regression diagnostics}

Additional built-in residual plots can be requested (but we won't be using
these much: check out \emph{The Statistical Sleuth} for more information).
<<>>=
plot(lm1, which=3)
@

<<>>=
plot(lm1, which=4)
@

<<>>=
plot(lm1, which=5)
@

<<>>=
plot(lm1, which=6)
@
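The quantities behind these built-in plots can also be extracted directly.  As
a final illustration (a small sketch that goes beyond the textbook example),
the code below adds the standardized residuals and Cook's distances from
{\tt lm1} to the dataset and lists the subjects whose Cook's distance exceeds
$4/n$, one conventional (and somewhat arbitrary) screening cutoff; the column
names {\tt rstand} and {\tt cooksd} are our own.
<<>>=
# not part of the textbook example: inspect the influence statistics directly
ds = transform(ds, rstand = rstandard(lm1), cooksd = cooks.distance(lm1))
subset(ds, cooksd > 4/nrow(ds))   # 4/n is a common rule-of-thumb cutoff
@

\end{document}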