--- title: "IPS9 in R: Looking at Data -- Relationships (Chapter 2)" author: "Margaret Chien and Nicholas Horton (nhorton@amherst.edu)" date: "July 26, 2018" output: pdf_document: fig_height: 3 fig_width: 5 html_document: fig_height: 3 fig_width: 5 word_document: fig_height: 4 fig_width: 6 --- ```{r, include = FALSE} # Don't delete this chunk if you are using the mosaic package # This loads the mosaic and dplyr packages require(mosaic) require(vcd) # for the mosaicplot ``` ```{r, include = FALSE} # knitr settings to control how R chunks work. knitr::opts_chunk$set( tidy = FALSE, # display code as typed size = "small" # slightly smaller font for code ) ``` ## Introduction and background These documents are intended to help describe how to undertake analyses introduced as examples in the Ninth Edition of \emph{Introduction to the Practice of Statistics} (2017) by Moore, McCabe, and Craig. More information about the book can be found [here](https://macmillanlearning.com/Catalog/product/introductiontothepracticeofstatistics-ninthedition-moore). The data used in these documents can be found under Data Sets in the [Student Site](https://www.macmillanlearning.com/catalog/studentresources/ips9e?_ga=2.29224888.526668012.1531487989-1209447309.1529940008#). This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at https://nhorton.people.amherst.edu/ips9/. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. ## Chapter 2: Looking at Data -- Relationships This file replicates the analyses from Chapter 2: Looking at Data -- Relationships. First, load the packages that will be needed for this document: ```{r load-packages} library(mosaic) library(readr) library(vcd) ``` ### Section 2.1: Relationships ### Section 2.2: Scatterplots #### Example 2.8: Laundry detergents ```{r} # page 85 Laundry <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-008LAUNDRY.csv") ``` We use `read_csv()` from the `readr` package to import data into R. We can use `message=FALSE` to suppress the warning messages for readability. ```{r} # 2.10: Examine the spreadsheet favstats(~ Rating, data = Laundry) ``` The `favstats()` function summarizes useful statistics of variables. ```{r} # 2.11: Use the data set gf_histogram(~ Rating, data = Laundry, binwidth = 5) favstats(~ Price, data = Laundry) gf_histogram(~ Price, data = Laundry, binwidth = 5) ``` #### Example 2.9: Laundry detergents ```{r} Laundry <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-009LAUNDRY.csv") # Figure 2.1, page 86 gf_point(Rating ~ Price, data = Laundry, xlab = "Price per load (cents)") # Figure 2.2 gf_point(Rating ~ Price, data = Laundry, xlab = "Price per load (cents)") %>% gf_lims(x = c(0, 35)) ``` ```{r} # 2.12: Make a scatterplot (page 87) gf_jitter(Rating ~ Price, data = Laundry, xlab = "Price per load (cents)") %>% gf_lims(x = c(0, 35)) ``` We can use `gf_jitter()` to add some noise into the plot to show overlapped points. We also use `gf_lims()` to set x-axis limits on the plot. ```{r} # 2.13: Change the units Laundry2 <- Laundry %>% mutate(Price = Price/100) favstats(~ Price, data = Laundry2) gf_point(Rating ~ Price, data = Laundry2, xlab = "Price per load (dollars)") ``` We use `mutate()` to create new variables in a dataset. #### Example 2.10: Scatterplot with a straight line ```{r} Laundry <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-010LAUND.csv") # Figure 2.3, page 88 gf_point(Rating ~ Price, data = Laundry, xlab = "Price per load (cents)") %>% gf_lm() ``` #### Example 2.11: Education spending and population: Benchmarking ```{r} EduSpending <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-011EDSPEND.csv") head(EduSpending) ``` We can use the `head()` function to look at the first rows of a data set. ```{r} # Figure 2.5, page 90 gf_point(Spending ~ Population, data = EduSpending) %>% gf_lm() %>% gf_labs(x = "Population (millions)", y = "Spending ($ billion)") ``` #### Example 2.12: Calcium retention ```{r} Calcium <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-012CALCIUM.csv") # Figure 2.6, page 90 gf_point(CaRetention~ CaIntake, data = Calcium) %>% gf_smooth(span = 1) %>% gf_labs(x = "Calcium intake (mg/d)", y = "Calcium retention (mg/d)") ``` #### Example 2.13: Calcium retention with logarithms ```{r} Calcium <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-013CALCIUM.csv") # Figure 2.7, page 91 gf_point(LogRet ~ CaIntake, data = Calcium) %>% gf_lm() %>% gf_labs(x = "Calcium intake (mg/d)", y = "Log calcium retention") ``` #### Example 2.14: Education spending and population with logarithms ```{r} EduSpending <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-014EDSPEND.csv") EduSpending <- EduSpending %>% mutate(LogPop = log(Population), LogSpend = log(Spending)) # Figure 2.8, page 92 gf_point(LogSpend ~ LogPop, data = EduSpending) %>% gf_lm() %>% gf_labs(x = "Log population", y = "Log spending") ``` #### Example 2.15: Rating versus price and type of laundry detergent ```{r} Laundry <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-015LAUND.csv") # Figure 2.9, page 93 gf_point(Rating ~ Price, color = ~ Type, data = Laundry, xlab = "Price per load (cents)") %>% gf_lm() ``` #### Example 2.16: Laundry rating versus price with a smooth fit ```{r} Laundry <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-016LAUND.csv") # Figure 2.10, page 94-95 gf_point(Rating ~ Price, data = Laundry) %>% gf_smooth(span = .5) gf_point(Rating ~ Price, data = Laundry) %>% gf_smooth(span = 5) ``` #### Example 2.17: A smooth fit for education spending and population with logs ```{r} # Figure 2.8, page 92 gf_point(LogSpend ~ LogPop, data = EduSpending) %>% gf_smooth() %>% gf_labs(x = "Log population", y = "Log spending") ``` ### Section 2.3: Correlation #### Use Your Knowledge: Laundry detergents ```{r} # page 102 cor(Rating ~ Price, data = Laundry) ``` The `cor()` function finds the correlation of two variables. ### Section 2.4: Least Squares Regression #### Example 2.19: Fidgeting and fat gain ```{r} Fidgeting <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-019FIDGET.csv") # Figure 2.16, page 108 gf_point(Fat ~ NEA, data = Fidgeting) %>% gf_labs(x = "Nonexercise activity (calories)", y = "Fat gain (kilograms)") ``` #### Example 2.20: Regression line for fat gain ```{r} Fidgeting <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-020FIDGET.csv") # Figure 2.17, page 109 gf_point(Fat ~ NEA, data = Fidgeting) %>% gf_lm() %>% gf_labs(x = "Nonexercise activity (calories)", y = "Fat gain (kilograms)") # Use Your Knowledge 2.61: Plot the line gf_point(Fat ~ NEA, data = Fidgeting) %>% gf_abline(slope = -.00344, intercept = 4.505) %>% gf_labs(x = "Nonexercise activity (calories)", y = "Fat gain (kilograms)") + ylim(0, 5) ``` #### Example 2.21: Prediction for fat gain ```{r} fatlm <- lm(Fat ~ NEA, data = Fidgeting) fatfun <- makeFun(fatlm) fatfun(NEA = 400) ``` We use `makeFun()` to create a function. Here, we make a function from our linear model, created from `lm()`, so we can find the output of a certain value of `NEA`. #### Example 2.24: Regression ```{r} # page 113 msummary(fatlm) ``` The `msummary()` function shows the properties of the function. #### Example 2.25: Fidgeting and fat gain ```{r} # Figure 2.20, page 115 (split into two plots) gf_point(Fat ~ NEA, data = Fidgeting) %>% gf_lm() %>% gf_labs(x = "Nonexercise activity (calories)", y = "Fat gain (kilograms)") gf_point(NEA ~ Fat, data = Fidgeting) %>% gf_lm(color = "red") %>% gf_labs(x = "Fat gain (kilograms)", y = "Nonexercise activity") # Models fatlm NEAlm <- lm(NEA ~ Fat, data = Fidgeting) NEAlm ``` ### Section 2.5: Cautions about Correlation and Regression #### Example 2.26: Residuals for fat gain Here, we find the residual: ```{r} fatfun(NEA = 135) 2.7 - fatfun(NEA = 135) ``` ```{r} # Figure 2.23, page 124 gf_point(Fat ~ NEA, data = Fidgeting) %>% gf_lm() %>% gf_labs(x = "Nonexercise activity (calories)", y = "Fat gain (kilograms)") gf_point(resid(fatlm) ~ NEA, data = Fidgeting) %>% gf_lm() %>% gf_labs(x = "Nonexercise activity (calories)", y = "Residual") ``` #### Example 2.27: Patterns in birthrate and Internet user residuals ```{r} IntBirth <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-027INBIRTH.csv") intbirthlm <- lm(BirthRate2011 ~ Users, data = IntBirth) # Figure 2.24, page 126 gf_point(BirthRate2011 ~ Users, data = IntBirth) %>% gf_lm() %>% gf_labs(x = "Internet users per 100 people", y = "Births per 1000 people") gf_point(resid(intbirthlm) ~ Users, data = IntBirth) %>% gf_lm() %>% gf_labs(x = "Internet users per 100 people", y = "Residual") ``` #### Example 2.28: Diabetes and blood sugar ```{r} Diabetes <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-028HBA1C.csv") diabeteslm <- lm(FPG_mg_ml ~ HbA1c_percent, data = Diabetes) # Figure 2.25, page 127 gf_point(FPG_mg_ml ~ HbA1c_percent, data = Diabetes) %>% gf_lm() %>% gf_labs(x = "HbA1c (percent)", y = "Fasting plasma glucose (mg/dl)") gf_point(resid(diabeteslm) ~ HbA1c_percent, data = Diabetes) %>% gf_lm() %>% gf_labs(x = "HbA1c (percent)", y = "Residual") ``` #### Example 2.29: Influential observations We can use the `filter()` function to remove rows from a data set: ```{r} without15lm <- lm(FPG_mg_ml ~ HbA1c_percent, data = filter(Diabetes, FPG_mg_ml <= 300)) # model without subject 15 without18lm <- lm(FPG_mg_ml ~ HbA1c_percent, data = filter(Diabetes, HbA1c_percent <= 18)) # model without subject 18 # Figure 2.26, page 129 gf_point(FPG_mg_ml ~ HbA1c_percent, data = Diabetes) %>% gf_lm(color = "red") %>% gf_fun(without15lm, linetype = 2, color = "red") %>% gf_fun(without18lm, linetype = 3, color = "red") %>% gf_labs(x = "HbA1c (percent)", y = "Fasting plasma glucose (mg/dl)") ``` ### Section 2.6: Data Analysis from Two Way Tables #### Example 2.33: Is the calcium intake adequate? ```{r} Calcium <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-033IOM.csv") Calcium ``` To create a data set with the structure we want (with each count as an observation), we can use `rbind()`. ```{r} # page 137 # Creating data set from counts in the data table CalciumC <- rbind( do(194) * data.frame(Age = "A05to10", Met = "No"), do(861) * data.frame(Age = "A05to10", Met = "Yes"), do(557) * data.frame(Age = "A11to13", Met = "No"), do(417) * data.frame(Age = "A11to13", Met = "Yes") ) # Table tally(Met ~ Age, data = CalciumC) ``` #### Example 2.34: Add the margins to the table ```{r} tally(Met ~ Age, data = CalciumC, margins = TRUE) tally(Age ~ Met, data = CalciumC, margins = TRUE) ``` #### Example 2.35: The joint distribution ```{r} tally(~ Age + Met, data = CalciumC, format = "proportion") ``` #### Example 2.36: The marginal distribution of age ```{r} tally(~ Age, data = CalciumC, format = "proportion") ``` #### Example 2.37: The marginal distribution of "met requirement" ```{r} tally(~ Met, data = CalciumC, format = "proportion") ``` #### Example 2.39: Conditional distribution of "met requirement" for children aged 5 to 10 ```{r} # page 141 tally(Met ~ Age, data = CalciumC, format = "proportion") ``` #### Example 2.40: Software output We can use the `mosaic()` function from the `vcd` package to create mosaic plots with color. ```{r} # Figure 2.28 mosaic plot (page 143) vcd::mosaic(~ Met + Age, data = CalciumC, shade = TRUE) ``` #### Example 2.41: Which customer service representative is better? ```{r} # page 143 CustomerService <- read_csv("https://nhorton.people.amherst.edu/ips9/data/chapter02/EG02-041CUSTSER.csv") CustomerService %>% select(Rep, GoalMet, Count) ``` #### Example 2.42: Look at the data more carefully ```{r} CustomerService %>% select(Rep, GoalMet_1, Count_1, GoalMet_2, Count_2) ``` ### Section 2.7: The Question of Causation