---
title: "IS5 in R: Linear Regression (Chapter 7)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "December 19, 2020"
output:
  pdf_document:
    fig_height: 3
    fig_width: 5
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---

```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
library(mosaic)
library(readr)
library(janitor)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
  tidy = FALSE, # display code as typed
  size = "small" # slightly smaller font for code
)
```

## Introduction and background

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science, and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

## Chapter 7: Linear Regression

```{r message = FALSE}
library(mosaic)
library(readr)
library(janitor)
# Figure 7.1
BurgerKing <- read_csv("http://nhorton.people.amherst.edu/is5/data/Burger_King_items.csv") %>%
  janitor::clean_names()
```

By default, `read_csv()` prints the variable names.
These messages have been suppressed using the `message = FALSE` code chunk option to save space and improve readability. Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace).

```{r message = FALSE}
gf_point(fat_g ~ protein_g, data = BurgerKing) %>%
  gf_smooth() %>%
  gf_labs(x = "Protein (g)", y = "Fat (g)")
```

Here we add a smoother to show a clearer picture of the relationship.

### Section 7.1: Least Squares: The Line of "Best Fit"

See display on page 197.

We can calculate the residual for an item with 31 grams of protein by creating a function called `burgerfun` using the `mosaic::makeFun()` function.

```{r}
burgerlm <- lm(fat_g ~ protein_g, data = BurgerKing)
burgerfun <- makeFun(burgerlm)
burgerfun(protein_g = 31)
```

### Section 7.2: The Linear Model

```{r}
coef(burgerlm)
burgerfun(protein_g = 0)
burgerfun(32) - burgerfun(31)
```

```{r}
msummary(burgerlm)
```

#### Example 7.1: A Linear Model for Hurricanes

We begin by reading in the data.

```{r message = FALSE}
Hurricanes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Hurricanes_2015.csv") %>%
  janitor::clean_names()
gf_point(max_wind_speed_kts ~ central_pressure_mb, data = Hurricanes) %>%
  gf_lm() %>%
  gf_labs(x = "Central Pressure (mb)", y = "Max Wind Speed (kts)")
```

The function generates a warning because some of the data are missing: this output can (and should!) be suppressed by adding `warning = FALSE` as an option in this code chunk. Later examples will suppress this extraneous output.

### Section 7.3: Finding the Least Squares Line

#### Example 7.2: Finding the Regression Equation

```{r}
df_stats(~protein_g, data = BurgerKing)
df_stats(~fat_g, data = BurgerKing)
sx <- sd(~protein_g, data = BurgerKing)
sx
sy <- sd(~fat_g, data = BurgerKing)
sy
r <- cor(protein_g ~ fat_g, data = BurgerKing)
r # same as cor(fat_g ~ protein_g)!
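# As a check (not shown in the book), the intercept can be recovered from
# the means using b0 = ybar - b1 * xbar; xbar and ybar are helper names
# introduced here for illustration. The result should match coef(burgerlm)[1].
xbar <- mean(~protein_g, data = BurgerKing)
ybar <- mean(~fat_g, data = BurgerKing)
ybar - (r * sy / sx) * xbar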
r * sy / sx
coef(burgerlm)[2]
```

#### Step-by-Step Example: Calculating a Regression Equation

We begin by loading the bridge dataset.

```{r message = FALSE}
TompkinsBridges <- read_csv("http://nhorton.people.amherst.edu/is5/data/Tompkins_county_bridges_2016.csv") %>%
  janitor::clean_names()
gf_point(condition ~ age_at_inspection, data = TompkinsBridges) %>%
  gf_smooth() %>% # To show relationship
  gf_labs(x = "Age at Inspection", y = "Condition")
```

See calculations on page 203.

### Section 7.4: Regression to the Mean

See Figure 7.4 on page 205 to visualize standard deviations.

### Section 7.5: Examining the Residuals

```{r}
msummary(burgerlm)
# Figure 7.5, page 207
gf_point(resid(burgerlm) ~ protein_g, data = BurgerKing) %>%
  gf_lm() %>%
  gf_labs(x = "Protein (g)", y = "Residuals (g fat)")
# Figure 7.6
gf_histogram(~ resid(burgerlm), binwidth = 10, center = 5) %>%
  gf_labs(x = "Residuals (g fat)", y = "# of Residuals")
```

### Section 7.6: $R^2$--The Proportion of Variation Accounted for by the Model

```{r}
rsquared(burgerlm)
```

### Section 7.7: Regression Assumptions and Conditions
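The conditions for regression are typically assessed with a residual plot: the residuals should show no pattern and roughly equal spread. As a minimal sketch (not an example from the text), we can reuse the `burgerlm` model fit above and plot its residuals against the fitted values:

```{r}
# Residuals vs. fitted values for the Burger King model:
# look for no curvature and roughly constant vertical spread
gf_point(resid(burgerlm) ~ fitted(burgerlm)) %>%
  gf_lm() %>%
  gf_labs(x = "Fitted values (g fat)", y = "Residuals (g fat)")
```

This mirrors the plot of residuals against `protein_g` in Section 7.5; plotting against the fitted values gives an equivalent picture for a simple regression.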