--- title: "IS5 in R: Scatterplots, Association, and Correlation (Chapter 6)" author: "Nicholas Horton (nhorton@amherst.edu)" date: "December 19, 2020" output: pdf_document: fig_height: 3 fig_width: 5 html_document: fig_height: 3 fig_width: 5 word_document: fig_height: 4 fig_width: 6 --- ```{r, include = FALSE} # Don't delete this chunk if you are using the mosaic package # This loads the mosaic and dplyr packages library(mosaic) library(readr) library(janitor) library(GGally) ``` ```{r, include = FALSE} # knitr settings to control how R chunks work. require(knitr) opts_chunk$set( tidy = FALSE, # display code as typed size = "small" # slightly smaller font for code ) ``` ## Introduction and background This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5. This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024. ## Chapter 6: Scatterplots, Association, and Correlation We begin by reading in the data. ```{r message = TRUE} library(mosaic) library(readr) library(janitor) Hurricanes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Tracking_hurricanes_2015.csv") ``` By default, `read_csv()` prints the variable names. These messages can (and should!) be suppressed using the `message=FALSE` code chunk option to save space and improve readability. ```{r} # Figure 6.1, page 164 gf_point(Error_72h ~ Year, data = Hurricanes, ylab = "Prediction Error") ``` ### Section 6.1: Scatterplots See dots on pages 164-165. #### Example 6.1: Comparing Prices Worldwide We begin by reading in the data. ```{r message = FALSE} Prices <- read_csv("http://nhorton.people.amherst.edu/is5/data/Prices_and_Earnings.csv") %>% janitor::clean_names() names(Prices) glimpse(Prices) ``` Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace). The `names()` function displays the names while the `glimpse()` function provides more detail about the dataset. ```{r} gf_point(food_costs ~ womens_clothing, data = Prices) %>% gf_labs(x = "Cost of Women's Clothes", y = "Food Costs ($)") gf_point(i_phone_4s_hr ~ wage_gross, data = Prices) %>% gf_labs(x = "Average Hourly Wage", y = "Hours to Earn an iPhone 4S") gf_point(clothing_index ~ hours_worked, data = Prices) %>% gf_labs(x = "Working Hours", y = "Clothes Index") gf_point(food_costs ~ vacation_days, data = Prices) %>% gf_labs(x = "Vacation Days (per year)", y = "Food Costs ($)") ``` #### Roles for Variables #### Smoothing Scatterplots We begin by reading in the data. ```{r message = FALSE, warning = FALSE} HopkinsForest <- read_csv("http://nhorton.people.amherst.edu/is5/data/Hopkins_Forest.csv") %>% janitor::clean_names() # Figure 6.2, page 168 gf_point(avg_wind_mph ~ day_of_year, data = HopkinsForest) %>% gf_smooth(se = FALSE) %>% gf_labs(x = "Day of Year", y = "Average Wind Speed (mph)") ``` The smoother warning messages provided by `gf_smooth()` have been removed from this output using the `warning = FALSE` code chunk option. #### Example 6.2: Smoothing Timeplots We will explore smoothing using the fitness data. ```{r, message=FALSE, warning = FALSE} Fitness <- read_csv("http://nhorton.people.amherst.edu/is5/data/Fitness_data.csv") %>% janitor::clean_names() gf_histogram(~weight, data = Fitness, binwidth = 1, center = .5) %>% gf_labs(x = "Weight (lb)", y = "# of Days") gf_point(weight ~ days_since_july_19_2014, data = Fitness) %>% gf_smooth(se = FALSE) %>% gf_labs(x = "Days Since July 19, 2014", y = "Weight (lb)") gf_boxplot(weight ~ as.factor(month), data = Fitness) %>% gf_labs(x = "Month", y = "Weight (lb)") ``` Warnings should be suppressed for your work once you have determined that they are innocuous. ### Section 6.2: Correlation We begin by reading in the data. ```{r message = FALSE} HeightsWeights <- read_csv("http://nhorton.people.amherst.edu/is5/data/Heights_and_weights.csv") # Figure 6.3, page 170 gf_point(Weight ~ Height, data = HeightsWeights) %>% gf_labs(x = "Height (in.)", y = "Weight (lb)") cor(Weight ~ Height, data = HeightsWeights) ``` See displays on pages 170 - 171. #### Step-by-Step Example: Looking at Association We begin by loading the Framingham data. ```{r message = FALSE, warning = FALSE} Framingham <- read_csv("http://nhorton.people.amherst.edu/is5/data/Framingham.csv") gf_point(SBP ~ DBP, data = Framingham) %>% gf_labs(x = "Diastolic BP (mm Hg)", y = "Systolic BP (mm Hg)") cor(SBP ~ DBP, data = Framingham) ``` #### Random Matters: Correlations Vary A recurring theme in the course involves random sampling of various sorts. Here we explore a sample of babies born in 1998. ```{r, cache=TRUE, message = FALSE} LiveBirths <- read_csv("http://nhorton.people.amherst.edu/is5/data/Babysamp_98.csv") %>% janitor::clean_names() LiveBirths <- LiveBirths %>% filter(dad_age != "NA") set.seed(14513) # To ensure we get the same values when we run it multiple times numsim <- 10000 # Number of samples gf_point(mom_age ~ dad_age, data = sample(LiveBirths, size = 50)) # Graph will look different for different samples cor(mom_age ~ dad_age, data = LiveBirths) # What does mosaic::do() do? cor(mom_age ~ dad_age, data = sample(LiveBirths, size = 50)) # Correlation of one random sample cor(mom_age ~ dad_age, data = sample(LiveBirths, size = 50)) # Correlation of another random sample do(2) * cor(mom_age ~ dad_age, data = sample(LiveBirths, size = 50)) # Finds the correlation twice # For the visualization, we need 10,000 correlations LiveCorr <- do(numsim) * cor(mom_age ~ dad_age, data = sample(LiveBirths, size = 50)) ``` The `do()` function runs, 10,000 times, the correlation and sampling functions on a random sample of 50. (We can use the chunk option `cache=TRUE` to enable caching to save results for next time.) ```{r} # Figure 6.8, page 176 gf_histogram(~cor, data = LiveCorr, binwidth = .05, center = .025) %>% gf_labs( x = "Correlation of Mother's Age and Father's Age in Samples of Size 50", y = "# of Samples" ) ``` ### Section 6.3: Warning: Correlation $\neq$ Causation The storks data is a classic example of how correlation does not always imply causation. ```{r message = FALSE} Storks <- read_csv("http://nhorton.people.amherst.edu/is5/data/Storks.csv") # Figure 6.9 gf_point(Population ~ Storks, data = Storks) %>% gf_labs(x = "# of Storks", y = "Human Population") ``` #### Correlation Tables We can display correlation tables as seen in Table 6.1 on page 178. ```{r message = FALSE} Companies <- read_csv("http://nhorton.people.amherst.edu/is5/data/Companies.csv") %>% janitor::clean_names() # Table 6.1, page 178 Companies %>% select(assets, sales, market_value, profits, cash_flow, employees) %>% cor() ``` ```{r fig.width = 7, fig.height = 7} Companies %>% select(assets, sales, market_value, profits, cash_flow, employees) %>% GGally::ggpairs() ``` ### Section 6.4: Straightening Scatterplots It's often possible to straightend scatterplots through use of a transformation. ```{r message = FALSE} FStops <- read_csv("http://nhorton.people.amherst.edu/is5/data/F-stops.csv") %>% janitor::clean_names() # Figure 6.10, page 179 gf_point(f_stop ~ shutter_speed, data = FStops) %>% gf_labs(x = "Shutter Speed (sec)", y = "f/stop") cor(f_stop ~ shutter_speed, data = FStops) ``` #### The Ladder of Powers #### f/Stops Again The f/Stops example is reviewed on page 181 (Figure 6.11) ```{r} gf_point(log(f_stop) ~ shutter_speed, data = FStops) %>% gf_labs(x = "Shutter Speed (sec)", y = "Log (f/stop)") # Figure 6.12 gf_point((f_stop)^2 ~ shutter_speed, data = FStops) %>% gf_labs(x = "Shutter Speed (sec)", y = "f/stop squareed") ``` See the displays in "What Can Go Wrong?" on pages 181-183.