---
title: "IS5 in R: Displaying and Describing Data (Chapter 2)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "December 19, 2020"
output:
  pdf_document:
    fig_height: 3
    fig_width: 5
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---

```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic package (which also loads dplyr), plus readr and janitor
library(mosaic)
library(readr)
library(janitor)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
  tidy = FALSE,     # display code as typed
  size = "small"    # slightly smaller font for code
)
```

## Introduction and background

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file, as well as the associated R Markdown reproducible analysis source file used to create it, can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

## Chapter 2: Displaying and Describing Data

### Section 2.1: Summarizing and Displaying a Categorical Variable

```{r message = FALSE}
library(mosaic)
library(readr)
library(janitor)   # for variable names
options(digits = 3)
Titanic <- read_csv("http://nhorton.people.amherst.edu/is5/data/Titanic.csv")
```

By default, `read_csv()` prints a column specification listing the variable names and their types.
These messages were suppressed using the `message = FALSE` code chunk option to save space and improve readability.

```{r}
# Table 2.2, page 19
tally(~Class, data = Titanic)
# Table 2.3
tally(~Class, format = "percent", data = Titanic)
# Figure 2.2, page 19
gf_bar(~Class, data = Titanic) %>%
  gf_labs(x = "Ticket Class", y = "Number of People")
```

`GOAL(~ X)` is the general form of the modeling language for one variable in the `mosaic` package, where `GOAL` is the function to apply (such as `tally()` or `gf_bar()`) and `X` is the variable of interest.

We use `gf_bar()` to make a bar graph using the `ggformula` system, which is installed and loaded automatically along with the `mosaic` package.
See the Minimal Guide for more details: https://cran.r-project.org/web/packages/mosaic/vignettes/MinimalRgg.pdf

### Section 2.2: Displaying a Quantitative Variable

#### Ages of Those Aboard the Titanic

```{r}
# Figure 2.7, page 24
gf_histogram(~Age, data = Titanic, binwidth = 5, ylab = "# of People", center = 5 / 2)
```

The function generates a warning because three of the ages are missing; this output can (and should!) be suppressed by adding `warning = FALSE` as an option in this code chunk.

#### Earthquakes and Tsunamis

We begin by reading in the data.

```{r message = FALSE, warning = FALSE}
# Example 2.3, page 25
Earthquakes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Tsunamis_2016.csv")
gf_histogram(~Primary_Magnitude,
  data = Earthquakes, binwidth = 0.5,
  ylab = "# of Earthquakes", xlab = "Magnitude", center = 0.25
)
```

#### Stem-and-Leaf Displays

See page 26.

```{r message=FALSE}
# Figure 2.8, page 26
Pulse_rates <- read_csv("http://nhorton.people.amherst.edu/is5/data/Pulse_rates.csv")
gf_histogram(~Pulse, data = Pulse_rates, binwidth = 5, center = 5 / 2)
with(Pulse_rates, stem(Pulse))
```

#### Dotplot

See Figure 2.9, page 27.

```{r message = FALSE}
Derby <- read_csv("http://nhorton.people.amherst.edu/is5/data/Kentucky_Derby_2016.csv")
gf_dotplot(~Time_Sec, data = Derby, binwidth = 1) %>%
  gf_labs(x = "Winning Time (sec)", y = "# of Races")
```

#### Density Plots

There are two forms of density plots: unshaded (`gf_dens()`) and shaded (`gf_density()`). The former is useful when comparing multiple densities.

```{r, warning = FALSE}
# Figure 2.10, page 27
gf_dens(~Age, data = Titanic)
gf_density(~Age, data = Titanic)
```

### Section 2.3: Shape

See displays on pages 28-29.

#### Consumer Price Index

First we need to load the data.

```{r message = FALSE}
CPI <- read_csv("http://nhorton.people.amherst.edu/is5/data/CPI_Worldwide.csv") %>%
  janitor::clean_names()
names(CPI)
# Example 2.5, page 30
gf_histogram(~consumer_price_index,
  data = CPI, ylab = "# of Cities",
  xlab = "Consumer Price Index", binwidth = 5, center = 5 / 2
)
```

We can use `clean_names()` from the `janitor` package to convert the column names to a consistent format (lowercase with underscores by default) when necessary. You can use the `names()` function to check the reformatted names.

The pipe operator (`%>%`) passes the output of one expression to the next expression as its first argument.

#### Credit Card Expenditures

First we load the data.

```{r message = FALSE}
CreditCardEx <- read_csv("http://nhorton.people.amherst.edu/is5/data/Credit_card_charges.csv") %>%
  janitor::clean_names()
# Figure 2.6, page 30
gf_histogram(~charges,
  data = CreditCardEx, ylab = "# of Customers",
  xlab = "Average Monthly Expenditure ($)", binwidth = 400, center = 200
)
```

### Section 2.4: Center

#### Finding Median and Mean

First we subset the Titanic data to the crew members.

```{r warning = FALSE}
TitanicCrew <- filter(Titanic, Class == "Crew")
# Figure 2.15, page 32
TitanicCrew %>%
  mutate(color = ifelse(Age <= median(Age), "Less", "Greater")) %>%
  gf_histogram(~Age, fill = ~color, binwidth = 5, center = 5 / 2, ylab = "# of Crew Members") %>%
  gf_labs(fill = "Age compared to median")
# Figure 2.16
gf_histogram(
  ~Age, data = TitanicCrew, ylab = "# of Crew Members",
  binwidth = 5, center = 5 / 2) %>%
  gf_vline(xintercept = mean(~Age, data = TitanicCrew))
df_stats(~Age, data = TitanicCrew)
```

Another way to generate summary statistics is the `favstats()` command (we will stick to `df_stats()` because it is more flexible).

```{r}
favstats(~Age, data = TitanicCrew)
```

### Section 2.5: Spread

#### The Range

```{r}
range(~Age, data = TitanicCrew)
diff(range(~Age, data = TitanicCrew))
```

The `range()` function returns the minimum and maximum values (in that order), so we can use the `diff()` function to find the difference between the two.

#### The Interquartile Range

```{r}
df_stats(~Age, data = TitanicCrew)
IQR(~Age, data = TitanicCrew)
```

Using the `IQR()` function saves us from computing the IQR by hand as Q3 minus Q1 from the `df_stats()` output.

#### Standard Deviation

```{r}
sd(~Age, data = TitanicCrew)
var(~Age, data = TitanicCrew)
```

#### Summarizing a Distribution

First we need to load the data.

```{r message = FALSE}
Nissan <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nissan.csv")
# Step-by-Step Example, page 39
gf_histogram(~mpg,
  data = Nissan, binwidth = 1,
  xlab = "Fuel Efficiency (mpg)", ylab = "# of Fill-ups", center = 5 / 2
)
df_stats(~mpg, data = Nissan)
```

#### Random Matters

First we need to load the data.

```{r message = FALSE}
Commute <- read_csv("http://nhorton.people.amherst.edu/is5/data/Population_Commute_Times.csv") %>%
  janitor::clean_names()
# Figure 2.19, page 40
gf_histogram(~commute_time,
  data = Commute, binwidth = 10,
  xlab = "Commute Time (min)", ylab = "# of Employees", center = 5
)
```
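
Most histograms in this chapter set `center` to half of `binwidth`. The following is a minimal base R sketch of what that choice does, not the code that `gf_histogram()` runs internally. It assumes that one bin is centered at `center` and that the remaining bins are whole binwidths away, so the bin edges land on multiples of `binwidth`. The values in `times` are made up for illustration (they are not the IS5 commute data), and the names `times`, `breaks`, and `bin_counts` are hypothetical.

```{r}
# Hypothetical commute times in minutes (made-up values, not the IS5 data)
times <- c(3, 8, 12, 15, 19, 22, 27, 31, 38, 44)

binwidth <- 10
center <- binwidth / 2   # same convention as the gf_histogram() calls above

# The bin centered at `center` has its lower edge half a binwidth below it
lower <- center - binwidth / 2

# Shift that edge by whole binwidths so the first bin contains the minimum
start <- lower + binwidth * floor((min(times) - lower) / binwidth)
breaks <- seq(start, max(times) + binwidth, by = binwidth)
breaks

# Count observations per bin; this sketch closes each bin on the right
bin_counts <- table(cut(times, breaks = breaks, include.lowest = TRUE))
bin_counts
```

Because none of the made-up values falls exactly on a bin edge, the counts here do not depend on which side of each bin is closed. With real data, how a value on an edge is counted depends on the settings of the plotting function.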