---
title: "IS5 in R: Displaying and Describing Data (Chapter 2)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "2025-01-20"
date-format: iso
format: pdf
toc: true
editor: source
---

```{r}
#| label: setup
#| include: false
library(mosaic)
library(tidyverse)
```

## Introduction and background

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock. This file as well as the associated Quarto reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic). A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

We begin by loading packages that will be required for our analyses.

```{r}
library(mosaic)
library(tidyverse)
```

## Chapter 2: Displaying and Describing Data

### Section 2.1: Summarizing and Displaying a Categorical Variable

```{r}
#| message: false
options(digits = 3) # make decimal results easier to read
Titanic <- read_csv("http://nhorton.people.amherst.edu/is5/data/Titanic.csv")
```

By default, the `read_csv()` function (loaded with the `tidyverse` package) prints the variable names. These messages were suppressed using the `message: false` code chunk option to save space and improve readability.
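As an alternative to the chunk option, `read_csv()` accepts a `show_col_types = FALSE` argument (in readr 2.0 or later) that silences the column specification message for a single call. A minimal sketch, using a made-up two-row dataset supplied as literal text through `I()` rather than any file from the book:

```{r}
library(readr)

# Hypothetical data for illustration only; not the Titanic file used above.
# I() tells read_csv() to treat the string as literal data (readr >= 2.0).
toy <- read_csv(I("Name,Class\nA,Crew\nB,First"), show_col_types = FALSE)
toy
```

The chunk option is the better fit when a chunk produces several messages; the argument is handy when only the `read_csv()` message should be hidden.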
```{r}
# Table 2.2, page 19
tally(~ Class, data = Titanic)
# Table 2.3
tally(~ Class, format = "percent", data = Titanic)
# Figure 2.2, page 19
gf_bar(~ Class, data = Titanic) |>
  gf_labs(x = "Ticket Class", y = "Number of People")
```

`GOAL(~ X)` is the general form of the modeling language for one variable in the `mosaic` package. We use `gf_bar()` to make a bar graph using the `ggformula` system, which is automatically made accessible by loading the `mosaic` package.

See the Mosaic Minimal Guide for more details: [https://cran.r-project.org/web/packages/mosaic/vignettes/MinimalRgg.pdf](https://cran.r-project.org/web/packages/mosaic/vignettes/MinimalRgg.pdf)

### Section 2.2: Displaying a Quantitative Variable

#### Ages of Those Aboard the Titanic

```{r}
# Figure 2.7, page 24
gf_histogram(~ Age, data = Titanic, binwidth = 5, ylab = "# of People", center = 5 / 2)
```

The function generates a warning because three of the ages are missing; this output can (and should!) be suppressed by adding `warning: false` as an option in this code chunk. Note: be sure to check what a warning is telling you before suppressing it!

#### Earthquakes and Tsunamis

We begin by reading in the data.

```{r}
#| message: false
#| warning: false
# Example 2.3, page 25
Earthquakes <- read_csv("http://nhorton.people.amherst.edu/is5/data/Tsunamis_2016.csv")
gf_histogram(~ Primary_Magnitude,
  data = Earthquakes, binwidth = 0.5,
  ylab = "# of Earthquakes", xlab = "Magnitude", center = 0.25
)
```

#### Stem-and-Leaf Displays

See page 26.

```{r}
#| message: false
# Figure 2.8, page 26
Pulse_rates <- read_csv("http://nhorton.people.amherst.edu/is5/data/Pulse_rates.csv")
gf_histogram(~ Pulse, data = Pulse_rates, binwidth = 5, center = 5 / 2)
with(Pulse_rates, stem(Pulse)) # base R function that doesn't have a `data =` option.
```

#### Dotplot

See Figure 2.9, page 27.

```{r}
#| message: false
Derby <- readr::read_csv("http://nhorton.people.amherst.edu/is5/data/Kentucky_Derby_2016.csv")
gf_dotplot(~Time_Sec, data = Derby, binwidth = 1) |>
  gf_labs(x = "Winning Time (sec)", y = "# of Races")
```

#### Density Plots

There are two forms of density plots: unshaded and shaded. The former will be useful when comparing multiple densities.

```{r}
#| warning: false
# Figure 2.10, page 27
gf_dens(~ Age, data = Titanic)
gf_density(~ Age, data = Titanic)
```

### Section 2.3: Shape

See displays on pages 28-29.

#### Consumer Price Index

First we need to load the data.

```{r}
#| message: false
CPI <- read_csv("http://nhorton.people.amherst.edu/is5/data/CPI_Worldwide.csv") |>
  janitor::clean_names()
```

The pipe operator (`|>`) takes the output of the first command and uses it as input for the next.

The variable names coming out of spreadsheet files can sometimes be quite wonky. The results from `read_csv()` are passed to the `clean_names()` function from the `janitor` package to make them more consistent.

```{r}
names(CPI)
```

You can use the `names()` function to check the reformatted names. (We'll often use the `glimpse()` command to provide even more information about a dataset.)

```{r}
# Example 2.5, page 30
gf_histogram(~consumer_price_index,
  data = CPI, ylab = "# of Cities",
  xlab = "Consumer Price Index", binwidth = 5, center = 5 / 2
)
```

#### Credit Card Expenditures

First we load the data.

```{r}
#| message: false
CreditCardEx <- read_csv("http://nhorton.people.amherst.edu/is5/data/Credit_card_charges.csv") |>
  janitor::clean_names()
# Figure 2.6, page 30
gf_histogram(~charges,
  data = CreditCardEx, ylab = "# of Customers",
  xlab = "Average Monthly Expenditure ($)", binwidth = 400, center = 200
)
```

### Section 2.4: Center

#### Finding Median and Mean

First we need to subset the data to the crew members.
```{r}
#| warning: false
TitanicCrew <- filter(Titanic, Class == "Crew")
# Figure 2.15, page 32
TitanicCrew |>
  mutate(color = ifelse(Age <= median(Age), "Less", "Greater")) |>
  gf_histogram(~ Age, fill = ~color, binwidth = 5, center = 5 / 2, ylab = "# of Crew Members") |>
  gf_labs(fill = "Age compared to median")
# Figure 2.16
gf_histogram(
  ~Age, data = TitanicCrew, ylab = "# of Crew Members",
  binwidth = 5, center = 5 / 2) |>
  gf_vline(xintercept = mean(~Age, data = TitanicCrew))
df_stats(~Age, data = TitanicCrew)
```

Another way to generate summary statistics is the `favstats()` command (we will stick to `df_stats()` because it is more flexible).

```{r}
favstats(~Age, data = TitanicCrew)
```

### Section 2.5: Spread

#### The Range

```{r}
range(~Age, data = TitanicCrew)
diff(range(~Age, data = TitanicCrew))
```

The `range()` function returns the minimum and maximum values (in that order), so we can use the `diff()` function to find the difference between the two.

#### The Interquartile Range

```{r}
df_stats(~ Age, data = TitanicCrew)
IQR(~ Age, data = TitanicCrew)
```

The `IQR()` function saves us from finding the IQR by hand, that is, from subtracting Q1 from Q3 in the `df_stats()` output.

#### Standard Deviation

```{r}
sd(~Age, data = TitanicCrew)
var(~Age, data = TitanicCrew)
```

#### Summarizing a Distribution

First we need to load the data.

```{r}
#| message: false
Nissan <- read_csv("http://nhorton.people.amherst.edu/is5/data/Nissan.csv")
# Step-by-Step Example, page 39
gf_histogram(~mpg,
  data = Nissan, binwidth = 1,
  xlab = "Fuel Efficiency (mpg)", ylab = "# of Fill-ups", center = 5 / 2
)
df_stats(~ mpg, data = Nissan)
```

#### Random Matters

First we need to load the data.
```{r}
#| message: false
Commute <- read_csv("http://nhorton.people.amherst.edu/is5/data/Population_Commute_Times.csv") |>
  janitor::clean_names()
# Figure 2.19, page 40
gf_histogram(~ commute_time,
  data = Commute, binwidth = 10,
  xlab = "Commute Time (min)", ylab = "# of Employees", center = 5
)
```
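
Most histogram calls above pair `binwidth` with `center = binwidth / 2`. A bin centered at `center` spans `center - binwidth / 2` to `center + binwidth / 2`, so this choice puts the bin edges on multiples of the bin width (0, 10, 20, and so on for the commute times). The base R sketch below works through that arithmetic; the `times` vector is made up for illustration and is not taken from the dataset, and the sketch does not try to reproduce how `gf_histogram()` assigns a value that falls exactly on an edge.

```{r}
# Hypothetical commute times (minutes), chosen to avoid values on a bin edge
times <- c(3, 12, 18, 25, 27, 31, 47, 58)
binwidth <- 10
center <- binwidth / 2

# Left edge of the bin centered at `center`, then one edge every `binwidth`
edges <- seq(from = center - binwidth / 2, to = 60, by = binwidth)
edges

# Count the values falling between consecutive edges
counts <- table(cut(times, breaks = edges))
counts
```

Changing `center` to 0 while keeping `binwidth = 10` would instead shift the edges to -5, 5, 15, and so on, which is usually harder to read on the axis.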