---
title: "IS5 in R: Sampling Distribution Models and Confidence Intervals for Proportions (Chapter 13)"
author: "Nicholas Horton (nhorton@amherst.edu)"
date: "2025-01-15"
date-format: iso
format: pdf
toc: true
editor: source
---

```{r}
#| label: setup
#| include: false
library(mosaic)   
library(tidyverse)
```


## Introduction and background 

This document is intended to help describe how to undertake analyses introduced as examples in the Fifth Edition of *Intro Stats* (2018) by De Veaux, Velleman, and Bock.
This file as well as the associated Quarto reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/is5.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.
In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses.
A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (https://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

We begin by loading packages that will be required for our analyses.

```{r}
library(mosaic)
library(tidyverse)
```


## Chapter 13: Sampling Distribution Models and Confidence Intervals for Proportions

```{r}
#| message: false
Babies <- read_csv("http://nhorton.people.amherst.edu/is5/data/Babysamp_98.csv") |>
  janitor::clean_names() |>
  mutate(status = ifelse(preemie, "Premature", "Normal"))
glimpse(Babies)
```

By default, `read_csv()` prints the variable names.
These messages have been suppressed using the `message: false` code chunk option to save space and improve readability. 
Here we use the `clean_names()` function from the `janitor` package to sanitize the names of the columns (which would otherwise contain special characters or whitespace).   

The `mutate()` function is used in conjunction with the `ifelse()` function to create a new variable `status`.

```{r}
# Figure 13.1, page 411
gf_histogram(~ gestation, binwidth = 1, center = .5, fill = ~ status, data = Babies) |>
  gf_labs(x = "Gestation Time (weeks)", y = "Births", fill = "")
```

### Section 13.1: The Sampling Distribution Model for a Proportion

#### The Normal Model

### Section 13.2: When Does the Normal Model Work? Assumptions and Conditions

#### Random Matters: Does the Normal Model Always Work? Sampling Distributions for Other Statistics

```{r}
#| message: false
# page 418
BodyFat <- read_csv("http://nhorton.people.amherst.edu/is5/data/Bodyfat.csv") |>
  janitor::clean_names()
set.seed(3245) # To ensure we get the same values when we run it multiple times
num_sim <- 1000 # Number of samples
```

```{r}
# What does do() do?
df_stats(~ weight, data = sample(BodyFat, 10)) # df_stats of one random sample of 10
df_stats(~ weight, data = sample(BodyFat, 10)) # df_stats of another random sample
```

The `do()` function in the `mosaic` package is used in combination with the `*` operator to run functions repeatedly.
We will use these functions to sample in different ways.

```{r}
do(2) * df_stats(~ weight, data = sample(BodyFat, 10)) # finds df_stats twice

# For the visualization, we need num_sim = 1,000 df_stats
bodyfatsamples <- do(num_sim) * df_stats(~ weight, data = sample(BodyFat, 10))
```

Here the `do()` function repeatedly calculates the summary statistics for a random sample of 10 weights. 

```{r}
#| message: false
bodyfatsamples <- bodyfatsamples |>
  janitor::clean_names()
names(bodyfatsamples)
gf_histogram(~ median, data = bodyfatsamples, binwidth = 3, center = 1.5) |>
  gf_labs(x = "Medians", y = "# of Samples")
gf_histogram(~ sd^2, data = bodyfatsamples) |>
  gf_labs(x = "Variance", y = "# of Samples")
gf_histogram(~ min, data = bodyfatsamples, binwidth = 3, center = 1.5) |>
  gf_labs(x = "Minimums", y = "# of Samples")
```

### Section 13.3: A Confidence Interval for a Proportion


### Section 13.4: Interpreting Confidence Intervals: What Does 95% Confidence Really Mean?

First we can replicate the example on pages 423-424.

```{r}
y <- 1034
n <- 1520
phat <- y / n
phat
sephat <- sqrt(phat * (1 - phat) / n)
sephat
phat + c(-2, 2) * sephat # matches interval on the bottom of page 423
```

Note that we should actually use 1.96 rather than 2 as the multipliers.

We can also use the `prop.test()` and `binom.test()` functions to calculate the interval for us.

```{r}
prop.test(y, n, correct = FALSE) # large sample methods
binom.test(y, n) # exact methods
```

The intervals are almost exactly the same (not surprising, given how large a sample size we have).

Next, we can recreate the simulation displayed in Figure 13.9 (page 422)
```{r}
set.seed(118)
CIsim(n = 100, samples = 20) # We expect 19/20 intervals to cover the true mean
```

We expect 19 of the 20 intervals to cover the true mean, but since only 20 samples are drawn, there is more variability. Only 18 out of the 20 intervals cover the true mean in this example.  

To get the actual plot, the code is more complicated.  

```{r}
set.seed(234)
findingpoints <- function(sampsize) {
  CItest <- do(1) * t.test(~preemie, data = sample(Babies, size = sampsize))
  # Using do() so that CItest can run as a data frame
  CItest <- CItest |>
    select(lower, upper) |>
    mutate(
      mean = (upper + lower) / 2,
      success = ifelse(lower <= .11 & upper >= .11, TRUE, FALSE)
    )
}

numsamp <- 20
ConfData <- do(numsamp) * findingpoints(sampsize = 100)

gf_point(mean ~ (1:numsamp), data = ConfData, color = ~ success) |>
  gf_segment(upper + lower ~ (1:numsamp) + (1:numsamp), data = ConfData) |>
  gf_hline(yintercept = ~ mean(preemie), data = Babies, color = 4) |>
  gf_labs(x = "", y = "Proportion")
```


### Section 13.5: Margin of Error: Certainty vs. Precision

### Section 13.6: Choosing the Sample Size