---
title: "SDM4 in R: Re-expressing Data: Get it Straight! (Chapter 9)"
author: "Nicholas Horton (nhorton@amherst.edu), Patrick Frenett, and Sarah McDonald"
date: "June 13, 2018"
output:
  pdf_document:
    fig_height: 3
    fig_width: 6
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---
```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
require(mosaic)
```
```{r, include = FALSE}
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
tidy = FALSE, # display code as typed
size = "small" # slightly smaller font for code
)
```
## Introduction and background
This document describes how to carry out the analyses introduced
as examples in the Fourth Edition of *Stats: Data and Models* (2014) by De Veaux, Velleman, and Bock.
More information about the book can be found at http://wps.aw.com/aw_deveaux_stats_series. This
file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/sdm4.
This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.
## Chapter 9: Re-expressing Data: Get it Straight!
### Section 9.1: Straightening Scatterplots - The Four Goals
The `gf_histogram` function will generate the histograms shown in Figure 9.4 on page 249.
```{r}
library(mosaic)
library(readr)
options(digits = 3)
Forbes <- read.csv("http://nhorton.people.amherst.edu/sdm4/data/Forbes_Global_2000.csv")
gf_histogram(~ Assets..B., data = Forbes,
center = 100, binwidth = 200, type = "count",
xlab = "Assets ($B)", ylab = "# of Companies")
```
Because `Assets..B.` records assets in billions of dollars, we add 9 (that is, log10(1,000,000,000)) to each value of `log(Assets..B., 10)` to obtain log10(Assets) in dollars.
```{r}
gf_histogram(~ (log(Assets..B., 10) + 9), data = Forbes,
center = 0.25/2, binwidth = 0.25, type = "count",
xlab = "Log(Assets)", ylab = "# of Companies")
```
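The shift by 9 follows from the identity log10(x × 10^9) = log10(x) + 9. A minimal check of that identity is sketched below; the values in `assets_b` are made up for illustration and are not taken from the Forbes data.

```{r}
# Hypothetical asset values in billions of dollars (not from the Forbes file)
assets_b <- c(0.5, 12, 250, 3000)

# log10 of the dollar amounts, computed two ways
direct  <- log10(assets_b * 1e9)   # convert to dollars, then take the log
shifted <- log(assets_b, 10) + 9   # take the log of billions, then add 9

# TRUE when the two vectors agree up to floating-point tolerance
all.equal(direct, shifted)
```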
To group by whether the `Sector` is Finance or not, we use the `mutate` and `ifelse` functions. The scatterplot and histogram of Figure 9.7 on page 251 can then be generated by mapping the new variable to the `color = ` and `fill = ` arguments.
```{r}
Forbes <- mutate(Forbes, isFin = ifelse(Sector == "Finance", TRUE, FALSE))
# Scatterplot of sales against log10 assets, colored by finance sector
gf_point(Sales ~ (log(Assets..B., 10)), data = Forbes,
  color = ~ isFin,
  xlab = "Log(Assets($))", ylab = "Sales")
# Histogram of log10 assets, filled by finance sector
gf_histogram( ~ (log(Assets..B., 10)), data = Forbes,
  fill = ~ isFin, type = "count",
  binwidth = 0.75/3, center = 0.75/6,
  xlab = "Log(Assets($))")
```
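To see what the `mutate` and `ifelse` step produces, here is a small sketch on a toy data frame. The data frame `toy` and the column `isFin2` are hypothetical names used only for illustration. It also shows that the logical comparison alone gives the same indicator.

```{r}
library(dplyr)

# Hypothetical toy data, not the Forbes file
toy <- data.frame(
  Sector = c("Finance", "Energy", "Finance", "Retail"),
  Sales  = c(10, 25, 40, 5)
)

toy <- mutate(toy,
  isFin  = ifelse(Sector == "Finance", TRUE, FALSE),  # as in the chunk above
  isFin2 = Sector == "Finance"                        # same result without ifelse()
)

# Check that both indicators agree, then count rows in each group
identical(toy$isFin, toy$isFin2)
table(toy$isFin)
```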
### Section 9.2: Finding a Good Re-expression
Looking at the penguins example mentioned on page 251, we can see how different log transformations affect the scatterplot of the two variables:
```{r}
Penguins <- read.csv("http://nhorton.people.amherst.edu/sdm4/data/Penguins.csv")
# Raw scale: heart rate against dive duration, with a least-squares line
gf_point(DiveHeartRate ~ Duration, data = Penguins,
  title = "No Transformation", xlab = "Dive Duration (min)",
  ylab = "Dive Heart Rate") %>%
  gf_lm()
# Log of the response only
gf_point(log(DiveHeartRate) ~ Duration, data = Penguins,
  title = "Y Transformation", xlab = "Dive Duration (min)",
  ylab = "log(Dive Heart Rate)") %>%
  gf_lm()
# Log of both the response and the predictor
gf_point(log(DiveHeartRate) ~ log(Duration), data = Penguins,
  title = "X and Y Transformations", xlab = "log(Dive Duration (min))",
  ylab = "log(Dive Heart Rate)") %>%
  gf_lm()
```
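One rough numerical companion to comparing the plots by eye is the R-squared of the straight-line fit under each re-expression. The sketch below uses simulated data, not the Penguins file: the variables `x` and `y`, the decay rate, and the noise level are all assumptions chosen so that the response decays exponentially. R-squared is only one summary, and residual plots remain the better guide to straightness.

```{r}
set.seed(9)

# Simulated data (hypothetical): y decays exponentially in x with multiplicative noise
x <- seq(1, 20, length.out = 60)
y <- 120 * exp(-0.15 * x) * exp(rnorm(60, sd = 0.05))
sim <- data.frame(x = x, y = y)

# R-squared of a straight-line fit under each re-expression
r2 <- c(
  none   = summary(lm(y ~ x, data = sim))$r.squared,
  log_y  = summary(lm(log(y) ~ x, data = sim))$r.squared,
  log_xy = summary(lm(log(y) ~ log(x), data = sim))$r.squared
)
round(r2, 3)
```

Because these data were simulated from an exponential curve, taking the log of the response alone straightens the relationship best here. For real data, the same comparison would be run on the observed variables.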