---
title: "SDM4 in R: Re-expressing Data: Get it Straight! (Chapter 9)"
author: "Nicholas Horton (nhorton@amherst.edu), Patrick Frenett, and Sarah McDonald"
date: "June 13, 2018"
output: 
  pdf_document:
    fig_height: 3
    fig_width: 6
  html_document:
    fig_height: 3
    fig_width: 5
  word_document:
    fig_height: 4
    fig_width: 6
---


```{r, include = FALSE}
# Don't delete this chunk if you are using the mosaic package
# This loads the mosaic and dplyr packages
require(mosaic)
```

```{r, include = FALSE}
# knitr settings to control how R chunks work.
require(knitr)
opts_chunk$set(
  tidy = FALSE,     # display code as typed
  size = "small"    # slightly smaller font for code
)
```

## Introduction and background 

This document is intended to help describe how to undertake analyses introduced 
as examples in the Fourth Edition of *Stats: Data and Models* (2014) by De Veaux, Velleman, and Bock.
More information about the book can be found at http://wps.aw.com/aw_deveaux_stats_series.  This
file as well as the associated R Markdown reproducible analysis source file used to create it can be found at http://nhorton.people.amherst.edu/sdm4.

This work leverages initiatives undertaken by Project MOSAIC (http://www.mosaic-web.org), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the `mosaic` package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the mosaic package vignettes (http://cran.r-project.org/web/packages/mosaic).
A paper describing the mosaic approach was published in the *R Journal*: https://journal.r-project.org/archive/2017/RJ-2017-024.

## Chapter 9: Re-expressing Data: Get it Straight!

### Section 9.1: Straightening Scatterplots - The Four Goals

The `gf_histogram` function will generate the histograms shown in Figure 9.4 on page 249.

```{r}
library(mosaic)
library(readr)
options(digits = 3)
Forbes <- read.csv("http://nhorton.people.amherst.edu/sdm4/data/Forbes_Global_2000.csv")

# gf_histogram() plots counts by default
gf_histogram(~ Assets..B., data = Forbes, 
          center = 100, binwidth = 200,
          xlab = "Assets ($B)", ylab = "# of Companies")
```

Because `Assets..B.` records assets in billions of dollars, we add 9 (the base-10 log of 1,000,000,000) to the base-10 log of `Assets..B.` to obtain the base-10 log of assets in dollars.

```{r}
# log(x, 10) is the base-10 log; adding 9 converts from $B to $
gf_histogram(~ (log(Assets..B., 10) + 9), data = Forbes, 
          center = 0.25/2, binwidth = 0.25,
          xlab = "Log(Assets)", ylab = "# of Companies")
```
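
The shift by 9 used above follows from the identity log10(10^9 x) = log10(x) + 9. The chunk below is a minimal sketch that checks this numerically; the values in `assets_b` are made up for illustration and are not taken from the Forbes data.

```{r}
# Hypothetical asset values in billions of dollars (NOT from the Forbes data)
assets_b <- c(0.5, 12, 250, 2700)

# Base-10 log of assets in dollars, computed two ways
direct  <- log10(assets_b * 1e9)    # convert to dollars, then take the log
shifted <- log(assets_b, 10) + 9    # take the log of billions, then add 9

# TRUE if the two versions agree up to floating-point tolerance
all.equal(direct, shifted)
```

Note that `log(x, 10)` and `log10(x)` compute the same quantity, so either form can be used.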


To group by whether the `Sector` is Finance or not, we use the `mutate` and `ifelse` functions. The scatterplot and histogram of Figure 9.7 on page 251 can then be generated by mapping the new variable to the `color` and `fill` aesthetics.

```{r}
Forbes <- mutate(Forbes, isFin = ifelse(Sector == "Finance", TRUE, FALSE))

# x is the base-10 log of assets in billions of dollars
gf_point(Sales ~ (log(Assets..B., 10)), data = Forbes,
       color = ~ isFin,
       xlab = "Log(Assets ($B))", ylab = "Sales")

gf_histogram( ~ (log(Assets..B., 10)), data = Forbes,
       fill = ~ isFin,
       binwidth = 0.75/3, center = 0.75/6,
       xlab = "Log(Assets ($B))")

```
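
The grouping step above can be tried on a small example. The sketch below uses a made-up data frame (`toy`, with hypothetical sectors and a hypothetical `AssetsB` column) rather than the Forbes data. It also shows that the comparison `Sector == "Finance"` already returns a logical vector, so the `ifelse` wrapper is optional.

```{r}
library(dplyr)

# Hypothetical miniature data set; sectors and values are made up
toy <- data.frame(
  Sector  = c("Finance", "Energy", "Finance", "Retail"),
  AssetsB = c(1500, 300, 900, 40)
)

# ifelse() version, as in the chunk above
toy <- mutate(toy, isFin = ifelse(Sector == "Finance", TRUE, FALSE))

# The comparison alone gives the same logical indicator
toy <- mutate(toy, isFin2 = Sector == "Finance")

# TRUE if both approaches produce the same column
identical(toy$isFin, toy$isFin2)
```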

### Section 9.2: Finding a Good Re-expression

Looking at the penguins example mentioned on page 251, we can see how different log transformations affect the scatterplot of the two variables:

```{r}
Penguins <- read.csv("http://nhorton.people.amherst.edu/sdm4/data/Penguins.csv")

# Axis labels match the plotted variables: heart rate (y) against duration (x)
gf_point(DiveHeartRate ~ Duration, data = Penguins,
       title = "No Transformation", xlab = "Dive Duration (min)", 
       ylab = "Dive Heart Rate") %>%
  gf_lm()

gf_point(log(DiveHeartRate) ~ Duration, data = Penguins,
       title = "Y Transformation", xlab = "Dive Duration (min)", 
       ylab = "log(Dive Heart Rate)") %>%
  gf_lm()

gf_point(log(DiveHeartRate) ~ log(Duration), data = Penguins,
         title = "X and Y Transformations", xlab = "log(Dive Duration (min))", 
         ylab = "log(Dive Heart Rate)") %>%
  gf_lm()
```
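
As a rough numerical companion to the plots, one could compare how well a straight line fits under each re-expression. The sketch below does this with simulated data (not the Penguins data), assuming a power-law relationship with multiplicative noise. The names `sim` and `r2` and all parameter values are hypothetical choices for illustration.

```{r}
set.seed(9)

# Simulated data (NOT the Penguins data): y follows a power law in x
x <- runif(100, min = 1, max = 20)
y <- 120 * x^(-0.4) * exp(rnorm(100, sd = 0.1))
sim <- data.frame(x = x, y = y)

# R-squared of the straight-line fit under each re-expression
r2 <- c(
  none   = summary(lm(y ~ x, data = sim))$r.squared,
  log_y  = summary(lm(log(y) ~ x, data = sim))$r.squared,
  log_xy = summary(lm(log(y) ~ log(x), data = sim))$r.squared
)
round(r2, 3)
```

R-squared values computed on different response scales are only a loose guide. The scatterplot and the residuals remain the main tools for judging whether a re-expression has straightened the relationship.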