---
title: "An Introduction to dplyr"
author: "Miles Ott (with additions by Nicholas Horton)"
date: September 8, 2017
output: learnr::tutorial
runtime: shiny_prerendered
---

```{r setup, include = FALSE}
library(learnr)
knitr::opts_chunk$set(
  echo = TRUE, tidy = FALSE, size = "small",
  fig.width = 8, fig.height = 6)

library(mosaic)
library(ggformula)
library(ggplot2movies)  # gets us movies dataset

Movies <- movies

Slim_Movies <- 
  movies %>% 
  select(title, budget, length)

RecentMovies <-
  movies %>%
  filter(year > 2000) %>%
  filter(mpaa != "") 
```

## Getting Organized

### In order to analyze your data you need to organize your data correctly

- Keeping your data **clean** and **tidy** is an important step of every data project

- Goal: learn how to take the data set you **have** and tidy it up to be the data set you **want** for your analyses


### Tidy Data 

Tidy data tables are made up of

- rows (cases/observational units) and 
- columns (variables).  

- The key is that *every* row is a case and *every* column is a variable.  

- No exceptions.


## Chaining 

The pipe syntax (`%>%`) takes a data frame (or data table) and passes it as the first argument of a function.  

- `x %>% f(y)` is the same as `f(x, y)`

- `y %>% f(x, ., z)` is the same as `f(x, y, z)` (the `.` marks where the piped value goes)
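
The two equivalences above can be checked directly. This is a minimal sketch: the vector `x` and the object names are made up for illustration, and `%>%` is assumed to come from `dplyr` (which re-exports it from `magrittr`).

```{r}
library(dplyr)  # provides %>% (re-exported from magrittr)

# Hypothetical toy vector, used only for illustration
x <- c(4, 9, 16)

# x %>% f(y) is the same as f(x, y)
piped_round  <- x %>% sqrt() %>% round(1)
nested_round <- round(sqrt(x), 1)
identical(piped_round, nested_round)

# y %>% f(x, ., z) is the same as f(x, y, z): the dot marks where y goes
piped_paste  <- "b" %>% paste("a", ., "c")
direct_paste <- paste("a", "b", "c")
identical(piped_paste, direct_paste)
```

Both comparisons should report that the piped and un-piped versions give the same result; the piped version simply reads left to right.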


## Building Tidy Data 

```{r, eval = FALSE}
object_name <- function_name(arguments) 
object_name <- data_table %>% function_name(arguments) 

object_name <-   
  data_table %>%  
  function_name(arguments) %>%  
  function_name(arguments)
```

- In chaining, the value on the left of `%>%` becomes the **first argument** to the function on the right.
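
Here is one concrete instance of the template, as a sketch only. The data table `toy_table` and the name `high_scores` are invented for illustration and are not part of the `Movies` data.

```{r}
library(dplyr)

# Hypothetical data table, invented for illustration
toy_table <- data.frame(
  name  = c("a", "b", "c", "d"),
  score = c(10, 25, 5, 40),
  stringsAsFactors = FALSE
)

# object_name <- data_table %>% function_name(arguments) %>% function_name(arguments)
high_scores <-
  toy_table %>%
  filter(score > 8) %>%    # keep rows with a score above 8
  arrange(desc(score))     # sort from highest to lowest score

high_scores
```

Each step receives the result of the previous step as its first argument, so the chain reads in the order the operations happen.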

## 5 Main Data Verbs

Data verbs take data tables as input and give data tables as output

1. `summarise()`: computes summary statistics

### Rows 

2. `filter()`: keeps the *cases* that meet a condition (dropping the unwanted ones)
3. `arrange()`: reorders the *cases* 

### Columns

4. `select()`: subsets *variables* (and `rename()` )
5. `mutate()`: creates new *variables* or transforms existing ones (`transmute()` is like `mutate()` but returns only the new variables)

### Other Data Verbs

- `distinct()`: returns each unique row once
- `sample_n()`: take random row(s)
- `head()`:  grab the first few rows
- `tail()`: grab the last few rows
- `group_by()`: subsequent functions are applied to each group separately
- `ungroup()`:  reverse the grouping action
- `summarise()`: commonly used with summary functions such as
    + `min()`, `max()`, `mean()`, `sum()`, `sd()`, `median()`, and `IQR()`
    + `n()`: number of observations in the current group
    + `n_distinct()`: number of unique values of a variable
    + `first()`, `last()`, and `nth(x, n)`: (like `x[1]`, `x[length(x)]`, and `x[n]` )
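
As a rough illustration of the counting and positional helpers, the sketch below uses dplyr's `n()`, `n_distinct()`, `first()`, `last()`, and `nth()` inside `summarise()`. The data frame `toy_films` and its columns are hypothetical.

```{r}
library(dplyr)

# Hypothetical data, invented for illustration
toy_films <- data.frame(
  genre   = c("drama", "comedy", "drama", "comedy", "drama"),
  minutes = c(120, 95, 140, 88, 101),
  stringsAsFactors = FALSE
)

toy_summary <-
  toy_films %>%
  summarise(
    n_films      = n(),                # number of rows
    n_genres     = n_distinct(genre),  # number of unique genres
    first_length = first(minutes),     # like minutes[1]
    last_length  = last(minutes),      # like minutes[length(minutes)]
    third_length = nth(minutes, 3)     # like minutes[3]
  )

toy_summary
```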


## Movies Data

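The `Movies` data set (a copy of `movies` from the `ggplot2movies` package) contains information about movies, including each film's title, year, length, budget, IMDB user rating, and MPAA rating.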

```{r, inspect, exercise = TRUE}
inspect(Movies)
```


```{r, nrow, exercise = TRUE}
Movies %>% nrow()
Movies %>% names()
```


### Selecting some of the variables

#### Exercise

Let's make a data set that has fewer variables.

```{r, slim-movies, exercise = TRUE}
Slim_Movies <- 
  Movies %>% select(title, budget, length)
Slim_Movies %>% names()
```

- Reminder: `select()` is for *columns*

- Note: this is equivalent code:

```{r eval = FALSE}
Slim_Movies <- 
  Movies %>% select(., title, budget, length)
```
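
`rename()` was listed alongside `select()` in the column verbs. A small sketch of the two together, using a hypothetical data frame `toy_cols` rather than the `Movies` data:

```{r}
library(dplyr)

# Hypothetical data, invented for illustration
toy_cols <- data.frame(
  title  = c("A", "B"),
  budget = c(1000, NA),
  length = c(90, 120),
  year   = c(2001, 2002),
  stringsAsFactors = FALSE
)

slim_cols <-
  toy_cols %>%
  select(title, length) %>%   # keep only these two columns
  rename(minutes = length)    # rename() uses new_name = old_name

slim_cols %>% names()
```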


## Smaller data sets

Often it is useful to work with a smaller subset of the data until we get our 
analysis methods figured out, and then to move to the full data.  This makes mistakes
happen more quickly ;-)

### Heads or Tails?

```{r, head-tail, exercise = TRUE}
Slim_Movies %>% head(3)       # first few rows
Slim_Movies %>% tail(3)       # last few rows
```

### Random sample

```{r, random, exercise = TRUE}
Slim_Movies %>% sample_n(3)          # 3 random rows
Slim_Movies %>% sample_frac(.00007)  # random fraction of rows
```


## filter() 

#### Exercise

Let's use only movies (cases) that have budget information and have shorter titles.

```{r, filter-budget, exercise = TRUE}
Slim_Movies <- 
  Slim_Movies %>% 
  filter(!is.na(budget), nchar(title) < 24)
```
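
Conditions separated by commas inside `filter()` must all hold, so a comma behaves like `&`. The sketch below checks this on a hypothetical three-row data frame (`toy_budget` is invented for illustration):

```{r}
library(dplyr)

# Hypothetical data, invented for illustration
toy_budget <- data.frame(
  title  = c("Short", "A very long and winding movie title", "Tiny"),
  budget = c(5000, 7000, NA),
  stringsAsFactors = FALSE
)

# Comma-separated conditions ...
with_commas <- toy_budget %>% filter(!is.na(budget), nchar(title) < 24)

# ... keep the same rows as a single condition joined with &
with_and <- toy_budget %>% filter(!is.na(budget) & nchar(title) < 24)

with_commas
```

Only the first row survives: the second has a title that is too long, and the third has no budget.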


## Creating new variables

#### Exercise

Use `mutate()` or `transmute()` to create a new variable.  How do the two differ?


```{r, mutate, exercise = TRUE}
Slim_Movies %>% 
  mutate(dpm = budget/length) %>%   # try transmute() here, too
  head(6)
```


## One-row summaries with summarise()

Note: `summarize()` also works for non-New Zealanders, but it can conflict with a function
of the same name in another package.

```{r, summarise, exercise = TRUE}
# number of movies (cases) in movie data
Movies %>% summarise(n())
```


```{r, summarise-2, exercise = TRUE}
# mean and total number of minutes of all the movies
Movies %>% 
  summarise(
    mean_length  = mean(length, na.rm = TRUE), 
    total_length =  sum(length, na.rm = TRUE)
    )
```


## group_by()

`group_by()` is what makes this system really hum.
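
To see what grouping changes, compare the same `summarise()` call with and without `group_by()`. This is a sketch on made-up data; `toy_ratings` and its columns are hypothetical.

```{r}
library(dplyr)

# Hypothetical data, invented for illustration
toy_ratings <- data.frame(
  category = c("G", "G", "PG", "PG", "PG"),
  stars    = c(6, 8, 5, 7, 6),
  stringsAsFactors = FALSE
)

# Without grouping: one row summarising the whole table
overall <- toy_ratings %>% summarise(avg_stars = mean(stars))

# With grouping: one row per category
by_category <-
  toy_ratings %>%
  group_by(category) %>%
  summarise(avg_stars = mean(stars))

overall
by_category
```

The summary function is the same in both cases; `group_by()` only changes which rows it is applied to.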


```{r, group-by, exercise = TRUE}
# mean length of movies in hours for all the movies, broken down by mpaa
Movies %>% 
  mutate(hours = length/60) %>% 
  group_by(mpaa) %>% 
  summarise(mean(hours, na.rm = TRUE))
```


## arrange() reorders the cases

```{r, arrange, exercise = TRUE}
# average length in hours, by mpaa rating, sorted by average length
Movies %>% 
  group_by(mpaa) %>% 
  mutate(hours = length/60) %>% 
  summarise(avg_length = mean(hours, na.rm = TRUE)) %>% 
  arrange(avg_length)
```


## Your Turn 

### A Smaller Data Set

When starting, it can be helpful to work with a small subset of the data.  When you have your data wrangling statements in working order, shift to the entire data table.

```{r, small-subset, exercise = TRUE}
RecentMovies <-
  Movies %>%
  filter(year > 2000) %>%
  filter(mpaa != "") 
```


You can use the smaller data set as you are figuring out the solutions to these 
exercises.  Once you have it sussed, you can switch to the full `Movies` data.

### Exercises

#### What is the average IMDB user rating?

```{r, rating, exercise = TRUE}
RecentMovies %>%
  summarise(avg = ????(rating)) 
```


#### What is the average IMDB user rating of movies for each mpaa category?

```{r, rating-by-cat, exercise = TRUE}
RecentMovies %>% 
  group_by(????) %>% 
  summarise(avg = ????(rating))
```

#### How many Action movies are there in each year?

```{r, action-movies, exercise = TRUE}
RecentMovies %>%
  group_by(????) %>%
  summarise(Actioncount = sum(????))
```

#### How many Comedies of each mpaa rating in each year?

```{r, comedies, exercise = TRUE}
RecentMovies %>%
  group_by(????, ????) %>%
  summarise(????)
```


#### Track the average IMDB ratings for movies with mpaa "R" over the years.

```{r r-movies, exercise = TRUE}
RMovies <-
  Movies %>%
  filter(mpaa == "R") %>%     # just the rated R movies
  group_by(year) %>%          # one group per year
  summarise(mean_user_rating = ????)

gf_line(mean_user_rating ~ year, data = RMovies)
```


## You are on your way to wrangling and transforming your data with ease

### Keep practicing, keep learning

<div align="center">
![ImpostR Syndrome](images/impostR.JPG)
</div>