Less Volume, More Creativity

Randall Pruim (Calvin College) + Nicholas Horton (Amherst College)
Albany Chapter shortcourse (May, 2015)

Project MOSAIC and the `mosaic` package

NSF-funded project to develop a new way to introduce mathematics, statistics, computation and modeling to students in colleges and universities.

  • more information at mosaic-web.org

  • the mosaic package is available via

    • CRAN
    • github
      • updates more frequently than CRAN
      • beta branch for developing new features

A note about the slides

The new support for document creation in RStudio is great.

  • These slides are HTML, but I created them in RMarkdown (+ a little bit of HTML fiddling)

  • Slides can also be created as PDF created using beamer – but again, you only need to write RMarkdown

  • A single RMarkdown file can generate PDF, HTML, or Word

    • no need to know HTML, LateX or Word
    • but if you do, you can take advantage

Less Volume, More Creativity

A lot of times you end up putting in a lot more volume, because you are teaching fundamentals and you are teaching concepts that you need to put in, but you may not necessarily use because they are building blocks for other concepts and variations that will come off of that … In the offseason you have a chance to take a step back and tailor it more specifically towards your team and towards your players.“

Mike McCarthy, Head Coach, Green Bay Packers

SIBKIS: See It Big, Keep It Simple

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.

— Antoine de Saint-Exupery (writer, poet, pioneering aviator)

Less Volume, More Creativity

One key to successfully introducing R is finding a set of commands that is

  • small: fewer is better
  • coherent: commands should be as similar as possible
  • powerful: can do what needs doing

It is not enough to use R, it must be used elegantly.

Two examples of this principle:

  • the mosaic package
  • the dplyr package (Hadley Wickham)

Minimal R

Goal: a minimal set of R commands for Intro Stats

Result: Minimal R Vignette

Much of the work on the mosaic package has been motivated by

  • The Less Volume, More Creativity approach
  • The Minimal R goal

A few little details

R is case sensitive

  • many students are not case sensitive

Arrows and Tab

  • up/down arrows scroll through history
  • TAB completion can simplify typing

If all else fails, try ESC

  • If you see a + prompt, it means R is waiting for more input
  • If this is unintentional, you probably have a typo
  • ESC will get you pack to the command prompt

The Most Important Template

 

goal ( yyy ~ xxx , data = mydata )

 

The Most Important Template

 

goal (  y  ~  x  , data = mydata )

The Most Important Template

 

goal (  y  ~  x  , data = mydata , …)

 

Other versions:

# simpler version
goal( ~ x, data = mydata )          
# fancier version
goal( y ~ x | z , data = mydata )   
# unified version
goal( formula , data = mydata )     

2 Questions

 

goal (  y  ~  x  , data = mydata )

 

What do you want R to do? (goal)

 

What must R know to do that?

2 Questions

 

goal (  y  ~  x  , data = mydata )

 

What do you want R to do? (goal)

  • This determines the function to use

What must R know to do that?

  • This determines the inputs to the function
  • Must identify the variables and data frame

How do we make this plot?

plot of chunk unnamed-chunk-4

How do we make this plot?

plot of chunk unnamed-chunk-5

What is the Goal?

What does R need to know?

How do we make this plot?

plot of chunk unnamed-chunk-6

What is the Goal?

  • a scatter plot

What does R need to know?

  • which variable goes where
  • which data set

How do we tell R to make this plot?

plot of chunk unnamed-chunk-7

What is the Goal?

  • a scatter plot (xyplot())

What does R need to know?

  • which variable goes where (births ~ date)
  • which data set (data=Births78)
    • use ?Births78 for documentation

How do we make this plot?

 

goal (  y  ~  x  , data = mydata )

xyplot( births ~ date, data=Births78) 

plot of chunk unnamed-chunk-8

Your turn: How do you make this plot?

plot of chunk unnamed-chunk-9

Two Questions?

Your turn: How do you make this plot?

plot of chunk unnamed-chunk-10

  1. Command: bwplot()

  2. The data: HELPrct

    • Variables: age, substance
    • use ?HELPrct for info about data

Your turn: How do you make this plot?

bwplot(age ~ substance, data=HELPrct)

plot of chunk unnamed-chunk-11

Your turn: How about this one?

plot of chunk unnamed-chunk-12 1. Command: bwplot()

  1. The data: HELPrct
    • Variables: age, substance
    • use ?HELPrct for info about data

Your turn: How about this one?

bwplot(substance ~ age, data=HELPrct)

plot of chunk unnamed-chunk-13

Graphical Summaries: One Variable

histogram(~ age, data=HELPrct) 

plot of chunk unnamed-chunk-14

Note: When there is one variable it is on the right side of the formula.

Graphical Summaries: Overview

One Variable

  histogram(~age, data=HELPrct) 
densityplot(~age, data=HELPrct) 
     bwplot(~age, data=HELPrct) 
     qqmath(~age, data=HELPrct) 
freqpolygon(~age, data=HELPrct) 
   bargraph(~sex, data=HELPrct)

Two Variables

xyplot(i1 ~ age,        data=HELPrct) 
bwplot(age ~ substance, data=HELPrct) 
bwplot(substance ~ age, data=HELPrct) 
  • i1 average number of drinks (standard units) consumed per day in the past 30 days (measured at baseline)

The Graphics Template

One variable

plotname ( ~  x  , data = mydata , …)

  • histogram(), qqmath(), densityplot(), freqpolygon(), bargraph()

 

Two Variables

plotname (  y  ~  x  , data = mydata , …)

  • xyplot(), bwplot()

Your turn

Create a plot of your own choosing with one of these data sets

require(mosaicData)
names(Utilities)   # utility bill data
?Utilities
require(NHANES)
names(NHANES)      # body shape, etc.
?NHANES

groups and panels

  • Add groups =group to overlay.
  • Use y ~ x | z to create multipanel plots.
densityplot( ~ age | sex, data=HELPrct,  
               groups=substance,  
               auto.key=TRUE)   

plot of chunk unnamed-chunk-19

Bells & Whistles

  • titles
  • axis labels
  • colors
  • sizes
  • transparency
  • etc, etc.

My approach:

  • Let the students ask or
  • Let the data analysis drive

Bells and Whistles

require(lubridate)
xyplot(births ~ date, data=Births78,  
  groups=wday(date, label=TRUE, abbr=TRUE), 
  type='l',
  auto.key=list(columns=4, lines=TRUE, points=FALSE),
  par.settings=list(
    superpose.line=list( lty=1 ) ))

plot of chunk unnamed-chunk-20

     July 1978
Su Mo Tu We Th Fr Sa
                   1
 2  3  4  5  6  7  8
 9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31

Numerical Summaries: One Variable

Big idea: Replace plot name with summary name

  • Nothing else changes
histogram(~ age, data=HELPrct)
     mean(~ age, data=HELPrct)
[1] 35.65342

plot of chunk unnamed-chunk-22

Other Summaries

The mosaic package includes formula aware versions of mean(), sd(), var(), min(), max(), sum(), IQR(), …

Also provides favstats() to compute our favorites.

favstats(~ age, data=HELPrct)
 min Q1 median Q3 max     mean       sd   n missing
  19 30     35 40  60 35.65342 7.710266 453       0

Tallying

tally(~ sex, data=HELPrct)

female   male 
   107    346 
tally(~ substance, data=HELPrct)

alcohol cocaine  heroin 
    177     152     124 

Numerical Summaries: Two Variables

Three ways to think about this. All do the same thing.

sd(  age ~ substance, data=HELPrct)
sd(~ age | substance, data=HELPrct)
sd(~ age, groups=substance, data=HELPrct)
 alcohol  cocaine   heroin 
7.652272 6.692881 7.986068 

Numerical Summaries: Tables

tally(sex ~ substance, data=HELPrct)
        substance
sex      alcohol cocaine heroin
  female      36      41     30
  male       141     111     94
tally(~ sex + substance, data=HELPrct)
        substance
sex      alcohol cocaine heroin
  female      36      41     30
  male       141     111     94

Numerical Summaries

mean(age ~ substance | sex, data=HELPrct)
     A.F      C.F      H.F      A.M      C.M      H.M        F        M 
39.16667 34.85366 34.66667 37.95035 34.36036 33.05319 36.25234 35.46821 
mean(age ~ substance | sex, data=HELPrct, 
      .format="table" )
  substance sex     mean
1         A   F 39.16667
2         A   M 37.95035
3         C   F 34.85366
4         C   M 34.36036
5         H   F 34.66667
6         H   M 33.05319
  • I've abbreviated the names to make things fit on slide
  • Also works for median(), min(), max(), sd(), var(), favstats(), etc.

One Template to Rule a Lot

  • single and multiple variable graphical summaries
  • single and multiple variabble numerical summaries
  • linear models
  mean(age ~ sex, data=HELPrct)
bwplot(age ~ sex, data=HELPrct) 
    lm(age ~ sex, data=HELPrct)
  female     male 
36.25234 35.46821 
(Intercept)     sexmale 
 36.2523364  -0.7841284 

Some other things

The mosaic package includes some other things, too

  • Data sets (you've already seen some of them)
  • xtras: xchisq.test(), xpnorm(), xqqmath()
  • mplot()
    • mplot(HELPrct) interactive plot creation
    • replacements for plot() in some situations
  • simplified histogram() controls (e.g., width)
  • simplified ways to add onto lattice plots

xpnorm()

xpnorm( 700, mean=500, sd=100)

If X ~ N(500,100), then 

    P(X <= 700) = P(Z <= 2) = 0.9772
    P(X >  700) = P(Z >  2) = 0.0228

plot of chunk unnamed-chunk-33

[1] 0.9772499

xpnorm()

xpnorm( c(300, 700), mean=500, sd=100)

If X ~ N(500,100), then 

    P(X <= 300) = P(Z <= -2) = 0.0228
    P(X <= 700) = P(Z <= 2) = 0.9772
    P(X >  300) = P(Z >  -2) = 0.9772
    P(X >  700) = P(Z >  2) = 0.0228

plot of chunk unnamed-chunk-34

[1] 0.02275013 0.97724987

xchisq.test()

xchisq.test(phs)

    Pearson's Chi-squared test with Yates' continuity correction

data:  x
X-squared = 24.4291, df = 1, p-value = 7.71e-07

   104.00   10933.00 
(  146.52) (10890.48)
[12.05]  [ 0.16] 
<-3.51>  < 0.41> 

   189.00   10845.00 
(  146.48) (10887.52)
[12.05]  [ 0.16] 
< 3.51>  <-0.41> 

key:
    observed
    (expected)
    [contribution to X-squared]
    <residual>

Modeling

Modeling is really the starting point for the mosaic design.

  • linear models (lm() and glm()) defined the template
  • lattice graphics use the template (so we chose lattice)
  • we added functionality so numerical summaries can be done with the same template
  • additional things added to make modeling easier for beginners.

Models as Functions

model <- lm(width ~ length * sex, 
            data=KidsFeet)
Width <- makeFun(model)
Width( length=25, sex="B")
       1 
9.167675 
Width( length=25, sex="G")
       1 
8.939312 

Models as Functions -- Plotting

xyplot( width ~ length, data=KidsFeet, 
        groups=sex, auto.key=TRUE )
plotFun( Width(length, sex="B") ~ length, 
         col=1, add=TRUE)
plotFun( Width(length, sex="G") ~ length, 
         col=2, add=TRUE)

plot of chunk unnamed-chunk-39

Resampling -- You can do() it!

Lady Tasting Tea

  • Often used on first day of class

  • Story

    • woman claims she can tell whether milk has been poured into tea or vice versa.
    • Question: How do we test this claim?

Computer Simulation

Use rflip() to simulate flipping coins

rflip()

Flipping 1 coin [ Prob(Heads) = 0.5 ] ...

H

Number of Heads: 1 [Proportion Heads: 1]

Computer Simulation

Faster if we flip multiple coins at once:

rflip(10)

Flipping 10 coins [ Prob(Heads) = 0.5 ] ...

H H H T T T H H H T

Number of Heads: 6 [Proportion Heads: 0.6]
  • easier to consider heads = correct; tails = incorrect than to compare with a given pattern
    • this switch bothers me more than it bothers my students

Let's do that a lot of times

rflip(10) simulates 1 lady tasting 10 cups 1 time.

We can do that many times to see how guessing ladies do:

do(2) * rflip(10)
   n heads tails prop
1 10     6     4  0.6
2 10     5     5  0.5
  • do() is clever about what it remembers
  • 2 isn't many – we'll do many next.

The distribution of guessing ladies

Ladies <- do(5000) * rflip(10)
head(Ladies, 1)
   n heads tails prop
1 10     4     6  0.4
histogram( ~ heads, data=Ladies, width=1 )

plot of chunk unnamed-chunk-46

How often does guessing score 9 or 10?

tally(~ (heads >= 9), data=Ladies)

 TRUE FALSE 
   52  4948 

How often does guessing score 9 or 10?

tally(~ (heads >= 9), data=Ladies, 
                       format="prop")

  TRUE  FALSE 
0.0104 0.9896 

How often does guessing score 9 or 10?

tally(~ (heads >= 9), data=Ladies, 
                       format="prop")

  TRUE  FALSE 
0.0104 0.9896 
 prop(~ (heads >= 9), data=Ladies)
  TRUE 
0.0104 

A general approach to randomization

  1. Do it for your data
  2. Do it for “random” data
  3. Do it lots of times for “random” data
  • definition of “random” is important, but can often be handled by shuffle() or resample()

Example: Do mean ages differ by sex?

diffmean(age ~ sex, data=HELPrct)
  diffmean 
-0.7841284 
do(1) * 
  diffmean(age ~ shuffle(sex), data=HELPrct)
   diffmean
1 0.1090973
Null <- do(5000) * 
  diffmean(age ~ shuffle(sex), data=HELPrct)

Example: Do mean ages differ by sex?

prop( ~(abs(diffmean) > 0.7841), data=Null )
  TRUE 
0.3616 
histogram(~ diffmean, data=Null, v=-.7841) 

plot of chunk unnamed-chunk-51

Example: Bootstrap for difference in means

Bootstrap <- do(5000) * 
  diffmean(age~sex, data= resample(HELPrct))

histogram(~ diffmean, data=Bootstrap, 
                      v=-.7841, glwd=4 )

plot of chunk unnamed-chunk-52

Example: CI for difference in mean ages

cdata(~ diffmean, data=Bootstrap, p=.95)
      low        hi central.p 
   -2.479     0.889     0.950 
confint(Bootstrap, method="quantile")
      name lower upper level   method
1 diffmean -2.48 0.889  0.95 quantile

Example: CI for difference in mean ages

confint(Bootstrap)  # default uses st. err
      name lower upper level method estimate margin.of.error
1 diffmean -2.43 0.889  0.95 stderr    -0.77            1.66

Randomization and linear models

do(1) * lm(width ~ length, data=KidsFeet)
  Intercept length sigma r.squared    F
1      2.86  0.248 0.396     0.411 25.8
do(3) * lm( width ~ shuffle(length), data=KidsFeet)
  Intercept  length sigma r.squared     F
1     11.64 -0.1071 0.496   0.07664 3.071
2     11.54 -0.1031 0.498   0.07110 2.832
3      9.75 -0.0308 0.515   0.00635 0.236

Randomization and linear models

do(1) * 
  lm(width ~ length + sex, data=KidsFeet)
  Intercept   length       sexG     sigma r.squared        F
1  3.641168 0.221025 -0.2325175 0.3848905 0.4595428 15.30513
do(3) * 
  lm(width ~ length + shuffle(sex), 
                       data=KidsFeet)
  Intercept    length        sexG     sigma r.squared        F
1  2.942851 0.2431545  0.07785347 0.3998086 0.4168355 12.86608
2  3.450856 0.2282462 -0.20833605 0.3878261 0.4512672 14.80285
3  2.842685 0.2484316  0.01565872 0.4017205 0.4112447 12.57297

Randomization and linear models

Null <- do(5000) * 
  lm(width ~ length + shuffle(sex), 
                       data=KidsFeet)
histogram(~ sexG, data=Null, 
           v=-0.2325, glwd=4)

plot of chunk unnamed-chunk-58

Randomization and linear models

histogram(~ sexG, data=Null, 
           v=-0.2325, glwd=4)

plot of chunk unnamed-chunk-59

prop(~ (sexG <= -0.2325), data=Null)
 TRUE 
0.037 

Next Up

  • R Markdown

  • Ideas for the First Day

  • Next Steps