Randall Pruim (Calvin College) + Nicholas Horton (Amherst College)
Albany Chapter shortcourse (May, 2015)
NSF-funded project to develop a new way to introduce mathematics, statistics, computation and modeling to students in colleges and universities.
more information at mosaic-web.org
the mosaic
package is available via
The new support for document creation in RStudio is great.
These slides are HTML, but I created them in RMarkdown (+ a little bit of HTML fiddling)
Slides can also be created as PDF created using beamer – but again, you only need to write RMarkdown
A single RMarkdown file can generate PDF, HTML, or Word
A lot of times you end up putting in a lot more volume, because you are teaching fundamentals and you are teaching concepts that you need to put in, but you may not necessarily use because they are building blocks for other concepts and variations that will come off of that … In the offseason you have a chance to take a step back and tailor it more specifically towards your team and towards your players.“
Mike McCarthy, Head Coach, Green Bay Packers |
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.
— Antoine de Saint-Exupery (writer, poet, pioneering aviator) |
One key to successfully introducing R is finding a set of commands that is
It is not enough to use R, it must be used elegantly.
Two examples of this principle:
Goal: a minimal set of R commands for Intro Stats
Result: Minimal R Vignette
Much of the work on the mosaic
package has been motivated
by
# simpler version
goal( ~ x, data = mydata )
# fancier version
goal( y ~ x | z , data = mydata )
# unified version
goal( formula , data = mydata )
xyplot()
)births ~ date
)data=Births78
)
?Births78
for documentation
xyplot( births ~ date, data=Births78)
Command: bwplot()
The data: HELPrct
age
, substance
?HELPrct
for info about databwplot(age ~ substance, data=HELPrct)
1. Command: bwplot()
HELPrct
age
, substance
?HELPrct
for info about databwplot(substance ~ age, data=HELPrct)
histogram(~ age, data=HELPrct)
Note: When there is one variable it is on the right side of the formula.
histogram(~age, data=HELPrct)
densityplot(~age, data=HELPrct)
bwplot(~age, data=HELPrct)
qqmath(~age, data=HELPrct)
freqpolygon(~age, data=HELPrct)
bargraph(~sex, data=HELPrct)
xyplot(i1 ~ age, data=HELPrct)
bwplot(age ~ substance, data=HELPrct)
bwplot(substance ~ age, data=HELPrct)
histogram()
, qqmath()
, densityplot()
, freqpolygon()
, bargraph()
xyplot()
, bwplot()
Create a plot of your own choosing with one of these data sets
require(mosaicData)
names(Utilities) # utility bill data
?Utilities
require(NHANES)
names(NHANES) # body shape, etc.
?NHANES
groups =
group to overlay.y ~ x | z
to create multipanel plots.densityplot( ~ age | sex, data=HELPrct,
groups=substance,
auto.key=TRUE)
My approach:
require(lubridate)
xyplot(births ~ date, data=Births78,
groups=wday(date, label=TRUE, abbr=TRUE),
type='l',
auto.key=list(columns=4, lines=TRUE, points=FALSE),
par.settings=list(
superpose.line=list( lty=1 ) ))
July 1978
Su Mo Tu We Th Fr Sa
1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31
Big idea: Replace plot name with summary name
histogram(~ age, data=HELPrct)
mean(~ age, data=HELPrct)
[1] 35.65342
The mosaic package includes formula aware versions of
mean()
,
sd()
,
var()
,
min()
,
max()
,
sum()
,
IQR()
, …
Also provides favstats()
to compute our favorites.
favstats(~ age, data=HELPrct)
min Q1 median Q3 max mean sd n missing
19 30 35 40 60 35.65342 7.710266 453 0
tally(~ sex, data=HELPrct)
female male
107 346
tally(~ substance, data=HELPrct)
alcohol cocaine heroin
177 152 124
Three ways to think about this. All do the same thing.
sd( age ~ substance, data=HELPrct)
sd(~ age | substance, data=HELPrct)
sd(~ age, groups=substance, data=HELPrct)
alcohol cocaine heroin
7.652272 6.692881 7.986068
tally(sex ~ substance, data=HELPrct)
substance
sex alcohol cocaine heroin
female 36 41 30
male 141 111 94
tally(~ sex + substance, data=HELPrct)
substance
sex alcohol cocaine heroin
female 36 41 30
male 141 111 94
mean(age ~ substance | sex, data=HELPrct)
A.F C.F H.F A.M C.M H.M F M
39.16667 34.85366 34.66667 37.95035 34.36036 33.05319 36.25234 35.46821
mean(age ~ substance | sex, data=HELPrct,
.format="table" )
substance sex mean
1 A F 39.16667
2 A M 37.95035
3 C F 34.85366
4 C M 34.36036
5 H F 34.66667
6 H M 33.05319
median()
, min()
, max()
, sd()
, var()
, favstats()
, etc. mean(age ~ sex, data=HELPrct)
bwplot(age ~ sex, data=HELPrct)
lm(age ~ sex, data=HELPrct)
female male
36.25234 35.46821
(Intercept) sexmale
36.2523364 -0.7841284
The mosaic
package includes some other things, too
xchisq.test()
, xpnorm()
, xqqmath()
mplot()
mplot(HELPrct)
interactive plot creationplot()
in some situationshistogram()
controls (e.g., width
)xpnorm( 700, mean=500, sd=100)
If X ~ N(500,100), then
P(X <= 700) = P(Z <= 2) = 0.9772
P(X > 700) = P(Z > 2) = 0.0228
[1] 0.9772499
xpnorm( c(300, 700), mean=500, sd=100)
If X ~ N(500,100), then
P(X <= 300) = P(Z <= -2) = 0.0228
P(X <= 700) = P(Z <= 2) = 0.9772
P(X > 300) = P(Z > -2) = 0.9772
P(X > 700) = P(Z > 2) = 0.0228
[1] 0.02275013 0.97724987
xchisq.test(phs)
Pearson's Chi-squared test with Yates' continuity correction
data: x
X-squared = 24.4291, df = 1, p-value = 7.71e-07
104.00 10933.00
( 146.52) (10890.48)
[12.05] [ 0.16]
<-3.51> < 0.41>
189.00 10845.00
( 146.48) (10887.52)
[12.05] [ 0.16]
< 3.51> <-0.41>
key:
observed
(expected)
[contribution to X-squared]
<residual>
Modeling is really the starting point for the mosaic
design.
lm()
and glm()
) defined the templatemodel <- lm(width ~ length * sex,
data=KidsFeet)
Width <- makeFun(model)
Width( length=25, sex="B")
1
9.167675
Width( length=25, sex="G")
1
8.939312
xyplot( width ~ length, data=KidsFeet,
groups=sex, auto.key=TRUE )
plotFun( Width(length, sex="B") ~ length,
col=1, add=TRUE)
plotFun( Width(length, sex="G") ~ length,
col=2, add=TRUE)
Often used on first day of class
Story
Use rflip()
to simulate flipping coins
rflip()
Flipping 1 coin [ Prob(Heads) = 0.5 ] ...
H
Number of Heads: 1 [Proportion Heads: 1]
Faster if we flip multiple coins at once:
rflip(10)
Flipping 10 coins [ Prob(Heads) = 0.5 ] ...
H H H T T T H H H T
Number of Heads: 6 [Proportion Heads: 0.6]
heads
= correct; tails
= incorrect than to compare with a given pattern
rflip(10)
simulates 1 lady tasting 10 cups 1 time.
We can do that many times to see how guessing ladies do:
do(2) * rflip(10)
n heads tails prop
1 10 6 4 0.6
2 10 5 5 0.5
do()
is clever about what it remembersLadies <- do(5000) * rflip(10)
head(Ladies, 1)
n heads tails prop
1 10 4 6 0.4
histogram( ~ heads, data=Ladies, width=1 )
tally(~ (heads >= 9), data=Ladies)
TRUE FALSE
52 4948
tally(~ (heads >= 9), data=Ladies,
format="prop")
TRUE FALSE
0.0104 0.9896
tally(~ (heads >= 9), data=Ladies,
format="prop")
TRUE FALSE
0.0104 0.9896
prop(~ (heads >= 9), data=Ladies)
TRUE
0.0104
shuffle()
or resample()
diffmean(age ~ sex, data=HELPrct)
diffmean
-0.7841284
do(1) *
diffmean(age ~ shuffle(sex), data=HELPrct)
diffmean
1 0.1090973
Null <- do(5000) *
diffmean(age ~ shuffle(sex), data=HELPrct)
prop( ~(abs(diffmean) > 0.7841), data=Null )
TRUE
0.3616
histogram(~ diffmean, data=Null, v=-.7841)
Bootstrap <- do(5000) *
diffmean(age~sex, data= resample(HELPrct))
histogram(~ diffmean, data=Bootstrap,
v=-.7841, glwd=4 )
cdata(~ diffmean, data=Bootstrap, p=.95)
low hi central.p
-2.479 0.889 0.950
confint(Bootstrap, method="quantile")
name lower upper level method
1 diffmean -2.48 0.889 0.95 quantile
confint(Bootstrap) # default uses st. err
name lower upper level method estimate margin.of.error
1 diffmean -2.43 0.889 0.95 stderr -0.77 1.66
do(1) * lm(width ~ length, data=KidsFeet)
Intercept length sigma r.squared F
1 2.86 0.248 0.396 0.411 25.8
do(3) * lm( width ~ shuffle(length), data=KidsFeet)
Intercept length sigma r.squared F
1 11.64 -0.1071 0.496 0.07664 3.071
2 11.54 -0.1031 0.498 0.07110 2.832
3 9.75 -0.0308 0.515 0.00635 0.236
do(1) *
lm(width ~ length + sex, data=KidsFeet)
Intercept length sexG sigma r.squared F
1 3.641168 0.221025 -0.2325175 0.3848905 0.4595428 15.30513
do(3) *
lm(width ~ length + shuffle(sex),
data=KidsFeet)
Intercept length sexG sigma r.squared F
1 2.942851 0.2431545 0.07785347 0.3998086 0.4168355 12.86608
2 3.450856 0.2282462 -0.20833605 0.3878261 0.4512672 14.80285
3 2.842685 0.2484316 0.01565872 0.4017205 0.4112447 12.57297
Null <- do(5000) *
lm(width ~ length + shuffle(sex),
data=KidsFeet)
histogram(~ sexG, data=Null,
v=-0.2325, glwd=4)
histogram(~ sexG, data=Null,
v=-0.2325, glwd=4)
prop(~ (sexG <= -0.2325), data=Null)
TRUE
0.037
R Markdown
Ideas for the First Day
Next Steps