Building precursors to data science
Slides and papers
- "Setting the stage for data science: integration of data management skills in introductory and second courses in statistics", Nicholas J. Horton, Benjamin S. Baumer, and Hadley Wickham (CHANCE, 2015), 28(2):40-50, http://arxiv.org/abs/1502.00318, full-text
- "Teaching precursors to data science in introductory and second courses in statistics", Nicholas J. Horton, Benjamin S. Baumer, and Hadley Wickham (2014), http://arxiv.org/abs/1401.3269 plus slides
- "R Markdown: integrating a reproducible analysis tool into introductory statistics", Benjamin S. Baumer, Mine Cetinkaya-Rundel, Andrew Bray, Linda Loi, and Nicholas J. Horton (TISE, 2014), http://arxiv.org/abs/1402.1894
- "Data science in the statistics curricula: preparing students to 'think with data'", Johanna Hardin, Roger Hoerl, Nicholas J. Horton, and Deborah Nolan (2014), http://arxiv.org/abs/1410.3127, syllabi, activities, and related resources
- "Challenges and opportunities for statistics and statistical education: looking back, looking forward", Nicholas J. Horton (The American Statistician, 2015), http://arxiv.org/abs/1503.02188
- Data wrangling, visualization, R Markdown, and Shiny cheat sheets
- Visualizing data manipulation operations (Shiny)
- Second edition of Using R for Data Management, Statistical Analysis, and Graphics, Nicholas J. Horton and Ken Kleinman (2015)
- Slides and recording from February 24, 2015 CAUSE webinar
Airline delays examples
- Stats 101 example files and datasets
- Airline delays example files (nycflights13.pdf, nycflights13.Rmd) using the nycflights13 package in R
  - update R and RStudio to recent versions
  - run update.packages()
  - run install.packages(c("mosaic", "nycflights13"))
  - run download.file("http://www.amherst.edu/~nhorton/precursors/nycflights13.Rmd", "nycflights13.Rmd")
- Airline delays SQL intro slides
- Airline delays example files using a small SQLite database (2014 only) with RSQLite and dplyr
  - update R and RStudio to recent versions
  - run update.packages()
  - run install.packages(c("RSQLite", "dplyr", "tidyr", "mosaic", "knitr", "nycflights13", "lubridate", "igraph", "markdown", "maps", "readr"))
  - download the following files: load-sqlite.R, test-sqlite.Rmd, airlines.csv, airplanes.csv, airports.csv, 2014.csv.bz2 (95MB)
  - set up a new project in RStudio specifying the directory/folder that contains the files that you downloaded
  - source the script file load-sqlite.R. This should create the database (called ontime.sqlite3) and display information about three airports.
  - test the setup by knitting the Markdown file test-sqlite.Rmd in the same directory where you saved the database (this should generate test-sqlite.pdf as output)
- Airline delays example files using a large SQLite database (precursors-sqlite.pdf, precursors-sqlite.Rmd; ran in 30-150 seconds with indices, approximately 1,000 seconds without)
  - find a machine with fast internet and lots of disk space (approximately 50GB needed)
  - run install.packages(c("RSQLite", "dplyr", "tidyr", "mosaic", "knitr", "nycflights13", "lubridate", "igraph", "markdown", "maps", "readr"))
  - download data for 1987-2008 plus supplemental data sources (airlines, airports, airplanes) from the Data Expo 2009 website
  - download data from 2009 to today using the following scripts: 1-download.r and 2-reduce.r
  - download the following files: load-sqlite-all.R, airlines.csv, airplanes.csv, and airports.csv
  - set up a new project in RStudio specifying the directory/folder that contains the files that you downloaded
  - set up the database by sourcing the script file load-sqlite-all.R
- Database vignette from the dplyr package in R
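
The run times quoted for the large database (30-150 seconds with indices, roughly 1,000 without) come down to whether SQLite can look rows up through an index or has to scan the whole table. The following is a minimal sketch of that idea, not the contents of load-sqlite.R or load-sqlite-all.R: it uses an in-memory database, made-up flight records, and assumed column names (year, carrier, origin, dep_delay) and an assumed index name (idx_origin_year).

```r
library(DBI)

# Made-up flight records; the column names are assumptions for illustration,
# not the schema created by the load scripts.
set.seed(1)
n <- 20000
flights <- data.frame(
  year      = sample(2009:2014, n, replace = TRUE),
  carrier   = sample(c("AA", "DL", "UA", "WN"), n, replace = TRUE),
  origin    = sample(c("BDL", "BOS", "JFK", "ORD"), n, replace = TRUE),
  dep_delay = round(rexp(n, rate = 1 / 15)),
  stringsAsFactors = FALSE
)

# An in-memory database stands in for the on-disk SQLite file.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "flights", flights)

query <- "SELECT carrier, COUNT(*) AS n_flights, AVG(dep_delay) AS mean_delay
          FROM flights
          WHERE origin = 'BDL' AND year = 2014
          GROUP BY carrier
          ORDER BY carrier"

# Run the query once with no index in place (full table scan).
before <- dbGetQuery(con, query)

# Index the columns used in the WHERE clause (hypothetical index name).
dbExecute(con, "CREATE INDEX idx_origin_year ON flights (origin, year)")

# Ask SQLite how it now plans to run the query, then run it again.
plan  <- dbGetQuery(con, paste("EXPLAIN QUERY PLAN", query))
after <- dbGetQuery(con, query)

print(after)
print(plan$detail)

dbDisconnect(con)
```

The two result sets should match; only the access path changes. On a table this small the timing difference is negligible, but the same pattern applies to the multi-year database, where a full scan touches far more rows.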
Rail trails example
- "Rail trails and property values: Is there an association?", Nicholas J. Horton and Ella Hartenian (Journal of Statistics Education, 2015), http://www.amstat.org/publications/jse/v23n2/horton.pdf (tall.csv, JSE13-070R2.csv, wide.csv, documentation.docx, railtrails.Rmd)
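
The file names tall.csv and wide.csv suggest the same data stored in two layouts. As a hedged illustration of moving between such layouts, the sketch below reshapes a small invented data frame from wide to tall using tidyr. The column names (home_id, value_2007, value_2014) and the values are hypothetical and are not taken from the rail trails files.

```r
library(tidyr)

# Hypothetical wide layout: one row per home, one column per year's value.
wide <- data.frame(
  home_id    = 1:3,
  value_2007 = c(250, 310, 198),
  value_2014 = c(265, 305, 210)
)

# Tall layout: one row per home-year combination.
tall <- pivot_longer(
  wide,
  cols         = starts_with("value_"),
  names_to     = "year",
  names_prefix = "value_",
  values_to    = "value"
)

# The year was part of a column name, so it arrives as text; convert it.
tall$year <- as.integer(tall$year)

print(tall)
```

The wide form is convenient for comparing two years within a home, while the tall form is usually easier to plot or model by year.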
Mere renovation is not enough
Cobb (2015) paper plus discussion
Funding
Partial support for this work was provided by the National Science Foundation DUE 0920350 (Project MOSAIC).
Nicholas Horton
Department of Mathematics and Statistics
Amherst College
AC#2239
PO Box 5000
Amherst, MA 01002-5000
413-542-5655 (voice)