1 Introduction

Load packages:

library(tidyverse)
library(ggplot2) # superfluous because ggplot2 is part of tidyverse

library(haven)
library(labelled)

Resources used to create this lecture:

1.1 Datasets we will use

We will use two datasets that are part of the ggplot2 package:

  • mpg: EPA fuel economy data in 1999 and 2008 for 38 car models that had a new release every year between 1999 and 2008
    • Note: There is no set of variables that uniquely identifies observations
  • diamonds: Prices and attributes of about 54,000 diamonds
#?mpg
glimpse(mpg)
## Observations: 234
## Variables: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "au…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quatt…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 199…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)",…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17,…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25,…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "…
## $ class        <chr> "compact", "compact", "compact", "compact", "compac…
#?diamonds
glimpse(diamonds)
## Observations: 53,940
## Variables: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Goo…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1,…
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, …
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 3…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.…
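
The note about mpg can be checked directly. The sketch below defines a small helper, is_unique_key() (a hypothetical name used only for illustration, not a dplyr or ggplot2 function), that reports whether a set of variables uniquely identifies the rows of a data frame. The toy data frame is made up for the example.

```r
library(dplyr)
library(ggplot2) # provides the mpg dataset

# Hypothetical helper: TRUE if the named variables jointly identify each row
is_unique_key <- function(df, vars) {
  counts <- count(df, across(all_of(vars)))
  all(counts$n == 1)
}

# Even the full set of variables does not uniquely identify rows of mpg
is_unique_key(mpg, names(mpg))

# A made-up data frame in which id is a key but value is not
toy <- tibble(id = 1:3, value = c("a", "a", "b"))
is_unique_key(toy, "id")
is_unique_key(toy, "value")
```

The same helper could be applied to a dataset with a known identifier to confirm that the identifier really is unique.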

We will also use public-use data from the National Center for Education Statistics (NCES) Education Longitudinal Study of 2002 (ELS):

  • Follows 10th graders from 2002 until 2012
  • Variable stu_id uniquely identifies observations
# variables we want to select from full ELS dataset
els_keepvars <- c(
    "STU_ID",        # student id
    "STRAT_ID",      # stratum id
    "PSU",           # primary sampling unit
    "BYRACE",        # (base year) race/ethnicity 
    "BYINCOME",      # (base year) parental income
    "BYPARED",       # (base year) parental education
    "BYNELS2M",      # (base year) math score
    "BYNELS2R",      # (base year) reading score
    "F3ATTAINMENT",  # (3rd follow up) attainment
    "F2PS1SEC",      # (2nd follow up) first institution attended
    "F3ERN2011",     # (3rd follow up) earnings from employment in 2011
    "F1SEX",         # (1st follow up) sex composite
    "F2EVRATT",      # (2nd follow up, composite) ever attended college
    "F2PS1LVL",      # (2nd follow up, composite) first attended postsecondary institution, level 
    "F2PS1CTR",      # (2nd follow up, composite) first attended postsecondary institution, control
    "F2PS1SLC"       # (2nd follow up, composite) first attended postsecondary institution, selectivity
)
els_keepvars
##  [1] "STU_ID"       "STRAT_ID"     "PSU"          "BYRACE"      
##  [5] "BYINCOME"     "BYPARED"      "BYNELS2M"     "BYNELS2R"    
##  [9] "F3ATTAINMENT" "F2PS1SEC"     "F3ERN2011"    "F1SEX"       
## [13] "F2EVRATT"     "F2PS1LVL"     "F2PS1CTR"     "F2PS1SLC"
load(url("https://github.com/anyone-can-cook/rclass2/raw/master/data/els/els.RData"))

els <- els %>%
  # keep only subset of vars (all_of() supersedes one_of())
  select(all_of(els_keepvars)) %>%
  # convert variable names to lowercase (rename_with() supersedes rename_all())
  rename_with(tolower)

glimpse(els)
## Observations: 16,197
## Variables: 16
## $ stu_id       <dbl> 101101, 101102, 101104, 101105, 101106, 101107, 101…
## $ strat_id     <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 1…
## $ psu          <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ byrace       <dbl+lbl> 5, 2, 7, 3, 4, 4, 4, 7, 4, 3, 3, 4, 3, 2, 2, 3,…
## $ byincome     <dbl+lbl> 10, 11, 10, 2, 6, 9, 10, 10, 8, 3, 8, 8, 5, 8, …
## $ bypared      <dbl+lbl> 5, 5, 2, 2, 1, 2, 6, 2, 2, 1, 6, 4, 4, 2, 7, 2,…
## $ bynels2m     <dbl+lbl> 47.84, 55.30, 66.24, 35.33, 29.97, 24.28, 45.16…
## $ bynels2r     <dbl+lbl> 39.04, 36.35, 42.68, 27.86, 13.07, 11.70, 19.66…
## $ f3attainment <dbl+lbl> 3, 10, 6, 4, 4, 3, 4, 6, -4, 3, 3, 3, 5, 5, 6, …
## $ f2ps1sec     <dbl+lbl> -8, 1, 1, 4, 4, -3, 4, 2, -4, 4, 1, -4, -4, 4, …
## $ f3ern2011    <dbl+lbl> 4000, 3000, 37000, 1500, 48000, 35000, 17000, 6…
## $ f1sex        <dbl+lbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1,…
## $ f2evratt     <dbl+lbl> -8, 1, 1, 1, 1, 0, 1, 1, -4, 1, 1, -4, -4, 1, 1…
## $ f2ps1lvl     <dbl+lbl> -8, 1, 1, 2, 2, -3, 2, 1, -4, 2, 1, -4, -4, 2, …
## $ f2ps1ctr     <dbl+lbl> -8, 1, 1, 1, 1, -3, 1, 2, -4, 1, 1, -4, -4, 1, …
## $ f2ps1slc     <dbl+lbl> -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5,…
els %>% var_label()
## $stu_id
## [1] "Student ID"
## 
## $strat_id
## [1] "Stratum"
## 
## $psu
## [1] "Primary sampling unit"
## 
## $byrace
## [1] "Student's race/ethnicity-composite"
## 
## $byincome
## [1] "Total family income from all sources 2001-composite"
## 
## $bypared
## [1] "Parents' highest level of education"
## 
## $bynels2m
## [1] "ELS-NELS 1992 scale equated sophomore math score"
## 
## $bynels2r
## [1] "ELS-NELS 1992 scale equated sophomore reading score"
## 
## $f3attainment
## [1] "Highest level of education earned as of F3"
## 
## $f2ps1sec
## [1] "Sector of first postsecondary institution"
## 
## $f3ern2011
## [1] "2011 employment income:  R only"
## 
## $f1sex
## [1] "F1 sex-composite"
## 
## $f2evratt
## [1] "Whether has ever attended a postsecondary institution - composite"
## 
## $f2ps1lvl
## [1] "Level of offering of first postsecondary institution"
## 
## $f2ps1ctr
## [1] "Control of first postsecondary institution"
## 
## $f2ps1slc
## [1] "Institutional selectivity of first attended postsecondary institution"

2 Concepts

Basic definitions:

  • Grammar
    • “The fundamental principles or rules of an art or science” (Oxford English Dictionary)
  • Grammar of graphics (Wilkinson, 1999)
    • Principles/rules to describe and construct statistical graphics
  • Layered grammar of graphics (Wickham, 2010)
    • Principles/rules to describe and construct statistical graphics “based around the idea of building up a graphic from multiple layers of data” (Wickham, 2010, p. 4)
    • The layered grammar of graphics is a “formal system for building plots… based on the insight that you can uniquely describe any plot as a combination of” seven parameters (Wickham & Grolemund, 2017, Chapter 3)
  • Aesthetics
    • Aesthetics are visual elements of the plot (e.g., lines, points, symbols, colors, axes)
    • Aesthetic mappings are visual elements of the plot determined by values of specific variables (e.g., a scatterplot where the color of each point depends on the value of the variable race)
    • However, aesthetics need not be determined by variable values. For example, when creating a scatterplot you may specify that the color of each point be blue.
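
The difference between mapping an aesthetic to a variable and setting it to a constant can be sketched with the mpg dataset (used here because it ships with ggplot2; the variable class stands in for the race example above). The object names p_mapped and p_fixed are arbitrary.

```r
library(ggplot2)

# Aesthetic mapping: color depends on the value of the variable class
p_mapped <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Aesthetic set to a constant: every point is blue, regardless of the data
p_fixed <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

p_mapped
p_fixed
```

Note that the mapped aesthetic goes inside aes(), while the constant is passed as an ordinary argument to the geom function outside aes().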

The seven parameters of the layered grammar of graphics consist of:

  • Five layers
    • A dataset (data)
    • A set of aesthetic mappings (mappings)
    • A statistical transformation (stat)
    • A geometric object (geom)
    • A position adjustment (position)
  • A coordinate system (coord)
  • A faceting scheme (facets)

ggplot2 – part of tidyverse – is an R package to create graphics and ggplot() is a function within the ggplot2 package.

“In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.” (Wickham & Grolemund, 2017, Chapter 3)

Syntax conveying the seven parameters of the layered grammar of graphics:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + 
  <GEOM_FUNCTION>(
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
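
As one possible way to fill in the template, the sketch below spells out all seven parameters for a bar chart of the mpg dataset. The choice of variables (class, drv, year) and the object name p_full are illustrative assumptions, not part of the template itself.

```r
library(ggplot2)

p_full <- ggplot(data = mpg, mapping = aes(x = class, fill = drv)) + # data, mappings
  geom_bar(               # geom
    stat = "count",       # statistical transformation
    position = "stack"    # position adjustment
  ) +
  coord_cartesian() +     # coordinate system
  facet_wrap(~ year)      # faceting scheme

p_full
```

Because geom_bar() already defaults to stat = "count" and position = "stack", and the Cartesian coordinate system is the default, only the data, the mappings, the geom function, and the optional facet call would normally need to be written out.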

2.1 Layers

What does Wickham mean by layers? (from “Telling Stories with Data Using the Grammar of Graphics” by Liz Sander)

  • In the grammar of a language, words have different parts of speech (e.g., noun, verb, adjective), with each part of speech performing a different role in a sentence
  • The layered grammar of graphics decomposes a graphic into different layers
    • “These are layers in a literal sense – you can think of them as transparency sheets for an overhead projector, each containing a piece of the graphic, which can be arranged and combined in a variety of ways.”

The five layers of the grammar of graphics:

2.1.1 Dataset (data)

Data defines the information to be visualized.

Example: Imagine a dataset where each observation is a student

  • The variables of interest are high school math test score (bynels2m), earnings in 2011 (f3ern2011), and student sex (f1sex)
glimpse(els)
## Observations: 16,197
## Variables: 16
## $ stu_id       <dbl> 101101, 101102, 101104, 101105, 101106, 101107, 101…
## $ strat_id     <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 1…
## $ psu          <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ byrace       <dbl+lbl> 5, 2, 7, 3, 4, 4, 4, 7, 4, 3, 3, 4, 3, 2, 2, 3,…
## $ byincome     <dbl+lbl> 10, 11, 10, 2, 6, 9, 10, 10, 8, 3, 8, 8, 5, 8, …
## $ bypared      <dbl+lbl> 5, 5, 2, 2, 1, 2, 6, 2, 2, 1, 6, 4, 4, 2, 7, 2,…
## $ bynels2m     <dbl+lbl> 47.84, 55.30, 66.24, 35.33, 29.97, 24.28, 45.16…
## $ bynels2r     <dbl+lbl> 39.04, 36.35, 42.68, 27.86, 13.07, 11.70, 19.66…
## $ f3attainment <dbl+lbl> 3, 10, 6, 4, 4, 3, 4, 6, -4, 3, 3, 3, 5, 5, 6, …
## $ f2ps1sec     <dbl+lbl> -8, 1, 1, 4, 4, -3, 4, 2, -4, 4, 1, -4, -4, 4, …
## $ f3ern2011    <dbl+lbl> 4000, 3000, 37000, 1500, 48000, 35000, 17000, 6…
## $ f1sex        <dbl+lbl> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1,…
## $ f2evratt     <dbl+lbl> -8, 1, 1, 1, 1, 0, 1, 1, -4, 1, 1, -4, -4, 1, 1…
## $ f2ps1lvl     <dbl+lbl> -8, 1, 1, 2, 2, -3, 2, 1, -4, 2, 1, -4, -4, 2, …
## $ f2ps1ctr     <dbl+lbl> -8, 1, 1, 1, 1, -3, 1, 2, -4, 1, 1, -4, -4, 1, …
## $ f2ps1slc     <dbl+lbl> -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5,…
els %>% select(stu_id,bynels2m,f3ern2011,f1sex) %>% as_factor() %>% head(10)
## # A tibble: 10 x 4
##    stu_id bynels2m f3ern2011     f1sex 
##     <dbl> <fct>    <fct>         <fct> 
##  1 101101 47.84    4000          Female
##  2 101102 55.3     3000          Female
##  3 101104 66.24    37000         Female
##  4 101105 35.33    1500          Female
##  5 101106 29.97    48000         Female
##  6 101107 24.28    35000         Male  
##  7 101108 45.16    17000         Male  
##  8 101109 66.01    68000         Male  
##  9 101110 28.28    Nonrespondent Male  
## 10 101111 38.85    42000         Male

2.1.2 Set of mappings (mappings)

Mapping defines how variables in a dataset are applied (mapped) to a graphic.

Example: Consider the previous dataset

  • Map HS math test score to the x-axis
  • Map 2011 income to the y-axis
  • Additionally, if we are creating a scatterplot of test score (x-axis) and income (y-axis), we might use sex to define the color of each point
els %>% select(stu_id,bynels2m,f3ern2011,f1sex) %>% 
  rename(x=bynels2m, y=f3ern2011, color=f1sex) %>% 
  as_factor() %>% head(10)
## # A tibble: 10 x 4
##    stu_id x     y             color 
##     <dbl> <fct> <fct>         <fct> 
##  1 101101 47.84 4000          Female
##  2 101102 55.3  3000          Female
##  3 101104 66.24 37000         Female
##  4 101105 35.33 1500          Female
##  5 101106 29.97 48000         Female
##  6 101107 24.28 35000         Male  
##  7 101108 45.16 17000         Male  
##  8 101109 66.01 68000         Male  
##  9 101110 28.28 Nonrespondent Male  
## 10 101111 38.85 42000         Male
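
In ggplot2 these mappings are written with aes(). The sketch below is self-contained: rather than loading the full ELS file, it types in the nine responding students from the output above (the nonrespondent in row 9 is left out). The name els_small is hypothetical and used only here.

```r
library(ggplot2)

# Nine students copied by hand from the printed rows above
els_small <- data.frame(
  bynels2m  = c(47.84, 55.30, 66.24, 35.33, 29.97, 24.28, 45.16, 66.01, 38.85),
  f3ern2011 = c(4000, 3000, 37000, 1500, 48000, 35000, 17000, 68000, 42000),
  f1sex     = c("Female", "Female", "Female", "Female", "Female",
                "Male", "Male", "Male", "Male")
)

# Map math score to x, 2011 earnings to y, and sex to the color of each point
p_map <- ggplot(data = els_small,
                mapping = aes(x = bynels2m, y = f3ern2011, color = f1sex)) +
  geom_point()

p_map
```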

2.1.3 Statistical transformation (stat)

A statistical transformation transforms the underlying data before plotting it.

Example: Imagine creating a scatterplot of the relationship between HS math test score (x-axis) and 2011 income (y-axis)

  • When creating a scatterplot we usually do not transform the data prior to plotting
  • This is the “identity” transformation
els %>% select(stu_id,bynels2m,f3ern2011) %>% rename(x=bynels2m, y=f3ern2011) %>% 
  as_factor() %>% head(10)
## # A tibble: 10 x 3
##    stu_id x     y            
##     <dbl> <fct> <fct>        
##  1 101101 47.84 4000         
##  2 101102 55.3  3000         
##  3 101104 66.24 37000        
##  4 101105 35.33 1500         
##  5 101106 29.97 48000        
##  6 101107 24.28 35000        
##  7 101108 45.16 17000        
##  8 101109 66.01 68000        
##  9 101110 28.28 Nonrespondent
## 10 101111 38.85 42000

Example: Imagine creating a bar chart of the number of students by race/ethnicity

  • Here, we do not plot the raw data. Rather, we count the number of observations for each race/ethnicity category.
  • This count is a statistical transformation
els %>% count(byrace) %>% as_factor()
## # A tibble: 9 x 2
##   byrace                                       n
##   <fct>                                    <int>
## 1 Survey component legitimate skip/NA        305
## 2 Nonrespondent                              648
## 3 Amer. Indian/Alaska Native, non-Hispanic   130
## 4 Asian, Hawaii/Pac. Islander,non-Hispanic  1460
## 5 Black or African American, non-Hispanic   2020
## 6 Hispanic, no race specified                996
## 7 Hispanic, race specified                  1221
## 8 More than one race, non-Hispanic           735
## 9 White, non-Hispanic                       8682
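
The two kinds of transformation can be compared side by side. The sketch below uses the variable drv from mpg instead of byrace so that it runs without downloading the ELS data; the object names are arbitrary.

```r
library(ggplot2)
library(dplyr)

# stat = "count": ggplot2 counts the rows in each drv category before drawing bars
p_count <- ggplot(data = mpg, mapping = aes(x = drv)) +
  geom_bar(stat = "count")

# Same chart from pre-computed counts: stat = "identity" plots the values as given
drv_counts <- mpg %>% count(drv)
p_identity <- ggplot(data = drv_counts, mapping = aes(x = drv, y = n)) +
  geom_bar(stat = "identity")

p_count
p_identity
```

Both plots should show the same bar heights; the only difference is whether the counting is done by ggplot2 or beforehand with count().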

2.1.4 Geometric objects (geoms)

Graphs visually display data, using geometric objects like a point, line, bar, etc.

  • Each geometric object in a graph is called a “geom”
  • “A geom is the geometrical object that a plot uses to represent data” (Wickham & Grolemund, 2017, Chapter 3)
  • “People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms” (Wickham & Grolemund, 2017, Chapter 3)
  • Aesthetics are “visual attributes of the geom” (e.g., color, fill, shape, position) (Grammar of Graphics)
    • Each geom can only display certain aesthetics
    • For example, a “point geom” understands the aesthetics position (x, y), color, fill, shape, size, stroke, and alpha, but not an aesthetic such as linetype
  • We can plot the same underlying data using different geoms (e.g., bar chart vs. pie chart)
  • A single graph can layer multiple geoms (e.g., scatterplot with a “line of best fit” layered on top)
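
Layering geoms amounts to adding more than one geom function to the same plot. A minimal sketch with mpg, assuming a linear fit is an acceptable "line of best fit" (the object names are arbitrary):

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# One geom: a scatterplot
p_points <- base + geom_point()

# Two geoms layered: scatterplot plus a linear line of best fit
p_layered <- base +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_points
p_layered
```

Both geoms inherit the data and the aesthetic mappings given in ggplot(), which is why they do not need to be repeated in each layer.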

2.1.5 Position adjustment (position)

A position adjustment shifts visual elements in the plot so that they do not overlap one another in ways that make the plot difficult to interpret.

Example: The dataset mpg (included in the ggplot2 package) contains variables for the specifications of different cars, with 234 observations

  • Create a scatterplot of the relationship between number of cylinders in the engine (x-axis) and highway miles-per-gallon (y-axis)
  • The plot below is difficult to interpret because many points overlap with one another
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
  geom_point()

  • The jitter position adjustment “adds a small amount of random variation to the location of each point” (from ?geom_jitter)
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
  geom_point(position = "jitter")
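
Because jitter is random, the plot changes slightly every time it is drawn. If reproducibility matters, one option is position_jitter(), which accepts the amount of jitter and a seed. The width, height, and seed values below are arbitrary choices for illustration.

```r
library(ggplot2)

# Jitter horizontally by at most 0.2, leave hwy untouched, and fix the seed
p_jitter <- ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))

p_jitter
```

Setting height = 0 keeps the plotted highway mileage exact, so only the cylinder axis is perturbed.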

2.2 Coordinate system (coord)

“A coordinate system maps the position of objects onto the plane of the plot, and controls how the axes and grid lines are drawn. Plots typically use two coordinates (x,y), but could use any number of coordinates.” (Grammar of Graphics)

Example: Cartesian coordinate system

  • Most plots use the Cartesian coordinate system
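x1 <- c(1, 10)
y1 <- c(1, 5)
# qplot() is deprecated as of ggplot2 3.4.0, so build the blank plot with ggplot()
p <- ggplot(data = data.frame(x1, y1), mapping = aes(x = x1, y = y1)) +
  geom_blank() +
  labs(x = NULL, y = NULL) +
  theme_bw()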

p +
  ggtitle(label = "Cartesian coordinate system")