Joining data

1 Introduction

Load packages:

library(tidyverse)
library(haven)
library(labelled)
library(kableExtra)

If a package is not yet installed, you must install it before you can load it. Install in the console rather than in the .Rmd file:

  • Generic syntax: install.packages("package")
  • Install “tidyverse”: install.packages("tidyverse")

Note: When we install a package, the package name is in quotes; when we load a package, the name is not in quotes:

  • install.packages("tidyverse")
  • library(tidyverse)
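
If you want to guard against installing a package that is already present, one common pattern looks like this (a sketch; `pkg` is just an illustrative variable, and as noted above, installs belong in the console):

```r
# Install a package only if it is not already present, then load it
# (pkg is just an illustrative variable; run installs in the console)
pkg <- "tidyverse"
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}
library(pkg, character.only = TRUE)
```

Note that `library()` needs `character.only = TRUE` when the package name is stored in a variable rather than typed bare.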

1.1 Datasets we will use: NLS72 data

We will use data from the National Longitudinal Study of the High School Class of 1972 (NLS72), a nationally representative longitudinal study of 12th graders in 1972. This includes the postsecondary education transcript files of the NLS72, which contain information on transcripts from NLS72 senior cohort members who reported attending a postsecondary institution after high school. We will work with three of these datasets:

  • Student level data
  • Student-transcript level data
  • Student-transcript-term level data

These datasets are good for teaching “joining” because you get practice joining data sources with different “observational levels” (e.g., join student level and student-transcript level data).

# NLS student level data, containing variables about student demographics
nls_stu <- read_dta('https://github.com/anyone-can-cook/rclass1/raw/master/data/nls72/nls72stu_percontor_vars.dta') %>%
  select(id, schcode, bysex, csex, crace, cbirthm, cbirthd, cbirthyr)

# NLS student level data, containing variables about postsecondary education transcripts (PETS)
nls_stu_pets <- read_dta('https://github.com/ozanj/rclass/raw/master/data/nls72/nls72petsstu_v2.dta') %>%
  select(id, reqtrans, numtrans)

# NLS student-transcript level data
nls_tran <- read_dta('https://github.com/ozanj/rclass/raw/master/data/nls72/nls72petstrn_v2.dta') %>%
  select(id, transnum, findisp, trnsflag, terms, fice, state, cofcon, instype, itype)

# NLS student-transcript-term level data
nls_term <- read_dta('https://github.com/ozanj/rclass/raw/master/data/nls72/nls72petstrm_v2.dta') %>%
  select(id, transnum, termnum, courses, termtype, season, sortdate, gradcode, transfer)

Investigating student level nls_stu dataframe

 id  schcode  bysex  csex  crace  cbirthm  cbirthd  cbirthyr
 18     3000      2     1      4       12        9        53
 67     3000      1     1      2       10       14        53
 83     2518      2     2      7        3       10        54
174     2911      1     1      7        5       11        54
190      800      2     2      7        1        5        55
# Get a feel for the data
names(nls_stu)
#> [1] "id"       "schcode"  "bysex"    "csex"     "crace"    "cbirthm"  "cbirthd" 
#> [8] "cbirthyr"
glimpse(nls_stu)
#> Rows: 22,652
#> Columns: 8
#> $ id       <dbl> 18, 67, 83, 174, 190, 232, 315, 380, 414, 430, 497, 521, 554,…
#> $ schcode  <dbl> 3000, 3000, 2518, 2911, 800, 7507, 3000, 9516, 2518, 2701, 28…
#> $ bysex    <dbl+lbl>  2,  1,  2,  1,  2,  2,  1,  1,  1,  1, 99,  1,  1,  2,  …
#> $ csex     <dbl+lbl> 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, …
#> $ crace    <dbl+lbl> 4, 2, 7, 7, 7, 7, 7, 7, 7, 4, 7, 7, 7, 7, 7, 7, 7, 7, 7, …
#> $ cbirthm  <dbl> 12, 10, 3, 5, 1, 7, 3, 10, 9, 4, 6, 2, 5, 1, 10, 5, 9, 8, 12,…
#> $ cbirthd  <dbl> 9, 14, 10, 11, 5, 8, 1, 24, 3, 17, 11, 13, 28, 24, 9, 24, 20,…
#> $ cbirthyr <dbl> 53, 53, 54, 54, 55, 54, 54, 53, 53, 54, 54, 54, 54, 54, 54, 5…
nls_stu %>% var_label()
#> $id
#> [1] "unique school code of high school student attended"
#> 
#> $schcode
#> [1] "SCHOOL CODE"
#> 
#> $bysex
#> [1] "SEX OF STUDENT FROM BASE YEAR INSTRUMENT"
#> 
#> $csex
#> [1] "CSEX - SEX OF RESPONDENT"
#> 
#> $crace
#> [1] "RACE COMPOSITE"
#> 
#> $cbirthm
#> [1] "COMPOSITE BIRTH MONTH"
#> 
#> $cbirthd
#> [1] "COMPOSITE BIRTH DAY"
#> 
#> $cbirthyr
#> [1] "COMPOSITE BIRTH YEAR"
# We can investigate individual variables (e.g., bysex variable)
class(nls_stu$bysex)
#> [1] "haven_labelled" "vctrs_vctr"     "double"
str(nls_stu$bysex)
#>  dbl+lbl [1:22652]  2,  1,  2,  1,  2,  2,  1,  1,  1,  1, 99,  1,  1,  2, ...
#>  @ label       : chr "SEX OF STUDENT FROM BASE YEAR INSTRUMENT"
#>  @ format.stata: chr "%23.0g"
#>  @ labels      : Named num [1:4] 1 2 98 99
#>   ..- attr(*, "names")= chr [1:4] "1. male" "2. female" "98. {ILLEGITIMATE SKIP}" "99. {LEGITIMATE SKIP}"
nls_stu %>% select(bysex) %>% var_label()
#> $bysex
#> [1] "SEX OF STUDENT FROM BASE YEAR INSTRUMENT"
nls_stu %>% select(bysex) %>% val_labels()
#> $bysex
#>                 1. male               2. female 98. {ILLEGITIMATE SKIP} 
#>                       1                       2                      98 
#>   99. {LEGITIMATE SKIP} 
#>                      99
nls_stu %>% count(bysex)
#> # A tibble: 4 × 2
#>   bysex                            n
#>   <dbl+lbl>                    <int>
#> 1  1 [1. male]                  8208
#> 2  2 [2. female]                8340
#> 3 98 [98. {ILLEGITIMATE SKIP}]   135
#> 4 99 [99. {LEGITIMATE SKIP}]    5969
nls_stu %>% count(bysex) %>% as_factor()
#> # A tibble: 4 × 2
#>   bysex                       n
#>   <fct>                   <int>
#> 1 1. male                  8208
#> 2 2. female                8340
#> 3 98. {ILLEGITIMATE SKIP}   135
#> 4 99. {LEGITIMATE SKIP}    5969
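
To see where the dbl+lbl columns come from, a labelled vector like bysex can be constructed by hand (a toy sketch; the values and labels are copied from the output above):

```r
library(haven)

# Build a labelled double like bysex by hand (toy values;
# labels copied from the output above)
sex <- labelled(
  c(2, 1, 99),
  c("1. male" = 1, "2. female" = 2, "99. {LEGITIMATE SKIP}" = 99)
)

class(sex)      # includes "haven_labelled"
as_factor(sex)  # value labels become factor levels
```

This is why as_factor() is so handy for display: it swaps the underlying numeric codes for their value labels.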

Investigating student level nls_stu_pets dataframe

 id  reqtrans  numtrans
 18         1         0
 67         1         1
 83         1         0
315         2         2
414         1         0
names(nls_stu_pets)
#> [1] "id"       "reqtrans" "numtrans"
glimpse(nls_stu_pets)
#> Rows: 14,759
#> Columns: 3
#> $ id       <dbl> 18, 67, 83, 315, 414, 430, 802, 836, 935, 1040, 1057, 1099, 1…
#> $ reqtrans <dbl> 1, 1, 1, 2, 1, 1, 2, 2, 3, 2, 1, 3, 1, 2, 1, 3, 2, 3, 1, 1, 1…
#> $ numtrans <dbl> 0, 1, 0, 2, 0, 0, 1, 1, 2, 2, 1, 1, 1, 2, 1, 0, 2, 3, 1, 1, 1…
nls_stu_pets %>% var_label()
#> $id
#> [1] "unique student identification variable"
#> 
#> $reqtrans
#> [1] "NUMBER OF TRANSCRIPTS REQUESTED"
#> 
#> $numtrans
#> [1] "NUMBER OF TRANSCRIPTS RECEIVED"

Investigating student-transcript level nls_tran dataframe

 id  transnum  findisp  trnsflag  terms  fice  state  cofcon  instype  itype
 18         1        6         0     99  1009      1       6        3      1
 67         1        1         1      8  5694      1       2        1      5
 83         1        3         0     99  1166      5       4        4      4
315         1        1         1     20  1009      1       6        3      1
315         2        1         1      1  1057      1       6        3      2
names(nls_tran)
#>  [1] "id"       "transnum" "findisp"  "trnsflag" "terms"    "fice"    
#>  [7] "state"    "cofcon"   "instype"  "itype"
glimpse(nls_tran)
#> Rows: 24,253
#> Columns: 10
#> $ id       <dbl> 18, 67, 83, 315, 315, 414, 430, 802, 802, 836, 836, 935, 935,…
#> $ transnum <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 3, 1, 2, 1, 1, 2, 3, 1…
#> $ findisp  <dbl+lbl> 6, 1, 3, 1, 1, 3, 4, 1, 2, 4, 1, 1, 1, 3, 1, 1, 1, 4, 4, …
#> $ trnsflag <dbl+lbl> 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, …
#> $ terms    <dbl> 99, 8, 99, 20, 1, 99, 99, 2, 99, 99, 6, 11, 6, 99, 1, 10, 8, …
#> $ fice     <dbl> 1009, 5694, 1166, 1009, 1057, 1166, 2089, 2865, 3687, 1666, 2…
#> $ state    <dbl> 1, 1, 5, 1, 1, 5, 21, 33, 46, 14, 23, 15, 23, 15, 5, 33, 11, …
#> $ cofcon   <dbl+lbl> 6, 2, 4, 6, 6, 4, 4, 4, 3, 5, 3, 5, 6, 1, 6, 5, 6, 4, 4, …
#> $ instype  <dbl+lbl> 3, 1, 4, 3, 3, 4, 4, 4, 2, 2, 2, 2, 3, 1, 3, 2, 3, 4, 4, …
#> $ itype    <dbl+lbl> 1, 5, 4, 1, 2, 4, 4, 4, 3, 6, 4, 2, 2, 5, 1, 1, 2, 4, 4, …
nls_tran %>% var_label()
#> $id
#> [1] "unique student identification variable"
#> 
#> $transnum
#> [1] "TRANSCRIPT NUMBER"
#> 
#> $findisp
#> [1] "FINAL DISPOSITION"
#> 
#> $trnsflag
#> [1] "TRANSCRIPT FLAG"
#> 
#> $terms
#> [1] "NUMBER OF TERMS ON THIS TRANSCRIPT"
#> 
#> $fice
#> [1] "POSTSECONDARY INSTITUTION ID CODE"
#> 
#> $state
#> [1] "state"
#> 
#> $cofcon
#> [1] "OFFERING & CONTROL"
#> 
#> $instype
#> [1] "INSTITUTION TYPE"
#> 
#> $itype
#> [1] "INSTITUTION TYPE"
nls_tran %>% val_labels()
#> $id
#> NULL
#> 
#> $transnum
#> NULL
#> 
#> $findisp
#>      1. TRANSCRIPT RECEIVED           2. SCHOOL REFUSED 
#>                           1                           2 
#>   3. STUDENT NEVER ATTENDED 4. SCHOOL LOST OR DESTROYED 
#>                           3                           4 
#>            5. SCHOOL CLOSED  6. NO RESPONSE FROM SCHOOL 
#>                           5                           6 
#> 
#> $trnsflag
#>     0. DUMMY TRANSCRIPT 1. REQUESTED & RECEIVED 
#>                       0                       1 
#> 
#> $terms
#> NULL
#> 
#> $fice
#> NULL
#> 
#> $state
#> NULL
#> 
#> $cofcon
#>          1. proprietary      2. PUBLIC < 2-YEAR 3. PRIVATE NFP < 4-YEAR 
#>                       1                       2                       3 
#>        4. PUBLIC 2-YEAR   5. PRIVATE NFP 4-YEAR        6. PUBLIC 4-YEAR 
#>                       4                       5                       6 
#> 
#> $instype
#>   1. proprietary   2. PRIVATE NFP 3. PUBLIC 4-YEAR 4. PUBLIC 2-YEAR 
#>                1                2                3                4 
#>       5. foreign 
#>                5 
#> 
#> $itype
#>   1. Research & Doctoral         2. Comprehensive          3. Liberal Arts 
#>                        1                        2                        3 
#>                4. 2-year      5. Less than 2-year           6. Specialized 
#>                        4                        5                        6 
#> 8. Special comprehensive 
#>                        8

Investigating student-transcript-term level nls_term dataframe

id  transnum  termnum  courses  termtype  season  sortdate  gradcode  transfer
67         1        1        3         4       2      8101         1         0
67         1        2        2         4       3      8104         1         0
67         1        3        2         4       1      8109         1         0
67         1        4        2         4       2      8201         1         0
67         1        5        2         4       3      8204         1         0
names(nls_term)
#> [1] "id"       "transnum" "termnum"  "courses"  "termtype" "season"   "sortdate"
#> [8] "gradcode" "transfer"
glimpse(nls_term)
#> Rows: 120,885
#> Columns: 9
#> $ id       <dbl> 67, 67, 67, 67, 67, 67, 67, 67, 315, 315, 315, 315, 315, 315,…
#> $ transnum <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ termnum  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12…
#> $ courses  <dbl> 3, 2, 2, 2, 2, 2, 2, 2, 4, 3, 5, 4, 4, 4, 4, 5, 1, 4, 5, 5, 5…
#> $ termtype <dbl+lbl> 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
#> $ season   <dbl+lbl> 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 4, …
#> $ sortdate <dbl> 8101, 8104, 8109, 8201, 8204, 8207, 8209, 8301, 7209, 7301, 7…
#> $ gradcode <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ transfer <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
nls_term %>% var_label()
#> $id
#> [1] "unique student identification variable"
#> 
#> $transnum
#> [1] "TRANSCRIPT NUMBER"
#> 
#> $termnum
#> [1] "TERM NUMBER"
#> 
#> $courses
#> [1] "NUMBER OF COURSES TERM"
#> 
#> $termtype
#> [1] "TYPE OF TERM"
#> 
#> $season
#> [1] "SEASON OF TERM"
#> 
#> $sortdate
#> [1] "DATE OF TERM"
#> 
#> $gradcode
#> [1] "GRADE SCALE TYPE"
#> 
#> $transfer
#> [1] "TRANSFER COURSES FLAG"
nls_term %>% val_labels()
#> $id
#> NULL
#> 
#> $transnum
#> NULL
#> 
#> $termnum
#> NULL
#> 
#> $courses
#> NULL
#> 
#> $termtype
#> 1. VARIABLE LENGTH OR NONCOURSE TERM                          2. semester 
#>                                    1                                    2 
#>                         3. trimester                           4. quarter 
#>                                    3                                    4 
#>             5. CREDIT BY EXAMINATION                         7. {UNKNOWN} 
#>                                    5                                    7 
#> 
#> $season
#>      1. fall    2. winter    3. spring    4. summer 9. {MISSING} 
#>            1            2            3            4            9 
#> 
#> $sortdate
#> NULL
#> 
#> $gradcode
#>  1. LETTER GRADES 2. NUMERIC GRADES 
#>                 1                 2 
#> 
#> $transfer
#>  0. NOT TRANSFER 1. TRANSFER TERM 
#>                0                1

1.2 Lecture overview

It is rare for an analysis dataset to consist of data from only one input dataset. For most projects, each analysis dataset contains data from multiple data sources. Therefore, you must become proficient in combining data from multiple data sources.

Two broad topics:

  1. Joining datasets [big topic]
    • Combine two datasets so that the resulting dataset has additional variables
    • The term “join” comes from the relational database world; social science research often uses the term “merge” rather than “join”
  2. Appending datasets [small topic]
    • Stack datasets on top of one another so that resulting dataset has more observations, but (typically) the same number of variables
    • Often, longitudinal datasets are created by appending datasets

Wickham differentiates between mutating joins and filtering joins:

  • Mutating joins “add new variables to one data frame from matching observations in another”
  • Filtering joins “filter observations from one data frame based on whether or not they match an observation in the other table”
    • Doesn’t add new variables
    • Usually, the purpose is to check the quality of mutating joins (e.g., which observations didn’t match)

Our main focus today is on mutating joins. But filtering joins are useful for data quality checks of mutating joins.
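
To preview what that distinction looks like, here is a sketch with two toy tibbles (`x` and `y` are made-up data, not from NLS72):

```r
library(dplyr)

# Toy tables (made up; not from NLS72)
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# Mutating join: adds val_y to the rows of x that match y
inner_join(x, y, by = "key")

# Filtering join: rows of x that did NOT match y; no new variables added
anti_join(x, y, by = "key")  # the row with key == 3
```

The anti_join() result is exactly the data-quality check described above: it shows which observations failed to match.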

1.3 Relational databases and tables

Traditionally, social scientists store data in “flat files”

  • Flat files are “rectangular” datasets consisting of columns (usually called variables) and rows (usually called observations)
  • When we want to incorporate variables from two flat files, we “merge” them together

The rest of the world stores data in “relational databases”

  • A relational database consists of multiple tables
  • Each table is a flat file
  • A goal of relational databases is to store data using the minimum possible space; therefore, a rule is to never duplicate information across tables
  • When you need information from multiple tables, you “join” (the database term for “merge”) tables “on the fly” rather than creating some permanent flat file that contains data from multiple tables
  • Each table in a relational database has a different “observational level”
    • For example, NLS72 has a student level table, a student-transcript level table, etc.
    • From the perspective of a database person, it wouldn’t make sense to store student level variables (e.g., birth-date) on the student-transcript level table because student birth-date does not vary by transcript, so this would result in needless duplication of data
  • Structured Query Language (SQL) is the universally-used programming language for relational databases

Real-world examples of relational databases

  • iTunes
    • Behind the scenes, there are separate tables for artist, album, song, genre, etc.
    • The “library” you see as a user is the result of “on the fly” joins of these tables
  • Every social media app (e.g., X (formerly Twitter), FB, Instagram) you use is a relational database
    • What you see as a user is the result of “on the fly” joins of individual tables running in the background
  • Interactive maps typically have relational databases running in the background
    • Clicking on some part of the map triggers a join of tables and you see the result of some analysis based on that join
    • E.g., our off-campus recruiting map

Should you think of combining dataframes in R as “merging” flat files or “joining” tables of a relational database?

  • Can think of it either way; but I think better to think of it both ways
  • For example, you can think of the NLS72 datasets as:
    • A bunch of flat files that we merge
    • OR a set of tables that comprise a relational database
  • Although we are combining flat files, tidyverse uses the terminology (e.g, “keys”, “join”) of relational databases for doing so

2 Keys

What are keys?

  • Keys are “the variables used to connect each pair of tables” [see here]
  • Even though relational databases often consist of many tables, relations are always defined between a pair of tables
  • When joining tables, focus on joining one table to another; you make this “join” using the key variable(s) that define the relationship between these two tables
  • Even when your analysis requires variables from more than two tables, you proceed by joining one pair of tables at a time


There are two types of keys: primary keys and foreign keys

  • “The main reason for primary and foreign keys is to enforce data consistency” [see here]
  • Although primary and foreign keys do not technically need to be defined in order to perform a join, they are important for maintaining database integrity

2.1 Primary keys

What is a primary key?

  • A primary key uniquely identifies an observation in its own table
  • It is a variable (or combination of variables) in a table that uniquely identifies observations in its own table


Example: id is the primary key in the nls_stu table

The variable id (“unique student identification variable”) uniquely identifies observations in the dataset nls_stu. In other words, no two observations in nls_stu have the same value of id.

Let’s confirm that each value of id is associated with only one observation:

nls_stu %>% group_by(id) %>%  # group by primary key
  summarise(n_per_id=n()) %>%  # create a measure of number of observations per group
  ungroup() %>%  # ungroup, otherwise frequency table [next step] created separately for each group
  count(n_per_id)  # frequency of number of observations per group
#> # A tibble: 1 × 2
#>   n_per_id     n
#>      <int> <int>
#> 1        1 22652

An alternative way to confirm that no values of id have more than one observation:

nls_stu %>% 
  count(id) %>%  # create object that counts the number of obs for each value of id
  filter(n>1)  # keep only rows where count of obs per id is greater than 1
#> # A tibble: 0 × 2
#> # ℹ 2 variables: id <dbl>, n <int>
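
A third, more compact check: a variable is a primary key if the number of rows equals the number of distinct key values. Sketched here on a toy tibble (`stu` is made up; its values echo the nls_stu preview above):

```r
library(dplyr)

# Toy student-level tibble (made up; values echo the nls_stu preview above)
stu <- tibble(id = c(18, 67, 83), schcode = c(3000, 3000, 2518))

# id is a primary key if every row has a distinct id value
nrow(stu) == n_distinct(stu$id)  # TRUE
```

The same one-liner applies to the real data: `nrow(nls_stu) == n_distinct(nls_stu$id)`.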


Example: id and transnum make up the primary key in the nls_tran table

Sometimes, more than one variable is required to make up the primary key in a table. For example, in the student-transcript level table nls_tran, both id and transnum are needed to uniquely identify the observations in the table:

nls_tran %>%
  group_by(id, transnum) %>%
  summarise(n_per_key=n()) %>% 
  ungroup() %>% 
  count(n_per_key)
#> # A tibble: 1 × 2
#>   n_per_key     n
#>       <int> <int>
#> 1         1 24253

2.1.1 Student exercise

Which variable(s) make up the primary key in the nls_term table?

Solution

The variables id, transnum, and termnum make up the primary key for the nls_term table because these variables together uniquely identify each observation in the table:

nls_term %>% 
  group_by(id, transnum, termnum) %>% 
  summarise(n_per_key=n()) %>% 
  ungroup() %>% 
  count(n_per_key)
#> # A tibble: 1 × 2
#>   n_per_key      n
#>       <int>  <int>
#> 1         1 120885
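
The same idea extends to compound keys via distinct(), sketched here on a made-up tibble (`term` only echoes the structure of nls_term):

```r
library(dplyr)

# Made-up tibble echoing the structure of nls_term
term <- tibble(
  id       = c(67, 67, 315),
  transnum = c(1, 1, 1),
  termnum  = c(1, 2, 1)
)

# The three variables form a primary key if no combination of values repeats
nrow(term) == nrow(distinct(term, id, transnum, termnum))  # TRUE
```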

2.2 Foreign keys

What is a foreign key?

  • A foreign key uniquely identifies an observation in another table
  • Said differently, it is a variable (or combination of variables) in a table that is the primary key in another table
  • A foreign key creates a relationship between two tables


Example: With respect to datasets nls_stu and nls_tran

  • Recall that the student id is the primary key in nls_stu because it uniquely identifies each observation in the table.
  • The variable id in nls_tran is a foreign key because it uniquely identifies observations in nls_stu (where it is the primary key), even though it does not uniquely identify observations in nls_tran
nls_tran %>% group_by(id) %>% summarise(n_per_key=n()) %>% ungroup %>% count(n_per_key)
#> # A tibble: 7 × 2
#>   n_per_key     n
#>       <int> <int>
#> 1         1  8022
#> 2         2  4558
#> 3         3  1681
#> 4         4   415
#> 5         5    71
#> 6         6     6
#> 7         7     3
nls_stu %>% group_by(id) %>% summarise(n_per_key=n()) %>% ungroup %>% count(n_per_key)
#> # A tibble: 1 × 2
#>   n_per_key     n
#>       <int> <int>
#> 1         1 22652


Ozan’s way of thinking about foreign key

Personally, I find the concept foreign key a little bit slippery. Here is how I wrap my head around it:

  • First, always remember that “joins” happen between two specific tables, so have two specific tables in mind
  • Second, to understand foreign key concept, I think of a “focal table” [my term] (e.g., nls_tran) and some “other table” (e.g., nls_stu)
  • Third, then, the foreign key is a variable (or combination of variables) that satisfies two conditions:
    1. Exists in the “focal table” (but may or may not be the primary key for the focal table)
    2. Exists in the “other table” AND is the primary key for that “other table”

Example of foreign key

  • With respect to the “focal table” nls_tran and the “other table” nls_stu, the variable id is the foreign key because:
    • id exists in the “focal table” nls_tran (though it does not uniquely identify observations in nls_tran)
    • id exists in the “other table” nls_stu AND uniquely identifies observations in nls_stu
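
One way to sanity-check a foreign key is to confirm that every key value in the focal table also appears in the other table. Here is a sketch with toy tibbles (`stu` and `tran` are made up, standing in for nls_stu and nls_tran):

```r
library(dplyr)

# Made-up stand-ins: stu is student level (primary key id),
# tran is student-transcript level (foreign key id)
stu  <- tibble(id = c(18, 67, 83))
tran <- tibble(id = c(18, 67, 67), transnum = c(1, 1, 2))

# Rows of tran whose id has no match in stu; 0 rows means every
# foreign key value points at an existing student
tran %>% anti_join(stu, by = "id") %>% nrow()  # 0
```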

2.2.1 Student exercise

With respect to nls_tran, which variable(s) make up the foreign key in the nls_term table?

Solution

Since id and transnum make up the primary key in the nls_tran table, these variables make up the foreign key in nls_term because they can be used to link the two tables:

# id and transnum make up the primary key in nls_tran because they uniquely identify each obs
nls_tran %>% 
  group_by(id, transnum) %>% 
  summarise(n_per_key=n()) %>% 
  ungroup() %>% 
  count(n_per_key)
#> # A tibble: 1 × 2
#>   n_per_key     n
#>       <int> <int>
#> 1         1 24253
# id and transnum also both exist in nls_term, so they form the foreign key that links the 2 tables
names(nls_term)
#> [1] "id"       "transnum" "termnum"  "courses"  "termtype" "season"   "sortdate"
#> [8] "gradcode" "transfer"

2.3 Requirements for joining two tables

In practice, you join two tables without explicitly thinking about “primary key” vs. “foreign key” or “focal table” vs. “other table”

  • It doesn’t matter which data frame you “start with” (e.g., as the “focal table”)

The only requirements for joining are:

  1. One of the two data frames has a primary key (a variable or combination of variables that uniquely identifies observations), AND
  2. That variable or combination of variables is available in the other data frame

With this in mind, let’s examine the dataframes nls_stu and nls_tran

# requirement 1: at least one of the two data frames must have a primary key
  # let's check if id is the primary key in nls_stu
nls_stu %>% group_by(id) %>% summarise(n_per_key=n()) %>% ungroup %>% count(n_per_key)
#> # A tibble: 1 × 2
#>   n_per_key     n
#>       <int> <int>
#> 1         1 22652
# requirement 2: that primary key (id) is available in the other of the two data frames
names(nls_tran)
#>  [1] "id"       "transnum" "findisp"  "trnsflag" "terms"    "fice"    
#>  [7] "state"    "cofcon"   "instype"  "itype"
nls_tran %>% arrange(id, transnum) %>% select(id,transnum,findisp,itype) %>% head(20)
#> # A tibble: 20 × 4
#>       id transnum findisp                         itype                     
#>    <dbl>    <dbl> <dbl+lbl>                       <dbl+lbl>                 
#>  1    18        1 6 [6. NO RESPONSE FROM SCHOOL]  1 [1. Research & Doctoral]
#>  2    67        1 1 [1. TRANSCRIPT RECEIVED]      5 [5. Less than 2-year]   
#>  3    83        1 3 [3. STUDENT NEVER ATTENDED]   4 [4. 2-year]             
#>  4   315        1 1 [1. TRANSCRIPT RECEIVED]      1 [1. Research & Doctoral]
#>  5   315        2 1 [1. TRANSCRIPT RECEIVED]      2 [2. Comprehensive]      
#>  6   414        1 3 [3. STUDENT NEVER ATTENDED]   4 [4. 2-year]             
#>  7   430        1 4 [4. SCHOOL LOST OR DESTROYED] 4 [4. 2-year]             
#>  8   802        1 1 [1. TRANSCRIPT RECEIVED]      4 [4. 2-year]             
#>  9   802        2 2 [2. SCHOOL REFUSED]           3 [3. Liberal Arts]       
#> 10   836        1 4 [4. SCHOOL LOST OR DESTROYED] 6 [6. Specialized]        
#> 11   836        2 1 [1. TRANSCRIPT RECEIVED]      4 [4. 2-year]             
#> 12   935        1 1 [1. TRANSCRIPT RECEIVED]      2 [2. Comprehensive]      
#> 13   935        2 1 [1. TRANSCRIPT RECEIVED]      2 [2. Comprehensive]      
#> 14   935        3 3 [3. STUDENT NEVER ATTENDED]   5 [5. Less than 2-year]   
#> 15  1040        1 1 [1. TRANSCRIPT RECEIVED]      1 [1. Research & Doctoral]
#> 16  1040        2 1 [1. TRANSCRIPT RECEIVED]      1 [1. Research & Doctoral]
#> 17  1057        1 1 [1. TRANSCRIPT RECEIVED]      2 [2. Comprehensive]      
#> 18  1099        1 4 [4. SCHOOL LOST OR DESTROYED] 4 [4. 2-year]             
#> 19  1099        2 4 [4. SCHOOL LOST OR DESTROYED] 4 [4. 2-year]             
#> 20  1099        3 1 [1. TRANSCRIPT RECEIVED]      5 [5. Less than 2-year]
#nls_tran %>% group_by(id) %>% summarise(n_per_key=n()) %>% ungroup %>% count(n_per_key)

3 Mutating joins

Following Wickham, we’ll explain joins using the hypothetical tables x and y:

Note: Credit for the examples and images belongs to Wickham


What is a join?

  • A join “is a way of connecting each row in x to zero, one, or more rows in y”
  • Observations in table x are matched to observations in table y using a “key” variable
  • A key is a variable (or combination of variables) that exists in both tables and uniquely identifies observations in at least one of the two tables


There are four types of joins between tables x and y:

  • inner join: keep all observations that appear in both table x and table y
  • left join: keep all observations in x (regardless of whether these obs appear in y)
  • right join: keep all observations in y (regardless of whether these obs appear in x)
  • full join: keep all observations that appear in x or in y

The last three joins – left, right, full – keep observations that appear in at least one table and are collectively referred to as outer joins.


We will join tables x and y using the join functions from the dplyr package. There is a dedicated function for each type of join:

  • inner_join()
  • left_join()
  • right_join()
  • full_join()

3.1 Inner joins

Inner joins keep all observations that appear in both table x and table y (i.e., keep observations 1 and 2):


The inner_join() function:

?inner_join

# SYNTAX AND DEFAULT VALUES
inner_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  na_matches = c("na", "never")
)
  • Function: Perform inner join between dataframes x and y
  • Arguments:
    • x, y: Dataframes to join
    • by: Variable(s) to join by
      • If joining by single variable: inner_join(x, y, by = "keyvar")
      • If joining by multiple variables: inner_join(x, y, by = c("keyvar1", "keyvar2", ...))
      • If variables to join by have different names: inner_join(x, y, by = c("keyvarx" = "keyvary", ...))
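
The by = c("keyvarx" = "keyvary") form deserves a quick illustration. Here is a sketch with hypothetical tables (`students`, `scores`, and all column names are made up):

```r
library(dplyr)

# Hypothetical tables where the key has different names
students <- tibble(stu_id = c(1, 2), name = c("Ana", "Ben"))
scores   <- tibble(id = c(1, 2), score = c(90, 85))

# Match stu_id (in students) against id (in scores);
# the result keeps the left table's name, stu_id
inner_join(students, scores, by = c("stu_id" = "id"))
```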


Example: Using inner_join() to join x and y

Table `x`

  key  val_x
    1  x1
    2  x2
    3  x3

Table `y`

  key  val_y
    1  y1
    2  y2
    4  y3
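
So the join calls below can be run as written, the two toy tables can be constructed as tibbles (a sketch; values copied from the display above):

```r
library(dplyr)

# The toy tables above as tibbles (values copied from the display)
x <- tribble(
  ~key, ~val_x,
  1,    "x1",
  2,    "x2",
  3,    "x3"
)

y <- tribble(
  ~key, ~val_y,
  1,    "y1",
  2,    "y2",
  4,    "y3"
)
```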

We want to join x and y by the common variable key:

inner_join(x, y, by = "key")  # Equivalent to: inner_join(x, y, by = c("key" = "key"))
#> # A tibble: 2 × 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2
inner_join(x = x, y = y, by = "key")
#> # A tibble: 2 × 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2


This is an example of a one-to-one relationship because key uniquely identifies observations in both table x and table y. In other words, key can serve as the primary key for both x and y, and as the foreign key linking the two tables.


Example: One-to-one inner join between nls_stu and nls_stu_pets

Recall that nls_stu and nls_stu_pets are both student level data

  • nls_stu contains info on student demographics; nls_stu_pets contains info on student postsecondary education transcripts (PETS)
  • It makes sense to join the tables by student, using the student id variable. This will be a one-to-one join because id uniquely identifies the observations in both nls_stu and nls_stu_pets:
nls_stu %>% group_by(id) %>% summarise(n_per_id=n()) %>% ungroup() %>%
  count(n_per_id) 
#> # A tibble: 1 × 2
#>   n_per_id     n
#>      <int> <int>
#> 1        1 22652
nls_stu_pets %>% group_by(id) %>% summarise(n_per_id=n()) %>% ungroup() %>%
  count(n_per_id) 
#> # A tibble: 1 × 2
#>   n_per_id     n
#>      <int> <int>
#> 1        1 14759
# Join the nls_stu and nls_stu_pets by id variable
nls_stu_stu <- nls_stu %>% inner_join(nls_stu_pets, by = "id")

# Joined dataframe now has variables from both nls_stu and nls_stu_pets
names(nls_stu_stu)
#>  [1] "id"       "schcode"  "bysex"    "csex"     "crace"    "cbirthm" 
#>  [7] "cbirthd"  "cbirthyr" "reqtrans" "numtrans"
# Print a few obs and selected variables from joined dataframe
nls_stu_stu %>% select(id, bysex, crace, reqtrans, numtrans) %>% as_factor()
#> # A tibble: 14,759 × 5
#>       id bysex                 crace           reqtrans numtrans
#>    <dbl> <fct>                 <fct>              <dbl>    <dbl>
#>  1    18 2. female             4. PUERTO RICAN        1        0
#>  2    67 1. male               2. black               1        1
#>  3    83 2. female             7. white               1        0
#>  4   315 1. male               7. white               2        2
#>  5   414 1. male               7. white               1        0
#>  6   430 1. male               4. PUERTO RICAN        1        0
#>  7   802 2. female             7. white               2        1
#>  8   836 1. male               7. white               2        1
#>  9   935 99. {LEGITIMATE SKIP} 7. white               3        2
#> 10  1040 1. male               7. white               2        2
#> # ℹ 14,749 more rows
# Compare number of obs in input datasets to joined dataset
nrow(nls_stu)
#> [1] 22652
nrow(nls_stu_pets)
#> [1] 14759
# Remember: inner join only keeps obs that appear in both tables, so the joined dataset cannot have more rows than either input dataset
nrow(nls_stu_stu)
#> [1] 14759

3.1.1 Duplicate keys

There are times when the key variable(s) we join by do not uniquely identify observations in both tables. For example, table y below has unique values of key, but table x contains multiple observations with the same value:

This represents a many-to-one relationship: multiple observations in x match a single observation in y. Since key uniquely identifies observations in y, it is the primary key in table y and a foreign key in table x.


Example: Using inner_join() to join x and y

Table `x`

  key  val_x
    1  x1
    2  x2
    2  x3
    1  x4

Table `y`

  key  val_y
    1  y1
    2  y2
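
These duplicate-key tables can likewise be constructed as tibbles (a sketch; values copied from the display above):

```r
library(dplyr)

# Duplicate-key toy tables (values copied from the display)
x <- tribble(
  ~key, ~val_x,
  1,    "x1",
  2,    "x2",
  2,    "x3",
  1,    "x4"
)

y <- tribble(
  ~key, ~val_y,
  1,    "y1",
  2,    "y2"
)
```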

We can confirm that x and y have a many-to-one relationship because key does not uniquely identify observations in x but does uniquely identify observations in y:

x %>% group_by(key) %>% summarise(n_per_key=n()) %>% ungroup() %>% count(n_per_key)
#> # A tibble: 1 × 2
#>   n_per_key     n
#>       <int> <int>
#> 1         2     2
y %>% group_by(key) %>% summarise(n_per_key=n()) %>% ungroup() %>% count(n_per_key)
#> # A tibble: 1 × 2
#>   n_per_key     n
#>       <int> <int>
#> 1         1     2

We want to join x and y by the common variable key:

inner_join(x, y, by = "key")  # Equivalent to: inner_join(x, y, by = c("key" = "key"))
#> # A tibble: 4 × 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2   
#> 3     2 x3    y2   
#> 4     1 x4    y1


Summary of the types of relationships

General rule of thumb for joining two tables:

  • Key variable should uniquely identify observations in at least one of the tables you are joining

Depending on whether the key variable uniquely identifies observations in table x and/or table y you will have:

  • one-to-one join: key variable uniquely identifies observations in table x and uniquely identifies observations in table y
    • The join between nls_stu and nls_stu_pets was a one-to-one join; the variable id uniquely identifies observations in both tables
    • In the relational database world, one-to-one joins are rare and are considered special cases of one-to-many or many-to-one joins
      • Why? If two tables can be joined one-to-one, their contents could simply be stored in a single table.
  • one-to-many join: key variable uniquely identifies observations in table x and does not uniquely identify observations in table y
    • Each observation from table x may match to multiple observations from table y
    • E.g., Joining nls_stu and nls_trans: one observation (student) in nls_stu has many observations in nls_trans (transcripts)
  • many-to-one join: key variable does not uniquely identify observations in table x and does uniquely identify observations in table y
    • Each observation from table y may match to multiple observations from table x
    • E.g., Joining nls_trans and nls_stu: many observations (transcripts) in nls_trans have one observation in nls_stu (student)
  • many-to-many join: key variable does not uniquely identify observations in table x and does not uniquely identify observations in table y
    • This is usually an error
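If you are on dplyr 1.1.0 or later, you can declare the relationship you expect directly in the join call, and dplyr raises an error if the data violate it. A minimal sketch with made-up tables (the relationship argument does not exist in older dplyr versions):

```r
library(dplyr)
library(tibble)

stu  <- tibble(id = c(1, 2, 3), sex = c("f", "m", "f"))   # one row per student
tran <- tibble(id = c(1, 1, 2), transnum = c(1, 2, 1))    # one row per transcript

# Errors if `id` turns out not to uniquely identify rows of `stu`
stu %>% inner_join(tran, by = "id", relationship = "one-to-many")
```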


Example: One-to-many inner join between nls_stu and nls_tran

Recall that nls_stu contains student level demographics data, where the student id variable uniquely identifies each observation in the table. nls_tran, on the other hand, is student-transcript level data, so id does not uniquely identify each observation. Thus, if we join by id, it is a one-to-many relationship:

# id uniquely identifies obs in nls_stu
nls_stu %>% group_by(id) %>% summarise(n_per_key=n()) %>% ungroup() %>%
  count(n_per_key) 
#> # A tibble: 1 × 2
#>   n_per_key     n
#>       <int> <int>
#> 1         1 22652
# id does not uniquely identify obs in nls_tran
nls_tran %>% group_by(id) %>% summarise(n_per_key=n()) %>% ungroup() %>%
  count(n_per_key) 
#> # A tibble: 7 × 2
#>   n_per_key     n
#>       <int> <int>
#> 1         1  8022
#> 2         2  4558
#> 3         3  1681
#> 4         4   415
#> 5         5    71
#> 6         6     6
#> 7         7     3
# Join nls_stu and nls_tran by the id variable
nls_stu_tran <- nls_stu %>% inner_join(nls_tran, by = "id")

# Joined dataframe now has variables from both nls_stu and nls_tran
names(nls_stu_tran)
#>  [1] "id"       "schcode"  "bysex"    "csex"     "crace"    "cbirthm" 
#>  [7] "cbirthd"  "cbirthyr" "transnum" "findisp"  "trnsflag" "terms"   
#> [13] "fice"     "state"    "cofcon"   "instype"  "itype"
# Print a few obs and selected variables from joined dataframe
nls_stu_tran %>% select(id, transnum, bysex, findisp, trnsflag) %>% as_factor()
#> # A tibble: 24,253 × 5
#>       id transnum bysex     findisp                     trnsflag               
#>    <dbl>    <dbl> <fct>     <fct>                       <fct>                  
#>  1    18        1 2. female 6. NO RESPONSE FROM SCHOOL  0. DUMMY TRANSCRIPT    
#>  2    67        1 1. male   1. TRANSCRIPT RECEIVED      1. REQUESTED & RECEIVED
#>  3    83        1 2. female 3. STUDENT NEVER ATTENDED   0. DUMMY TRANSCRIPT    
#>  4   315        1 1. male   1. TRANSCRIPT RECEIVED      1. REQUESTED & RECEIVED
#>  5   315        2 1. male   1. TRANSCRIPT RECEIVED      1. REQUESTED & RECEIVED
#>  6   414        1 1. male   3. STUDENT NEVER ATTENDED   0. DUMMY TRANSCRIPT    
#>  7   430        1 1. male   4. SCHOOL LOST OR DESTROYED 0. DUMMY TRANSCRIPT    
#>  8   802        1 2. female 1. TRANSCRIPT RECEIVED      1. REQUESTED & RECEIVED
#>  9   802        2 2. female 2. SCHOOL REFUSED           0. DUMMY TRANSCRIPT    
#> 10   836        1 1. male   4. SCHOOL LOST OR DESTROYED 0. DUMMY TRANSCRIPT    
#> # ℹ 24,243 more rows
# Compare number of obs in input datasets to joined dataset
nrow(nls_stu)
#> [1] 22652
nrow(nls_tran)
#> [1] 24253
# Since each id in nls_stu may match to more than one observation in nls_tran, it makes sense that the joined dataset may have more rows than nls_stu
nrow(nls_stu_tran)
#> [1] 24253

3.1.2 Joining by multiple variables

Thus far, tables have been joined by a single “key” variable using this syntax:

  • inner_join(x, y, by = "keyvar")

Often, multiple variables form the “key”. Specify this using this syntax:

  • inner_join(x, y, by = c("keyvar1", "keyvar2", ...))
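A related wrinkle: when the key columns have different names in the two tables, map them inside by =. A sketch with hypothetical tables whose student identifier is named stu_id on the left and id on the right:

```r
library(dplyr)
library(tibble)

students    <- tibble(stu_id = c(1, 2), sex = c("f", "m"))
transcripts <- tibble(id = c(1, 1, 2), transnum = c(1, 2, 1))

# "stu_id" in students is matched against "id" in transcripts;
# the joined result keeps the left table's name, stu_id
inner_join(students, transcripts, by = c("stu_id" = "id"))
```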


Example: Joining nls_tran and nls_term by multiple variables

Recall that the student-transcript level dataset nls_tran is uniquely identified by the combination of two variables: id and transnum. Together, these variables form the primary key of nls_tran and a foreign key in nls_term, the table we want to join with nls_tran.

# Join nls_tran and nls_term by multiple variables
inner_join(nls_tran, nls_term, by = c("id", "transnum"))
#> # A tibble: 120,807 × 17
#>       id transnum findisp     trnsflag terms  fice state cofcon  instype itype  
#>    <dbl>    <dbl> <dbl+lbl>   <dbl+lb> <dbl> <dbl> <dbl> <dbl+l> <dbl+l> <dbl+l>
#>  1    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  2    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  3    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  4    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  5    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  6    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  7    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  8    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  9   315        1 1 [1. TRAN… 1 [1. R…    20  1009     1 6 [6. … 3 [3. … 1 [1. …
#> 10   315        1 1 [1. TRAN… 1 [1. R…    20  1009     1 6 [6. … 3 [3. … 1 [1. …
#> # ℹ 120,797 more rows
#> # ℹ 7 more variables: termnum <dbl>, courses <dbl>, termtype <dbl+lbl>,
#> #   season <dbl+lbl>, sortdate <dbl>, gradcode <dbl+lbl>, transfer <dbl+lbl>
# Equivalent to: inner_join(nls_tran, nls_term, by = c("id" = "id", "transnum" = "transnum"))

3.2 Outer joins

Outer joins keep observations that appear in at least one table. Recall tables x and y:


To perform each of the 3 types of outer joins:

  • Left join: keep all observations in x, regardless of whether they appear in y (i.e., keep observations 1, 2, and 3)
  • Right join: keep all observations in y, regardless of whether they appear in x (i.e., keep observations 1, 2, and 4)
  • Full join: keep all observations that appear in x or in y (i.e., keep observations 1, 2, 3, and 4)

Note that any unmatched rows will have NA values in the columns that come from the other table. As explained by Wickham, “These joins work by adding an additional ‘virtual’ observation to each table. This observation has a key that always matches (if no other key matches), and a value filled with NA.”
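A sketch of all three outer joins, assuming x contains keys 1, 2, 3 and y contains keys 1, 2, 4 as described above (the val_* columns are made up):

```r
library(dplyr)
library(tibble)

x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

left_join(x, y, by = "key")   # keys 1, 2, 3; val_y is NA for key 3
right_join(x, y, by = "key")  # keys 1, 2, 4; val_x is NA for key 4
full_join(x, y, by = "key")   # keys 1, 2, 3, 4
```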


Example: Left joins between nls_stu, nls_stu_pets, and nls_tran

The left join is the most commonly used outer join in social science research (more common than inner join too). This is because we often start with some dataset x (e.g., nls_stu) and we want to add variables from dataset y.

  • Usually, we want to keep observations from x regardless of whether they match with y
  • Usually, we are uninterested in observations from y that did not match with x

In this example, we will:

  • Start with nls_stu
  • Perform a left join with nls_stu_pets
  • Then perform a left join with nls_tran
# Recall that id uniquely identifies both nls_stu and nls_stu_pets
nls_stu %>% group_by(id) %>% summarise(n_per_key=n()) %>% ungroup() %>%
  count(n_per_key)
#> # A tibble: 1 × 2
#>   n_per_key     n
#>       <int> <int>
#> 1         1 22652
nls_stu_pets %>% group_by(id) %>% summarise(n_per_key=n()) %>% ungroup() %>%
  count(n_per_key)
#> # A tibble: 1 × 2
#>   n_per_key     n
#>       <int> <int>
#> 1         1 14759
# Left join nls_stu with nls_stu_pets (one-to-one)
nls_stu_pets_stu <- nls_stu %>% left_join(nls_stu_pets, by = "id")

# Investigate data structure of merged object
nrow(nls_stu_pets)
#> [1] 14759
nrow(nls_stu)
#> [1] 22652
nrow(nls_stu_pets_stu)
#> [1] 22652
# Recall that id and transnum (not id alone) uniquely identify nls_tran
nls_tran %>% group_by(id, transnum) %>% summarise(n_per_key=n()) %>% ungroup() %>% count(n_per_key)
#> # A tibble: 1 × 2
#>   n_per_key     n
#>       <int> <int>
#> 1         1 24253
# Left join nls_stu_pets_stu with nls_tran (one-to-many)
nls_stu_pets_stu_tran <- nls_stu_pets_stu %>% left_join(nls_tran, by = "id")

# Investigate data structure of merged object
nrow(nls_stu_pets_stu)
#> [1] 22652
nrow(nls_tran)
#> [1] 24253
nrow(nls_stu_pets_stu_tran)
#> [1] 32149
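The final row count can be reconciled by hand. From the frequency table printed earlier, 14,756 distinct students appear in nls_tran; every student in nls_stu without a transcript contributes one all-NA row instead. A quick arithmetic check (numbers copied from the output above):

```r
# Distinct students in nls_tran (sum of the n column in the frequency table above)
n_stu_with_tran <- 8022 + 4558 + 1681 + 415 + 71 + 6 + 3   # 14756

# Students with no transcripts get one unmatched row each in the left join
n_unmatched <- 22652 - n_stu_with_tran                     # 7896

# Matched transcript rows + unmatched student rows
24253 + n_unmatched                                        # 32149
```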

4 Filtering joins

What are filtering joins?

  • Filtering joins filter rows from x based on the presence or absence of matches in y
  • Unlike mutating joins, which return an object with variables (columns) from both x and y, filtering joins retain only the variables from x
  • In other words, filtering joins keep or drop observations (rows) from the x dataframe depending on whether they have a match in y


There are two types of filtering joins:

  • anti_join(x, y): return all rows from x without a match in y
  • semi_join(x, y): return all rows from x with a match in y


Example: Using anti_join() and semi_join()

Imagine we have the following tables x and y:

Table `x`
key val_x
1 x1
2 x2
2 x3
3 x5
4 x6
5 x7
6 x8
Table `y`
key val_y
1 y1
1 y2
2 y3
3 y4
4 y5
9 y6

Semi join returns all rows from x where there are matching values in y: semi_join(x, y, by = "key")

Semi join
key val_x
1 x1
2 x2
2 x3
3 x5
4 x6

Anti join returns all rows from x where there are no matching values in y: anti_join(x, y, by = "key")

Anti join
key val_x
5 x7
6 x8

We can see which rows from x have or do not have a match in y from the left join: left_join(x, y, by = "key")

Left join
key val_x val_y
1 x1 y1
1 x1 y2
2 x2 y3
2 x3 y3
3 x5 y4
4 x6 y5
5 x7 NA
6 x8 NA

Note that the row from x where key == 1 and val_x == x1 appears twice in the left join because it has a one-to-many relationship with y. However, this row is not duplicated in the semi join.
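To verify this behavior, the tables can be rebuilt from the values shown above (a sketch):

```r
library(dplyr)
library(tibble)

x <- tibble(key   = c(1, 2, 2, 3, 4, 5, 6),
            val_x = c("x1", "x2", "x3", "x5", "x6", "x7", "x8"))
y <- tibble(key   = c(1, 1, 2, 3, 4, 9),
            val_y = c("y1", "y2", "y3", "y4", "y5", "y6"))

semi_join(x, y, by = "key")  # 5 rows (x1 appears once, despite two matches in y)
anti_join(x, y, by = "key")  # 2 rows (x7 and x8 have no match)
```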


Example: Using anti_join() to diagnose mismatches in mutating joins

A primary use of filtering joins is as an investigative tool to diagnose problems with mutating joins. We can use anti_join() to investigate the rows that did not have a match during the join.

For example, consider an inner join between nls_tran, which contains info on postsecondary transcripts, and nls_term, which contains term level info for each transcript:

# id and transnum uniquely identify obs in nls_tran
nls_tran %>% group_by(id, transnum) %>% summarise(n_per_key=n()) %>% ungroup() %>% count(n_per_key)
#> # A tibble: 1 × 2
#>   n_per_key     n
#>       <int> <int>
#> 1         1 24253
# There are 24253 unique student transcripts (rows) in nls_tran
nls_tran %>% nrow()
#> [1] 24253
# inner_join
nls_tran %>% inner_join(nls_term, by = c("id", "transnum"))
#> # A tibble: 120,807 × 17
#>       id transnum findisp     trnsflag terms  fice state cofcon  instype itype  
#>    <dbl>    <dbl> <dbl+lbl>   <dbl+lb> <dbl> <dbl> <dbl> <dbl+l> <dbl+l> <dbl+l>
#>  1    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  2    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  3    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  4    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  5    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  6    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  7    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  8    67        1 1 [1. TRAN… 1 [1. R…     8  5694     1 2 [2. … 1 [1. … 5 [5. …
#>  9   315        1 1 [1. TRAN… 1 [1. R…    20  1009     1 6 [6. … 3 [3. … 1 [1. …
#> 10   315        1 1 [1. TRAN… 1 [1. R…    20  1009     1 6 [6. … 3 [3. … 1 [1. …
#> # ℹ 120,797 more rows
#> # ℹ 7 more variables: termnum <dbl>, courses <dbl>, termtype <dbl+lbl>,
#> #   season <dbl+lbl>, sortdate <dbl>, gradcode <dbl+lbl>, transfer <dbl+lbl>
# figure out number of obs from nls_tran that merged w/ nls_term
nls_tran %>% inner_join(nls_term, by = c("id", "transnum")) %>% 
  group_by(id,transnum) %>% summarize() %>% # keep one obs per combination of id,transnum
  nrow() # number of obs from nls_tran that merged with nls_term by id,transnum
#> [1] 18855


As seen, 18855 of the 24253 student transcripts (rows) in nls_tran have a match in nls_term, leaving 5398 student transcripts that did not match any term level info. We can use anti_join() to investigate these unmatched rows and see why these transcripts have no term level data.

# Create object of observations that didn't merge
tran_term_anti <- nls_tran %>% anti_join(nls_term, by = c("id", "transnum"))

# There are 5398 unmatched student transcripts from nls_tran
nrow(tran_term_anti)
#> [1] 5398
# We can investigate the findisp variable of the unmatched rows
attributes(tran_term_anti$findisp)
#> $label
#> [1] "FINAL DISPOSITION"
#> 
#> $format.stata
#> [1] "%27.0g"
#> 
#> $labels
#>      1. TRANSCRIPT RECEIVED           2. SCHOOL REFUSED 
#>                           1                           2 
#>   3. STUDENT NEVER ATTENDED 4. SCHOOL LOST OR DESTROYED 
#>                           3                           4 
#>            5. SCHOOL CLOSED  6. NO RESPONSE FROM SCHOOL 
#>                           5                           6 
#> 
#> $class
#> [1] "haven_labelled" "vctrs_vctr"     "double"
# None of the unmatched rows has "TRANSCRIPT RECEIVED" status, which could explain why they had no match
tran_term_anti %>% select(findisp) %>% count(findisp) %>% as_factor()
#> # A tibble: 5 × 2
#>   findisp                         n
#>   <fct>                       <int>
#> 1 2. SCHOOL REFUSED             272
#> 2 3. STUDENT NEVER ATTENDED    2565
#> 3 4. SCHOOL LOST OR DESTROYED  1113
#> 4 5. SCHOOL CLOSED              210
#> 5 6. NO RESPONSE FROM SCHOOL   1238

5 Appending data

Appending data involves “stacking” multiple datasets (which typically have the same variables) on top of one another.


The bind_rows() function:

?bind_rows

# SYNTAX AND DEFAULT VALUES
bind_rows(..., .id = NULL)
  • Function: Bind rows of multiple dataframes into one
  • Arguments:
    • ...: Dataframes to bind
    • .id: Optional name of an identifier column; when supplied, bind_rows() creates a column with this name that links each row to its original dataframe


Example: Using bind_rows() to bind time1 and time2

The most common practical use of stacking is creating a “longitudinal dataset” when input data are released separately for each time period. Longitudinal data have one row per time period for each person/place/observation.

Imagine we have the following tables time1 and time2:

Table `time1`
id year income
1 2017 50
2 2017 100
3 2017 200
Table `time2`
id year income
1 2018 70
2 2018 120
3 2018 220


Use bind_rows() to stack the dataframes on top of one another:

bind_rows(time1, time2)
id year income
1 2017 50
2 2017 100
3 2017 200
1 2018 70
2 2018 120
3 2018 220
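If the inputs lacked a year column, bind_rows()'s .id argument could tag each row with the dataframe it came from instead. A sketch using made-up versions of time1 and time2 without the year variable:

```r
library(dplyr)
library(tibble)

time1 <- tibble(id = 1:3, income = c(50, 100, 200))
time2 <- tibble(id = 1:3, income = c(70, 120, 220))

# Naming the inputs and supplying .id adds a "wave" column recording the source
bind_rows(time1 = time1, time2 = time2, .id = "wave")
```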


Example: Creating longitudinal dataset from IPEDS data

IPEDS collects annual survey data from colleges/universities. We will create a longitudinal dataset about university characteristics by appending/stacking annual data from the 2014-15, 2015-16, and 2016-17 academic years.

# Load 2014-15 IPEDS data
admit14_15 <- read_dta("https://github.com/anyone-can-cook/rclass1/raw/master/data/ipeds/ic/ic14_15_admit.dta") %>%
  select(unitid, endyear, sector, contains("numapply"), contains("numadmit"))

# Load 2015-16 IPEDS data
admit15_16 <- read_dta("https://github.com/anyone-can-cook/rclass1/raw/master/data/ipeds/ic/ic15_16_admit.dta") %>%
  select(unitid, endyear, sector, contains("numapply"), contains("numadmit"))

# Load 2016-17 IPEDS data
admit16_17 <- read_dta("https://github.com/anyone-can-cook/rclass1/raw/master/data/ipeds/ic/ic16_17_admit.dta") %>% select(unitid, endyear, sector, contains("numapply"), contains("numadmit"))
Table `admit14_15`
unitid endyear sector numapplymen numapplywom numapplytot numadmitmen numadmitwom numadmittot
100654 2015 1 3852 6049 9901 1977 3227 5204
100663 2015 1 2293 3417 5710 2004 2889 4893
100706 2015 1 1165 939 2104 981 745 1726
Table `admit15_16`
unitid endyear sector numapplymen numapplywom numapplytot numadmitmen numadmitwom numadmittot
100654 2016 1 3030 4871 7901 1999 3167 5166
100663 2016 1 2686 4986 7672 1710 2926 4636
100706 2016 1 1801 1507 3308 1509 1177 2686
Table `admit16_17`
unitid endyear sector numapplymen numapplywom numapplytot numadmitmen numadmitwom numadmittot
100654 2017 1 2725 4318 7043 2276 3878 6154
100663 2017 1 3510 5949 9459 2062 3437 5499
100706 2017 1 2385 2160 4545 1930 1537 3467


Use bind_rows() to stack the dataframes on top of one another:

admit_append <- bind_rows(admit14_15, admit15_16, admit16_17)

# The stacked dataframe has 9 columns and 6514 rows (sum of rows from the 3 input dataframes)
dim(admit_append)
#> [1] 6514    9
admit_append %>% arrange(unitid, endyear) %>% head(n = 9)
unitid endyear sector numapplymen numapplywom numapplytot numadmitmen numadmitwom numadmittot
100654 2015 1 3852 6049 9901 1977 3227 5204
100654 2016 1 3030 4871 7901 1999 3167 5166
100654 2017 1 2725 4318 7043 2276 3878 6154
100663 2015 1 2293 3417 5710 2004 2889 4893
100663 2016 1 2686 4986 7672 1710 2926 4636
100663 2017 1 3510 5949 9459 2062 3437 5499
100706 2015 1 1165 939 2104 981 745 1726
100706 2016 1 1801 1507 3308 1509 1177 2686
100706 2017 1 2385 2160 4545 1930 1537 3467


Example: Appending datasets where not all columns match

If we append datasets whose columns do not all match, bind_rows() fills NA in each column for the rows that come from a dataset missing that column.

# Remove some variables from 2014-15 dataframe
admit14_15 <- admit14_15 %>% select(-contains("numadmit"))

# Remove some variables from 2015-16 dataframe
admit15_16 <- admit15_16 %>% select(-contains("numapply"))
Table `admit14_15`
unitid endyear sector numapplymen numapplywom numapplytot
100654 2015 1 3852 6049 9901
100663 2015 1 2293 3417 5710
100706 2015 1 1165 939 2104
Table `admit15_16`
unitid endyear sector numadmitmen numadmitwom numadmittot
100654 2016 1 1999 3167 5166
100663 2016 1 1710 2926 4636
100706 2016 1 1509 1177 2686
Table `admit16_17`
unitid endyear sector numapplymen numapplywom numapplytot numadmitmen numadmitwom numadmittot
100654 2017 1 2725 4318 7043 2276 3878 6154
100663 2017 1 3510 5949 9459 2062 3437 5499
100706 2017 1 2385 2160 4545 1930 1537 3467


Use bind_rows() to stack the dataframes on top of one another:

admit_append <- bind_rows(admit14_15, admit15_16, admit16_17)

# NA's are filled for missing values after the datasets are appended together
admit_append %>% arrange(unitid, endyear) %>% head(n = 9)
unitid endyear sector numapplymen numapplywom numapplytot numadmitmen numadmitwom numadmittot
100654 2015 1 3852 6049 9901 NA NA NA
100654 2016 1 NA NA NA 1999 3167 5166
100654 2017 1 2725 4318 7043 2276 3878 6154
100663 2015 1 2293 3417 5710 NA NA NA
100663 2016 1 NA NA NA 1710 2926 4636
100663 2017 1 3510 5949 9459 2062 3437 5499
100706 2015 1 1165 939 2104 NA NA NA
100706 2016 1 NA NA NA 1509 1177 2686
100706 2017 1 2385 2160 4545 1930 1537 3467

6 Appendix

6.1 Troubleshooting join problems


Overcoming join problems before they arise
  1. Start by investigating the data structure of tables you are going to merge
    • Identify the primary key in each table
      • This investigation should be based on your understanding of the data and reading data documentation rather than checking if each combination of variables is a primary key
    • Does either table have missing or strange values (e.g., -8) for the primary key? If so, these observations won’t match
  2. Before joining, make sure that the key you will use for joining uniquely identifies observations in at least one of the datasets and that the key variable(s) are present in both datasets
    • Investigate whether key variables have different names across the two tables. If different, then you will have to adjust syntax of your join statement accordingly.
  3. Before joining, you also want to make sure the key variable in one table and key variable in another table are the same type (both numeric or both string, etc.)
    • You could use the typeof() function or the str() function
    • Change type with this command x$key <- as.double(x$key) or x$key <- as.character(x$key)
    • If you try to join with key variables of different types, you will get an error message like: Can’t join on ‘key’ x ‘key’ because of incompatible types (character / numeric)
  4. Think about which observations you want retained after joining
    • Think about which dataset should be the x table and which should be the y table
    • Think about whether you want an inner, left, right, or full join
  5. Since mutating joins keep all variables in x and y, you may want to keep only specific variables in x and/or y as a prior step to joining
    • Make sure that non-key variables from tables have different names; if duplicate names exist, the default is to add .x and .y to the end of the variable name. For example, if you have two tables with non-key variables with the name type and you join them, you will end up with variables (columns) type.x and type.y.
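Steps 2 and 3 above can be sketched as a quick pre-join check, using hypothetical tables whose keys are stored as different types:

```r
library(dplyr)
library(tibble)

x <- tibble(key = c("1", "2"), val_x = c("a", "b"))  # key stored as character
y <- tibble(key = c(1, 2),     val_y = c("c", "d"))  # key stored as double

typeof(x$key)  # "character"
typeof(y$key)  # "double"

# Align the types first; joining as-is errors with "incompatible types"
x$key <- as.double(x$key)
inner_join(x, y, by = "key")
```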


Overcoming join problems when they do arise
  • Identify which observations don’t match
    • anti_join() is your friend here
  • Investigate the reasons that observations don’t match
    • Investigating joins is a craft that takes practice to get good at; this is essentially an exercise in exploratory data analysis in the service of data quality
    • First, you have to care about data quality
    • Identifying causes for non-matches usually involves consulting data documentation for both tables and performing basic descriptive statistics (e.g., frequency tables) on specific variables that documentation suggests may be relevant for whether observations match or not

6.2 How to think about different types of joins and perform one

We noticed that many students were having trouble with understanding the difference between the different types of joins and how to use them. We hope that this section can help clarify some of the frequently asked questions.

6.2.1 Types of joins

Oftentimes, we have a primary table that we are interested in - for example, a table of recruiting events where each observation is an event held at a specific location. In order to understand more about each event, we can join other data sources to it, such as zip code level data obtained from the Census.

Source: R for Data Science by Wickham

The main difference between the types of joins is which rows they keep from each of the 2 tables you are joining (i.e., what rows you see in the end result of the join). For example, if our recruiting events data is the x (left) table and zip code data is the y (right) table:

  • Inner join: The result only includes events at a zip code that has Census data available and Census zip codes where at least 1 event occurred
    • Any event at a zip code that does not have Census data and any Census zip code where no events occurred will be dropped
  • Left join: The result includes all recruiting events in our original table, whether or not Census data for the event’s zip code is available or not
    • Any Census zip code where no events occurred will be dropped
  • Right join: The result includes all zip codes in the Census data, whether or not any events occurred there
    • Any event at a zip code that does not have Census data will be dropped
  • Full join: The result includes all recruiting events in our original table and all zip codes in the Census data
    • No observations in either of the original tables will be dropped

When deciding which type of join to use, we need to consider what our focus is. Here, we want to analyze the recruiting events, so we would want the end result of our join to include all events in our original events table, where each event has zip code data attached when available. According to the bulleted list above, that would be a left join using the recruiting events data as the x (left) table and zip code data as the y (right) table.

What about an inner join? You may wonder, if we have an event at a zip code that does not have any zip code level data, what is the point of keeping that observation when we can’t conduct any analysis on zip code characteristics? The answer may depend on your goal for the join. But oftentimes, we are joining more than 2 tables. For example, we may want to join zip code level data, HS level data, and university level data to each event. Not all events will have a match in each of the 3 data sources, but left joins will allow us to keep all events in the end result regardless. This is useful because the recruiting events are our main focus, and we can then choose whichever type of analysis we want to perform (zip code, HS, or univ-level).

6.2.2 Deciding on x and y table in a join

As seen in the Venn diagram above, each join function takes 2 dataframes as arguments - we refer to the 1st argument x as the left table and the 2nd argument y as the right table. Which table we choose as x and which as y only matters when we are performing a left or right join, since inner and full joins are symmetrical.

In our previous example, if we use the recruiting events table for x and the zip code table for y, then a left join keeps all observations in the recruiting events table and drops all Census zip codes where no events occurred. But if we were to use the zip code table for x and the recruiting events table for y, then it would be a right join that keeps all recruiting events and drops any Census zip code where no events occurred.

Given the above 2 options for getting the same result, performing a left join is more conventional and can be easier to visualize. For example, when performing a series of left joins to the recruiting events table (e.g., left join zip code level data to each event, left join HS level data to each event that occurred at a HS, left join university level data to each event that occurred at a university), we can think of it as continually chaining tables onto the right of the events table.