1 Introduction

1.1 Libraries and datasets we will use

Load packages:

library(tidyverse)
library(haven)
library(labelled)


Load datasets:

# NCES Digest Table 204.10
load(url('https://github.com/anyone-can-cook/rclass1/raw/master/data/nces_digest/nces_digest_tbl_204_10.RData'))

# IPEDS EFFY Table
ipeds_table <- read_dta('https://github.com/anyone-can-cook/rclass1/raw/master/data/ipeds/effy/ey15-16_hc.dta', encoding = NULL)

1.2 Lecture overview

Creating analysis datasets often require changing the organizational structure of data. For example:

  • You want analysis dataset to have one obs per student, but your data has one obs per student-course
  • You want analysis dataset to have one obs per institution, but enrollment data has one obs per institution-enrollment level

Two common ways to change organizational structure of data:

  1. Use group_by() to perform calculations separately within groups and then use summarise() to create an object with one observation per group. Examples:
    • Creating objects containing summary statistics that are basis for tables and graphs
    • Creating student-transcript level GPA variable from student-transcript-course level data
  2. Reshape your data – called tidying in the R tidyverse world – by transforming columns (variables) into rows (observations) and vice versa
    • This is the focus of this lecture, where we will look at transforming untidy data into tidy data.

Working with tidy data has many benefits:

  • It is a consistent way of storing data
  • R is also well-suited for working with tidy data due to its vectorized nature. There’s also many packages in tidyverse that are designed to work with tidy data, such as tidyr.

Show index and example datasets in tidyr package:

help(package="tidyr")

# Note that example datasets (table1, table2, etc) are listed in the index alongside functions
tidyr::table1

df1 <- table1
str(df1)

table2
table3

2 Data structure vs. data semantics

Before we define “tidy data”, we will spend significant time defining and discussing some core terms/concepts about datasets. This discussion draws from the 2014 article Tidy Data by Hadley Wickham.

Wickham (2014) distinguishes between data structure (layout) and data semantics (concepts).

2.1 Dataset structure

Dataset structure refers to the “physical layout” of a dataset.

  • Typically, datasets are “rectangular tables made up of rows and columns
  • A cell is the intersection of one column and one row (think cells in Microsoft Excel)

There are many alternative data structures to present the same underlying data.


Example: 2 different ways to structure the same data (rows and columns are transposed)

Structure A: treatment as columns, names as rows
name treatment_a treatment_b
John Smith NA 2
Jane Doe 16 11
Mary Johnson 3 1
Structure B: treatment as rows, names as columns
treatment John Smith Jane Doe Mary Johnson
treatment_a NA 16 3
treatment_b 2 11 1

2.2 Dataset semantics

Data semantics refer to the underlying meaning of the data being displayed, or how we think of the data conceptually.

The difference between data structure and data semantics:

  • Data structure refers to the physical layout of the data (e.g., what are the rows and columns)
  • Data semantics – which were introduced by Wickham (2014) – refer to the meaning of the data itself


Example: Describing a dataset

Looking back at the previous example, we can see that although the data structure is different, the tables represent the same underlying data: Each person can partake in any of multiple treatments and can have a result from each treatment.

In the next section, we’ll define some terms to make it easier to describe the semantics (meaning) of the data displayed in the tables.

2.2.1 Values, variables, and observations

“A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organized in two ways. Every value belongs to a variable and an observation.”

Source: Wickham (2014, p. 3)


Terminology:

  • Value: A single element within some data structure (e.g., vector, list)
  • Variables: “A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units”
  • Observations: “An observation contains all values measured on the same unit (like a person or a day) across attributes”


Example: Describing data semantics [revisited]

  • The experimental design data has 3 variables:
    • person, with 3 possible values (John Smith, Mary Johnson, Jane Doe)
    • treatment, with 2 possible values (treatment_a, treatment_b)
    • result, with 6 possible values (one of which is a missing value)
  • Since measurements were taken for each person for each experiment, there are 6 observations:
    • 3 people x 2 treatments = 6 observations
    • For each observation, the same attribute (result) was measured
  • There is a total of 18 values:
    • 3 variables x 6 observations = 18 values

2.3 Unit of analysis

unit of analysis [my term, not Wickham’s]:

  • What each row represents in a dataset (referring to physical layout of dataset).

Examples of different units of analysis:

  • if each row represents a student, you have student level data
  • if each row represents a student-course, you have student-course level data
  • if each row represents an organization-year, you have organization-year level data

Questions:

  • What does each row represent in the data frame object structure_a?
structure_a
#> # A tibble: 3 x 3
#>   name         treatment_a treatment_b
#>   <chr>              <dbl>       <dbl>
#> 1 John Smith            NA           2
#> 2 Jane Doe              16          11
#> 3 Mary Johnson           3           1
  • What does each row represent in the data frame object structure_b?
structure_b
#> # A tibble: 2 x 4
#>   treatment   `John Smith` `Jane Doe` `Mary Johnson`
#>   <chr>              <dbl>      <dbl>          <dbl>
#> 1 treatment_a           NA         16              3
#> 2 treatment_b            2         11              1
  • What does each row represent in the data frame object ipeds_hc_temp?
    • Below we load data on 12-month enrollment headcount for 2015-16 academic year from the Integrated Postsecondary Education Data System (IPEDS)
ipeds_hc_temp <- read_dta(file="https://github.com/ozanj/rclass/raw/master/data/ipeds/effy/ey15-16_hc.dta", encoding=NULL) %>% 
  select(unitid,lstudy,efytotlt) %>% arrange(unitid,lstudy)

ipeds_hc_temp
#> # A tibble: 15,726 x 3
#>    unitid                lstudy efytotlt
#>     <dbl>             <dbl+lbl>    <dbl>
#>  1 100654   1 [Undergraduate]       4865
#>  2 100654   3 [Graduate]            1292
#>  3 100654 999 [Generated total]     6157
#>  4 100663   1 [Undergraduate]      13440
#>  5 100663   3 [Graduate]            8114
#>  6 100663 999 [Generated total]    21554
#>  7 100690   1 [Undergraduate]        415
#>  8 100690   3 [Graduate]             415
#>  9 100690 999 [Generated total]      830
#> 10 100706   1 [Undergraduate]       6994
#> # … with 15,716 more rows
#show variable labels and value labels
#ipeds_hc_temp %>% var_label()
#ipeds_hc_temp %>% val_labels()

#print a few obs, with value labels rather than variable values
ipeds_hc_temp %>% head(n=10) %>% as_factor()
#> # A tibble: 10 x 3
#>    unitid lstudy          efytotlt
#>     <dbl> <fct>              <dbl>
#>  1 100654 Undergraduate       4865
#>  2 100654 Graduate            1292
#>  3 100654 Generated total     6157
#>  4 100663 Undergraduate      13440
#>  5 100663 Graduate            8114
#>  6 100663 Generated total    21554
#>  7 100690 Undergraduate        415
#>  8 100690 Graduate             415
#>  9 100690 Generated total      830
#> 10 100706 Undergraduate       6994

2.3.1 Which variable(s) uniquely identify rows in a data frame

Identifying which combination of variables uniquely identifies rows in a data frame helps you identify the “unit of analysis” and understand the “structure” of your dataset

  • Said differently: for each value of this variable (or combination of variables), there is only one row
  • Very important for reshaping/tidying data (this week) and very important for joining/merging data frames (next week)
  • Sometimes a codebook will explicitly tell you which vars uniquely identify rows
  • Sometimes you have to figure this out through investigation
    • focus on ID variables and categorical variables that identify what each row represents; not continuous numeric variables like total_enrollment or family_income

Task: Let’s try to identify the variable(s) that uniquely identify rows in ipeds_hc_temp

  • Multiple ways of doing this
  • I’ll give you some code for one approach; just try to follow along and understand
names(ipeds_hc_temp)
#> [1] "unitid"   "lstudy"   "efytotlt"
ipeds_hc_temp %>% head(n=10)
#> # A tibble: 10 x 3
#>    unitid                lstudy efytotlt
#>     <dbl>             <dbl+lbl>    <dbl>
#>  1 100654   1 [Undergraduate]       4865
#>  2 100654   3 [Graduate]            1292
#>  3 100654 999 [Generated total]     6157
#>  4 100663   1 [Undergraduate]      13440
#>  5 100663   3 [Graduate]            8114
#>  6 100663 999 [Generated total]    21554
#>  7 100690   1 [Undergraduate]        415
#>  8 100690   3 [Graduate]             415
#>  9 100690 999 [Generated total]      830
#> 10 100706   1 [Undergraduate]       6994


First, Let’s investigate whether the ID variable unitid uniquely identifies rows in data frame ipeds_hc_temp

  • I’ll annotate code for each step
ipeds_hc_temp %>% # start with data frame object ipeds_hc_temp
  group_by(unitid) %>% # group by unitid
  summarise(n_per_group=n()) %>% # create measure of number of obs per group
  ungroup %>% # ungroup (otherwise frequency table [next step] created) separately for each group (i.e., separate frequency table for each value of unitid)
  count(n_per_group) # frequency of number of observations per group
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#>   n_per_group     n
#>         <int> <int>
#> 1           2  5127
#> 2           3  1824

What does above output tell us?

  • There are 5,127 values of unitid that have 2 rows for that value of unitid
  • There are 1,824 values of unitid that have 3 rows for that value of unitid
  • Note: 2*5127+3*1824== 15,726 which is the number of observations in ipeds_hc_temp
  • Conclusion: the variable unitid does not uniquely identify rows in the data frame ipeds_hc_temp


Second, Let’s investigate whether the comination of unitid and lstudy uniquely identifies rows in data frame ipeds_hc_temp

ipeds_hc_temp %>% # start with data frame object ipeds_hc_temp
  group_by(unitid,lstudy) %>% # group by unitid and lstudy
  summarise(n_per_group=n()) %>% # create measure of number of obs per group
  ungroup %>% # ungroup (otherwise frequency table [next step] created) separately for each group (i.e., separate frequency table for each value of unitid)
  count(n_per_group) # frequency of number of observations per group
#> `summarise()` regrouping output by 'unitid' (override with `.groups` argument)
#> # A tibble: 1 x 2
#>   n_per_group     n
#>         <int> <int>
#> 1           1 15726

What does above output tell us?

  • There is 1 row each unique combination of unitid and lstudy
  • Conclusion: the variables unitid and lstudy uniquely identify rows in the data frame ipeds_hc_temp

3 Tidy vs. untidy data

3.1 Defining tidy data

Tidy data is a standard way of mapping the meaning of a dataset to its structure.”

Source: Wickham (2014, p. 4)


Tidy data always follow these 3 interrelated rules:

  1. Each variable must have its own column
  2. Each observation must have its own row
  3. Each value must have its own cell