library(tidyverse)
library(haven)
library(labelled)
Tidy data
1 Introduction
Load packages:
If package not yet installed, then must install before you load. Install in “console” rather than .qmd file:
- Generic syntax:
install.packages("package_name")
- Install “tidyverse”:
install.packages("tidyverse")
Note: When we load package, name of package is not in quotes; but when we install package, name of package is in quotes:
install.packages("haven")
library(haven)
1.1 Datasets we will use
Load datasets:
We will use public-use data from the National Center for Education Statistics (NCES) Educational Longitudinal Survey (ELS) of 2002:
- Recall from the lecture on ggplot, this data follows 10th graders from 2002 until 2012
- We will be working with the NCES Digest Table 204.10, which represents the number and percentage of public school students eligible for free or reduced lunch by state for the years 2000, 2010, 2011, and 2012.
# NCES Digest Table 204.10
load(url('https://github.com/anyone-can-cook/rclass1/raw/master/data/nces_digest/nces_digest_tbl_204_10.RData'))
We will also use data from the Integrated Postsecondary Education Data System (IPEDS), which is collected by the US Department of Education’s National Center for Education Statistics (NCES). IPEDS gathers information from all institutions participating in federal student financial aid programs.
- We are using data on the 12-month unduplicated headcount by race/ethnicity, gender and level of student.
# IPEDS EFFY Table
<- read_dta('https://github.com/anyone-can-cook/rclass1/raw/master/data/ipeds/effy/ey15-16_hc.dta', encoding = NULL) ipeds_table
1.2 Lecture overview
Creating analysis datasets often require changing the organizational structure of data. For example:
- You want analysis dataset to have one observation per student, but your data has one observation per student-course.
- You want analysis dataset to have one observation per institution, but enrollment data has one observation per institution-enrollment level.
Two common ways to change the organizational structure of data:
- Use
group_by()
to perform calculations separately within groups and then usesummarise()
to create an object with one observation per group.- Creating objects containing summary statistics that are basis for tables and graphs
- For example, creating student-transcript level GPA variable from student-transcript-course level data
- Reshape your data– called tidying in the R tidyverse world– by transforming columns (variables) into rows (observations) and vice versa
- This is the focus of this lecture, where we will look at transforming untidy data into tidy data.
Working with tidy data has many benefits:
- It is a consistent way of storing data
- R is also well-suited for working with tidy data due to its vectorized nature – that is, each variable is an atomic vector, where each element is an observation for that variable. There’s also many packages in
tidyverse
that are designed to work with tidy data, such astidyr
.
Show index and example datasets in tidyr
package:
help(package="tidyr")
# Note that example datasets (table1, table2, etc) are listed in the index alongside functions
::table1
tidyr
<- table1
df1 str(df1)
table2 table3
2 Data structure vs. data semantics
Before we define “tidy data”, we will spend significant time defining and discussing some core terms/concepts about datasets. This discussion draws from the 2014 article Tidy Data by Hadley Wickham.
Wickham (2014) distinguishes between data structure (layout) and data semantics (concepts).
2.1 Dataset structure
Dataset structure refers to the “physical layout” of a dataset.
- Typically, datasets are “rectangular tables made up of rows and columns”
- A cell is the intersection of one column and one row (think cells in Microsoft Excel or Google sheets)
There are many alternative data structures to present the same underlying data.
Example: 2 different ways to structure the same data (rows and columns are transposed)
name | treatment_a | treatment_b |
---|---|---|
John Smith | NA | 2 |
Jane Doe | 16 | 11 |
Mary Johnson | 3 | 1 |
treatment | John Smith | Jane Doe | Mary Johnson |
---|---|---|---|
treatment_a | NA | 16 | 3 |
treatment_b | 2 | 11 | 1 |
2.2 Dataset semantics
Data semantics refer to the underlying meaning of the data being displayed, or how we think of the data conceptually.
The difference between data structure and data semantics:
- Data structure refers to the physical layout of the data (e.g., what are the rows and columns)
- Data semantics – which were introduced by Wickham (2014) – refer to the meaning of the data itself
Example: Describing a dataset
Looking back at the previous example, we can see that although the data structure is different, the tables represent the same underlying data: Each person can partake in any of multiple treatments and can have a result from each treatment.
In the next section, we’ll define some terms to make it easier to describe the semantics (meaning) of the data displayed in the tables.
2.2.1 Values, variables, and observations
“A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organized in two ways. Every value belongs to a variable and an observation.”
Source: Wickham (2014, p. 3)
Terminology:
- Value: A single element within some data structure (e.g., vector, list)
- Variables: “A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units”
- Observations: “An observation contains all values measured on the same unit (e.g., corporation-year) across attributes (variables)”
- e.g., unit is corporation-year and one observation contains values of revenue and expense variables for Google for the 2021 fiscal year
Example: Describing data semantics [revisited]
- The experimental design data has 3 variables:
person
, with 3 possible values (John Smith, Mary Johnson, Jane Doe)treatment
, with 2 possible values (treatment_a, treatment_b)result
, with 6 possible values (one of which is a missing value)
- Since measurements were taken for each
person
for eachexperiment
, there are 6 observations:- 3 people x 2 treatments = 6 observations
- For each observation, the same attribute (result) was measured
- There is a total of 18 values:
- 3 variables x 6 observations = 18 values
2.3 Unit of analysis
unit of analysis [my term, not Wickham’s]:
- What each row represents in a dataset (referring to the physical layout of the dataset).
Examples of different units of analysis:
- if each row represents a student, you have student-level data
- if each row represents a student-course, you have student-course level data
- if each row represents an organization-year, you have organization-year level data
Questions:
- What does each row represent in the data frame object
structure_a
?
structure_a#> # A tibble: 3 × 3
#> name treatment_a treatment_b
#> <chr> <dbl> <dbl>
#> 1 John Smith NA 2
#> 2 Jane Doe 16 11
#> 3 Mary Johnson 3 1
- What does each row represent in the data frame object
structure_b
?
structure_b#> # A tibble: 2 × 4
#> treatment `John Smith` `Jane Doe` `Mary Johnson`
#> <chr> <dbl> <dbl> <dbl>
#> 1 treatment_a NA 16 3
#> 2 treatment_b 2 11 1
What does each row represent in the data frame object
ipeds_hc_temp
?- Below we load data on 12-month enrollment headcount for 2015-16 academic year from the Integrated Postsecondary Education Data System (IPEDS)
<- read_dta(file="https://github.com/ozanj/rclass/raw/master/data/ipeds/effy/ey15-16_hc.dta", encoding=NULL) %>%
ipeds_hc_temp select(unitid,lstudy,efytotlt) %>% arrange(unitid,lstudy)
ipeds_hc_temp#> # A tibble: 15,726 × 3
#> unitid lstudy efytotlt
#> <dbl> <dbl+lbl> <dbl>
#> 1 100654 1 [Undergraduate] 4865
#> 2 100654 3 [Graduate] 1292
#> 3 100654 999 [Generated total] 6157
#> 4 100663 1 [Undergraduate] 13440
#> 5 100663 3 [Graduate] 8114
#> 6 100663 999 [Generated total] 21554
#> 7 100690 1 [Undergraduate] 415
#> 8 100690 3 [Graduate] 415
#> 9 100690 999 [Generated total] 830
#> 10 100706 1 [Undergraduate] 6994
#> # ℹ 15,716 more rows
#show variable labels and value labels
#ipeds_hc_temp %>% var_label()
#ipeds_hc_temp %>% val_labels()
#print a few obs, with value labels rather than variable values
%>% head(n=10) %>% as_factor()
ipeds_hc_temp #> # A tibble: 10 × 3
#> unitid lstudy efytotlt
#> <dbl> <fct> <dbl>
#> 1 100654 Undergraduate 4865
#> 2 100654 Graduate 1292
#> 3 100654 Generated total 6157
#> 4 100663 Undergraduate 13440
#> 5 100663 Graduate 8114
#> 6 100663 Generated total 21554
#> 7 100690 Undergraduate 415
#> 8 100690 Graduate 415
#> 9 100690 Generated total 830
#> 10 100706 Undergraduate 6994
2.3.1 Which variable(s) uniquely identify rows in a data frame
Identifying which combination of variables uniquely identifies rows in a data frame helps you identify the “unit of analysis” and understand the “structure” of your dataset
- Said differently: for each value of this variable (or combination of variables), there is only one row
- Very important for reshaping/tidying data (this week) and very important for joining/merging data frames (next week)
- Sometimes a codebook will explicitly tell you which vars uniquely identify rows
- Sometimes you have to figure this out through investigation
- focus on ID variables and categorical variables that identify what each row represents; not continuous numeric variables like
total_enrollment
orfamily_income
- focus on ID variables and categorical variables that identify what each row represents; not continuous numeric variables like
Task: Let’s try to identify the variable(s) that uniquely identify rows in ipeds_hc_temp
- Multiple ways of doing this
- I’ll give you some code for one approach; just try to follow along and understand
names(ipeds_hc_temp)
#> [1] "unitid" "lstudy" "efytotlt"
%>% head(n=10)
ipeds_hc_temp #> # A tibble: 10 × 3
#> unitid lstudy efytotlt
#> <dbl> <dbl+lbl> <dbl>
#> 1 100654 1 [Undergraduate] 4865
#> 2 100654 3 [Graduate] 1292
#> 3 100654 999 [Generated total] 6157
#> 4 100663 1 [Undergraduate] 13440
#> 5 100663 3 [Graduate] 8114
#> 6 100663 999 [Generated total] 21554
#> 7 100690 1 [Undergraduate] 415
#> 8 100690 3 [Graduate] 415
#> 9 100690 999 [Generated total] 830
#> 10 100706 1 [Undergraduate] 6994
First, Let’s investigate whether the ID variable unitid
uniquely identifies rows in data frame ipeds_hc_temp
- I’ll annotate code for each step
%>% # start with data frame object ipeds_hc_temp
ipeds_hc_temp group_by(unitid) %>% # group by unitid
summarise(n_per_group=n()) %>% # create measure of number of obs per group
%>% # ungroup (otherwise frequency table [next step] created) separately for each group (i.e., separate frequency table for each value of unitid)
ungroup count(n_per_group) # frequency of number of observations per group
#> # A tibble: 2 × 2
#> n_per_group n
#> <int> <int>
#> 1 2 5127
#> 2 3 1824
What does above output tell us?
- There are 5,127 values of
unitid
that have 2 rows for that value ofunitid
- There are 1,824 values of
unitid
that have 3 rows for that value ofunitid
- Note:
2*5127+3*1824==
15,726 which is the number of observations inipeds_hc_temp
- Conclusion: the variable
unitid
does not uniquely identify rows in the data frameipeds_hc_temp
Second, Let’s investigate whether the combination of unitid
and lstudy
uniquely identifies rows in data frame ipeds_hc_temp
.
%>% # start with data frame object ipeds_hc_temp
ipeds_hc_temp group_by(unitid,lstudy) %>% # group by unitid and lstudy
summarise(n_per_group=n()) %>% # create measure of number of obs per group
%>% # ungroup (otherwise frequency table [next step] created) separately for each group (i.e., separate frequency table for each value of unitid)
ungroup count(n_per_group) # frequency of number of observations per group
#> `summarise()` has grouped output by 'unitid'. You can override using the
#> `.groups` argument.
#> # A tibble: 1 × 2
#> n_per_group n
#> <int> <int>
#> 1 1 15726
What does above output tell us?
- There is
1
row each unique combination ofunitid
andlstudy
- Conclusion: the variables
unitid
andlstudy
uniquely identify rows in the data frameipeds_hc_temp
3 Tidy vs. untidy data
3.1 Defining tidy data
“Tidy data is a standard way of mapping the meaning of a dataset to its structure.”
Source: Wickham (2014, p. 4)
Tidy data always follow these 3 interrelated rules:
- Each variable must have its own column
- Each observation must have its own row
- Each value must have its own cell
Example: Representing data in tidy form
country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
This table shows the same data from earlier, but now the data is in tidy form because it satisfies the 3 rules:
- All 4 variables (
country
,year
,cases
,population
) have its own column - All 6 observations (combination of
country
-year
) have its own row - All 24 values have its own cell in the table
3.2 Diagnosing untidy data
Untidy data is any data that do not fully follow the 3 rules of tidy data defined previously.
Example: Tidy vs. untidy data
country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
country | year | type | count |
---|---|---|---|
Afghanistan | 1999 | cases | 745 |
Afghanistan | 1999 | population | 19987071 |
Afghanistan | 2000 | cases | 2666 |
Afghanistan | 2000 | population | 20595360 |
Brazil | 1999 | cases | 37737 |
Brazil | 1999 | population | 172006362 |
Brazil | 2000 | cases | 80488 |
Brazil | 2000 | population | 174504898 |
China | 1999 | cases | 212258 |
China | 1999 | population | 1272915272 |
China | 2000 | cases | 213766 |
China | 2000 | population | 1280428583 |
The above tables show the same data, but table1
is in tidy form while table2
is untidy because it does not fully follow the 3 rules. Let’s diagnose the problems with table2
by answering these questions:
- Does each variable have its own column?
- If not, how does the dataset violate this principle?
- What should the variables be?
- Does each observation have its own row?
- If not, how does the dataset violate this principle?
- What does each row actually represent?
- What should each row represent?
- Does each value have its own cell?
- If not, how does the dataset violate this principle?
Solutions
- Does each variable have its own column? No
- If not, how does the dataset violate this principle?
cases
andpopulation
should be two different variables because they are different attributes, but intable2
these two attributes are recorded in the columntype
and the associated value for each type is recorded in the columncount
. - What should the variables be?
country
,year
,cases
,population
- If not, how does the dataset violate this principle?
- Does each observation have its own row? No
- If not, how does the dataset violate this principle? There is one observation for each
country
-year
-type
. But the values of type (cases
,population
) represent attributes of a unit, which should be represented by distinct variables rather than rows. Sotable2
has two rows per observation but it should have one row per observation. - What does each row actually represent?
country
-year
-type
- What should each row represent?
country
-year
- If not, how does the dataset violate this principle? There is one observation for each
- Does each value have its own cell? Yes
- If not, how does the dataset violate this principle?
3.3 Common types of untidy data
Tidy data can only have one organizational structure, while untidy data can come in various different forms. It is important to identify the most common types of untidy data, so that we can develop solutions for each.
Below are some of the common problems of untidy data.
3.3.1 Column headers are values, not variable names
country | 1999 | 2000 |
---|---|---|
Afghanistan | 19987071 | 20595360 |
Brazil | 172006362 | 174504898 |
China | 1272915272 | 1280428583 |
Here, 1999
and 2000
are not names of variables, but values of a variable (i.e., year
). This form results in:
- A single variable spreading over multiple columns (both the last 2 columns contain values of country population)
- A single row containing multiple observations (e.g., population in 1999 Afghanistan and population in 2000 Afghanistan should be different observations)
3.3.2 Multiple variables are stored in one column
country | year | type | count |
---|---|---|---|
Afghanistan | 1999 | cases | 745 |
Afghanistan | 1999 | population | 19987071 |
Afghanistan | 2000 | cases | 2666 |
Afghanistan | 2000 | population | 20595360 |
Brazil | 1999 | cases | 37737 |
Brazil | 1999 | population | 172006362 |
Brazil | 2000 | cases | 80488 |
Brazil | 2000 | population | 174504898 |
China | 1999 | cases | 212258 |
China | 1999 | population | 1272915272 |
China | 2000 | cases | 213766 |
China | 2000 | population | 1280428583 |
cases
and population
are separate variables, but are stored here in the same column. This form results in:
- An observation being scattered across multiple rows (e.g., 1999 Afghanistan data should be a single observation/row)
- The values of a column not sharing the same units (the
count
column contains both number of cases and number of people)
3.3.3 Column contains data from two variables
country | year | rate |
---|---|---|
Afghanistan | 1999 | 745/19987071 |
Afghanistan | 2000 | 2666/20595360 |
Brazil | 1999 | 37737/172006362 |
Brazil | 2000 | 80488/174504898 |
China | 1999 | 212258/1272915272 |
China | 2000 | 213766/1280428583 |
745
and 19987071
in the rate
column belong in separate columns.
3.3.4 Single variable stored in multiple columns
country | century | year | rate |
---|---|---|---|
Afghanistan | 19 | 99 | 745/19987071 |
Afghanistan | 20 | 00 | 2666/20595360 |
Brazil | 19 | 99 | 37737/172006362 |
Brazil | 20 | 00 | 80488/174504898 |
China | 19 | 99 | 212258/1272915272 |
China | 20 | 00 | 213766/1280428583 |
The values in century
and year
should be combined to form a 4-digit year
variable.
4 Tidying untidy data
Approach the task of “tidying” data – or more generally, the task of “reshaping” data – as a two-step process
- Conceptual task of understanding how the data are organized and how the data should be organized (in some order):
- Define these concepts for your dataset
- variables: what variables should the resulting dataset have
- observations: each observation should represent what in the resulting dataset
- How is the dataset currently structured (columns represent what, rows represent what)
- which rules of tidy data are violated; why?
- write out on a piece of scratch paper what the tidy dataset should look like
- Define these concepts for your dataset
- Technical task of writing the code that reshapes/transforms the dataset from untidy to tidy
pivot_longer()
function reshapes data from “wide to long”- e.g., your dataset has separate variables for each year, but you should have one variable “year” that has rows for each value of year
pivot_wider()
function reshapes data from “long to wide”
4.1 Reshaping wide to long: pivot_longer()
Now that we have a better understanding of the differences between tidy and untidy data, let’s practice reshaping our data. In the next two sections we are introducing the pivot_longer()
and pivot_wider()
functions to reshape our data.
The pivot_longer()
function:
?pivot_longer
# SYNTAX AND DEFAULT VALUES
pivot_longer(data, cols, names_to = "name", names_prefix = NULL,
names_sep = NULL, names_pattern = NULL, names_ptypes = list(),
names_repair = "check_unique", values_to = "value",
values_drop_na = FALSE, values_ptypes = list())
- Function: “lengthens” data, increasing the number of rows and decreasing the number of columns
- Arguments (selected):
data
: Dataframe to pivotcols
: Columns to pivot into longer formatnames_to
: Name of the column to create from the data stored in the column names ofdata
values_to
: Name of the column to create from the data stored in cell valuesnames_sep
: Ifnames_to
contains multiple values, these arguments control how the column name is broken up.
Example: Tidying table4b
(reshaping wide to long)
As seen above, the first common reason for untidy data is that some of the column names are not names of variables, but values of a variable (e.g., table4a
, table4b
):
country | 1999 | 2000 |
---|---|---|
Afghanistan | 19987071 | 20595360 |
Brazil | 172006362 | 174504898 |
China | 1272915272 | 1280428583 |
The solution to this problem is to transform the untidy columns (which represent variable values) into rows. Thus, we want to transform table4b
into something that looks like this:
country | year | population |
---|---|---|
Afghanistan | 1999 | 19987071 |
Afghanistan | 2000 | 20595360 |
Brazil | 1999 | 172006362 |
Brazil | 2000 | 174504898 |
China | 1999 | 1272915272 |
China | 2000 | 1280428583 |
This can be achieved using pivot_longer()
:
table4b#> # A tibble: 3 × 3
#> country `1999` `2000`
#> <chr> <dbl> <dbl>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583
%>%
table4b pivot_longer(
cols = c('1999', '2000'), # pivot `1999` and `2000` columns into longer format
names_to = 'year', # name of column that holds the pivoted values
values_to = 'population' # name of column that holds the original cell values
)#> # A tibble: 6 × 3
#> country year population
#> <chr> <chr> <dbl>
#> 1 Afghanistan 1999 19987071
#> 2 Afghanistan 2000 20595360
#> 3 Brazil 1999 172006362
#> 4 Brazil 2000 174504898
#> 5 China 1999 1272915272
#> 6 China 2000 1280428583
Example: Choosing which column not to pivot
Looking at the previous example, we could have also equivalently pivoted the table by specifying which column we want to keep unchanged:
%>%
table4b pivot_longer(
cols = -country, # pivot all columns except `country`
names_to = 'year', # name of column that holds the pivoted values
values_to = 'population' # name of column that holds the original cell values
)#> # A tibble: 6 × 3
#> country year population
#> <chr> <chr> <dbl>
#> 1 Afghanistan 1999 19987071
#> 2 Afghanistan 2000 20595360
#> 3 Brazil 1999 172006362
#> 4 Brazil 2000 174504898
#> 5 China 1999 1272915272
#> 6 China 2000 1280428583
Example: Renaming pivoted columns
Sometimes, the name of the columns we want to pivot contain additional information that we want to remove before transforming to tidy data. Consider the following example where the columns all contain the prefix tot_
:
nces_table1#> # A tibble: 51 × 5
#> state tot_2000 tot_2010 tot_2011 tot_2012
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Alabama 728351 730427 731556 740475
#> 2 Alaska 105333 132104 131166 131483
#> 3 Arizona 877696 1067210 1024454 990378
#> 4 Arkansas 449959 482114 483114 486157
#> 5 California 6050753 6169427 6202862 6178788
#> 6 Colorado 724349 842864 853610 863121
#> 7 Connecticut 562179 552919 543883 549295
#> 8 Delaware 114676 128342 128470 127791
#> 9 District of Columbia 68380 71263 72329 75411
#> 10 Florida 2434755 2641555 2668037 2691881
#> # ℹ 41 more rows
If we pivot the table as is, we can see the tot_
prefixes appearing in the year
column:
%>%
nces_table1 pivot_longer(
cols = -state, # pivot all columns except `state`
names_to = 'year', # name of column that holds the pivoted values
values_to = 'total_students' # name of column that holds the original cell values
)#> # A tibble: 204 × 3
#> state year total_students
#> <chr> <chr> <dbl>
#> 1 Alabama tot_2000 728351
#> 2 Alabama tot_2010 730427
#> 3 Alabama tot_2011 731556
#> 4 Alabama tot_2012 740475
#> 5 Alaska tot_2000 105333
#> 6 Alaska tot_2010 132104
#> 7 Alaska tot_2011 131166
#> 8 Alaska tot_2012 131483
#> 9 Arizona tot_2000 877696
#> 10 Arizona tot_2010 1067210
#> # ℹ 194 more rows
To fix this, we can specify the names_prefix
argument in pivot_longer()
to match and remove the part of the names that we don’t want to keep:
%>%
nces_table1 pivot_longer(
cols = -state,
names_to = 'year',
names_prefix = 'tot_', # rename pivoted values by striping the prefix 'tot_'
values_to = 'total_students'
)#> # A tibble: 204 × 3
#> state year total_students
#> <chr> <chr> <dbl>
#> 1 Alabama 2000 728351
#> 2 Alabama 2010 730427
#> 3 Alabama 2011 731556
#> 4 Alabama 2012 740475
#> 5 Alaska 2000 105333
#> 6 Alaska 2010 132104
#> 7 Alaska 2011 131166
#> 8 Alaska 2012 131483
#> 9 Arizona 2000 877696
#> 10 Arizona 2010 1067210
#> # ℹ 194 more rows
Example: Pivoting columns by name pattern
Consider the following example where the columns we want to reshape belongs to multiple variables. As seen, the 4 columns starting with tot_
shows total student enrollment while the 4 columns starting with frl_
shows number of students on free/reduced lunch:
nces_table2#> # A tibble: 51 × 9
#> state tot_2000 tot_2010 tot_2011 tot_2012 frl_2000 frl_2010 frl_2011 frl_2012
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Alab… 728351 730427 731556 740475 335143 402386 420447 429604
#> 2 Alas… 105333 132104 131166 131483 32468 50701 53238 53082
#> 3 Ariz… 877696 1067210 1024454 990378 274277 482044 511885 514193
#> 4 Arka… 449959 482114 483114 486157 205058 291608 294324 298573
#> 5 Cali… 6050753 6169427 6202862 6178788 2820611 3335885 3353964. 3478407
#> 6 Colo… 724349 842864 853610 863121 195148 336426 348896 358876
#> 7 Conn… 562179 552919 543883 549295 143030 190554 194339 201085
#> 8 Dela… 114676 128342 128470 127791 37766 61564 62774 66413
#> 9 Dist… 68380 71263 72329 75411 47839 52027 45199 46416
#> 10 Flor… 2434755 2641555 2668037 2691881 1079009 1479519 1535670 1576379
#> # ℹ 41 more rows
Applying the same approach we used for nces_table1
doesn’t give us the structure we want
%>%
nces_table2 pivot_longer(
cols = -state,
names_to = 'year',
names_prefix = 'tot_', # rename pivoted values by striping the prefix 'tot_'
values_to = 'total_students'
)#> # A tibble: 408 × 3
#> state year total_students
#> <chr> <chr> <dbl>
#> 1 Alabama 2000 728351
#> 2 Alabama 2010 730427
#> 3 Alabama 2011 731556
#> 4 Alabama 2012 740475
#> 5 Alabama frl_2000 335143
#> 6 Alabama frl_2010 402386
#> 7 Alabama frl_2011 420447
#> 8 Alabama frl_2012 429604
#> 9 Alaska 2000 105333
#> 10 Alaska 2010 132104
#> # ℹ 398 more rows
The goal is to keep enrollment and lunch data in separate columns and only pivot the years part. To do this, we can provide a character vector (instead of the usual string) for names_to
and additionally specify the names_sep
argument.
Here, we copy the full description of the names_to
and names_sep
arguments from the pivot_longer()
helpfile
names_sep
: If names_to contains multiple values, these arguments control how the column name is broken up.names_to
A string specifying the name of the column to create from the data stored in the column names of data.- can be a character vector, creating multiple columns, if names_sep or names_pattern is provided. In this case, there are two special values you can take advantage of:
NA
will discard that component of the name..value
indicates that component of the name defines the name of the column containing the cell values, overriding values_to.
- can be a character vector, creating multiple columns, if names_sep or names_pattern is provided. In this case, there are two special values you can take advantage of:
How we will use these arguments to reshape nces_table2
:
names_sep
specifies the separator to use to separate the column names to 2 parts (e.g., thetot
/frl
measure part and the years part).names_to
Then, we can specify how we want to treat each part inside thenames_to
vector:- Use
.value
to indicate the part that we want to retain as separate columns (this replaces the need for thevalues_to
argument) - Provide a string to indicate the part to pivot to rows, where the string you provide will become the name of that column (like how we’d normally specify
names_to
)
- Use
%>% head(n=5)
nces_table2 #> # A tibble: 5 × 9
#> state tot_2000 tot_2010 tot_2011 tot_2012 frl_2000 frl_2010 frl_2011 frl_2012
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Alaba… 728351 730427 731556 740475 335143 402386 420447 429604
#> 2 Alaska 105333 132104 131166 131483 32468 50701 53238 53082
#> 3 Arizo… 877696 1067210 1024454 990378 274277 482044 511885 514193
#> 4 Arkan… 449959 482114 483114 486157 205058 291608 294324 298573
#> 5 Calif… 6050753 6169427 6202862 6178788 2820611 3335885 3353964. 3478407
%>%
nces_table2 pivot_longer(
-state, # pivot all columns except `state`
names_sep = '_', # use '_' as column name separator
names_to = c('.value', 'year') # keep `tot` & `frl` as separate columns, pivot year values
)#> # A tibble: 204 × 4
#> state year tot frl
#> <chr> <chr> <dbl> <dbl>
#> 1 Alabama 2000 728351 335143
#> 2 Alabama 2010 730427 402386
#> 3 Alabama 2011 731556 420447
#> 4 Alabama 2012 740475 429604
#> 5 Alaska 2000 105333 32468
#> 6 Alaska 2010 132104 50701
#> 7 Alaska 2011 131166 53238
#> 8 Alaska 2012 131483 53082
#> 9 Arizona 2000 877696 274277
#> 10 Arizona 2010 1067210 482044
#> # ℹ 194 more rows
Note the difference if we used .value
as the second element in names_to
:
%>%
nces_table2 pivot_longer(
-state,
names_to = c('measure', '.value'), # pivot `tot`/`frl`, keep years as separate columns
names_sep = '_'
)#> # A tibble: 102 × 6
#> state measure `2000` `2010` `2011` `2012`
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Alabama tot 728351 730427 731556 740475
#> 2 Alabama frl 335143 402386 420447 429604
#> 3 Alaska tot 105333 132104 131166 131483
#> 4 Alaska frl 32468 50701 53238 53082
#> 5 Arizona tot 877696 1067210 1024454 990378
#> 6 Arizona frl 274277 482044 511885 514193
#> 7 Arkansas tot 449959 482114 483114 486157
#> 8 Arkansas frl 205058 291608 294324 298573
#> 9 California tot 6050753 6169427 6202862 6178788
#> 10 California frl 2820611 3335885 3353964. 3478407
#> # ℹ 92 more rows
Practical example of using the pivot_longer()
function in the Appendix
4.2 Reshaping long to wide: pivot_wider()
The pivot_wider()
function:
?pivot_wider
# SYNTAX AND DEFAULT VALUES
pivot_wider(data, id_cols = NULL, names_from = name,
names_prefix = "", names_sep = "_", names_repair = "check_unique",
values_from = value, values_fill = NULL, values_fn = NULL)
- Function: “widens” data, increasing the number of columns and decreasing the number of rows
- Arguments:
data
: Dataframe to pivotnames_from
: Column(s) to get the name of the output columnvalues_from
: Column(s) to get the cell values from
Example: Tidying table2
(reshaping long to wide)
As seen previously, the second common reason for untidy data is that multiple variables are stored in one column (e.g., table2
):
- An observation is scattered across multiple rows
- One column identifies variable type (e.g.,
type
) and another column contains the values for each variable (e.g.,count
)
country | year | type | count |
---|---|---|---|
Afghanistan | 1999 | cases | 745 |
Afghanistan | 1999 | population | 19987071 |
Afghanistan | 2000 | cases | 2666 |
Afghanistan | 2000 | population | 20595360 |
Brazil | 1999 | cases | 37737 |
Brazil | 1999 | population | 172006362 |
Brazil | 2000 | cases | 80488 |
Brazil | 2000 | population | 174504898 |
China | 1999 | cases | 212258 |
China | 1999 | population | 1272915272 |
China | 2000 | cases | 213766 |
China | 2000 | population | 1280428583 |
This sort of data structure is very common “in the wild” (e.g., in data that you download), and it is up to you to tidy it before analyses.
The solution to this problem is to transform the untidy rows (which represent different variables) into columns. Thus, we want to transform table2
into something that looks like this:
country | year | cases | population |
---|---|---|---|
Afghanistan | 1999 | 745 | 19987071 |
Afghanistan | 2000 | 2666 | 20595360 |
Brazil | 1999 | 37737 | 172006362 |
Brazil | 2000 | 80488 | 174504898 |
China | 1999 | 212258 | 1272915272 |
China | 2000 | 213766 | 1280428583 |
This can be achieved using pivot_wider()
:
table2#> # A tibble: 12 × 4
#> country year type count
#> <chr> <dbl> <chr> <dbl>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
%>%
table2 pivot_wider(
names_from = type, # values in `type` become variable names
values_from = count # values in `count` become values in the new `cases` and `population` cols
)#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <dbl> <dbl> <dbl>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
Example: Pivoting with multiple value variables
Sometimes, we want to spread a dataset that contains multiple value variables. Consider the following example where there are separate enrollment variables for total, men, and women:
ipeds_table1#> # A tibble: 15,726 × 5
#> unitid efytotlt efytotlm efytotlw level
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 100654 6157 2646 3511 all
#> 2 100654 4865 2258 2607 ug
#> 3 100654 1292 388 904 grad
#> 4 100663 21554 8383 13171 all
#> 5 100663 13440 5552 7888 ug
#> 6 100663 8114 2831 5283 grad
#> 7 100690 830 337 493 all
#> 8 100690 415 177 238 ug
#> 9 100690 415 160 255 grad
#> 10 100706 9206 5194 4012 all
#> # ℹ 15,716 more rows
The goal is to turn the values in levels
into columns - but which of the enrollment variables should be used as the values for the new columns? The solution is to create columns for each of the enrollment variables:
%>%
ipeds_table1 pivot_wider(
names_from = level,
values_from = c(efytotlt, efytotlm, efytotlw)
)#> # A tibble: 6,951 × 10
#> unitid efytotlt_all efytotlt_ug efytotlt_grad efytotlm_all efytotlm_ug
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 100654 6157 4865 1292 2646 2258
#> 2 100663 21554 13440 8114 8383 5552
#> 3 100690 830 415 415 337 177
#> 4 100706 9206 6994 2212 5194 3983
#> 5 100724 6101 5373 728 2321 2105
#> 6 100751 41507 35199 6308 18526 15946
#> 7 100760 2379 2379 NA 1027 1027
#> 8 100812 4124 4124 NA 1478 1478
#> 9 100830 5996 5072 924 2158 1847
#> 10 100858 29237 22953 6284 14893 11871
#> # ℹ 6,941 more rows
#> # ℹ 4 more variables: efytotlm_grad <dbl>, efytotlw_all <dbl>,
#> # efytotlw_ug <dbl>, efytotlw_grad <dbl>
Practical example of using the pivot_wider()
function in the Appendix
5 Missing values
5.1 Explicit vs. Implicit missing values
There are two types of missing values:
- Explicit missing values: variable has the value
NA
for a particular row - Implicit missing values: the row is simply not present in the data
Example: Identifying explicit and implicit missing values
Consider the following dataset, which shows stock return
for each year and quarter for some hypothetical company:
year | qtr | return |
---|---|---|
2015 | 1 | 1.88 |
2015 | 2 | 0.59 |
2015 | 3 | 0.35 |
2015 | 4 | NA |
2016 | 2 | 0.92 |
2016 | 3 | 0.17 |
2016 | 4 | 2.66 |
The variable return
has:
- 1 explicit missing value: There is an
NA
in the 4th quarter of 2015 - 1 implicit missing value: Data for the 1st quarter of 2016 simply does not exist
Practical example of identifying explicit and implicit missing values in the Appendix
5.2 Making implicit missing values explicit
An implicit missing value is the result of a row not existing. In order to make an implicit missing value explicit, we need to make the non-existent row exist. This can be done using the complete()
function within tidyr
.
The complete()
function:
?complete
# SYNTAX AND DEFAULT VALUES
complete(data, ..., fill = list())
- Function: Turns implicit missing values into explicit missing values
- Arguments:
data
: A dataframe...
: Variables to expand on
- Returns: A dataframe object that has all unique combinations of the specified variables, including those not found in the original dataframe
Example: Making implicit missing values explicit
Recall the stocks dataset from the previous example. There was 1 implicit missing value for the 1st quarter of 2016 - that row simply did not exist.
If we want to turn that into an explicit missing value, then we need the row for the 1st quarter of 2016 to exist. The result would look like this:
year | qtr | return |
---|---|---|
2015 | 1 | 1.88 |
2015 | 2 | 0.59 |
2015 | 3 | 0.35 |
2015 | 4 | NA |
2016 | 1 | NA |
2016 | 2 | 0.92 |
2016 | 3 | 0.17 |
2016 | 4 | 2.66 |
We can achieve this by using the complete()
function:
stocks#> # A tibble: 7 × 3
#> year qtr return
#> <dbl> <dbl> <dbl>
#> 1 2015 1 1.88
#> 2 2015 2 0.59
#> 3 2015 3 0.35
#> 4 2015 4 NA
#> 5 2016 2 0.92
#> 6 2016 3 0.17
#> 7 2016 4 2.66
%>%
stocks complete(year, qtr)
#> # A tibble: 8 × 3
#> year qtr return
#> <dbl> <dbl> <dbl>
#> 1 2015 1 1.88
#> 2 2015 2 0.59
#> 3 2015 3 0.35
#> 4 2015 4 NA
#> 5 2016 1 NA
#> 6 2016 2 0.92
#> 7 2016 3 0.17
#> 8 2016 4 2.66
Should you make implicit missing values explicit?
- No clear-cut answer - it depends on many context-specific things about your project
- The important thing is to be aware of the presence of implicit missing values (both in the “input” datasets you read in and the datasets you create from the inputs) and be purposeful about how you deal with implicit missing values
- Recommendation for the stage of creating analysis datasets from input data:
- If you feel unsure about making implicit values explicit, then make them explicit
- This forces you to be more fully aware of patterns of missing data and helps you avoid careless mistakes down the road
- After making implicit missing values explicit, you can drop these rows once you are sure you don’t need them
Practical example of using the complete()
function in the Appendix
5.3 Reshaping long to wide: missing values
Let’s take a look at what happens with missing values when we reshape from long to wide. Recall the stocks
dataset from the earlier example, where we have 1 explicit missing value (2015 4th quarter) and 1 implicit missing value (2016 1st quarter):
year | qtr | return |
---|---|---|
2015 | 1 | 1.88 |
2015 | 2 | 0.59 |
2015 | 3 | 0.35 |
2015 | 4 | NA |
2016 | 2 | 0.92 |
2016 | 3 | 0.17 |
2016 | 4 | 2.66 |
When we reshape it from long to wide using pivot_wider()
, notice that both implicit and explicit missing values appear as NA
. In other words, implicit missing values are automatically turned into explicit missing values. Therefore, there is no need to use complete()
prior to pivoting.
%>% pivot_wider(
stocks names_from = qtr,
values_from = return
)
year | 1 | 2 | 3 | 4 |
---|---|---|---|---|
2015 | 1.88 | 0.59 | 0.35 | NA |
2016 | NA | 0.92 | 0.17 | 2.66 |
Note that we can no longer tell whether the NA
s used to be implicit or explicit missing values, as they are now all explicit. One way to be able to distinguish between them is by specifying the values_fill
argument when calling pivot_wider()
. This will fill in the previously implicit values with any value of your choice, meaning any NA
you see after reshaping must be from the previously explicit missing values. In other words, explicit missing values stay explicit while implicit missing values are filled.
%>% pivot_wider(
stocks names_from = qtr,
values_from = return,
values_fill = list(return = -99)
)
year | 1 | 2 | 3 | 4 |
---|---|---|---|---|
2015 | 1.88 | 0.59 | 0.35 | NA |
2016 | -99.00 | 0.92 | 0.17 | 2.66 |
6 Appendix
6.1 Student exercise: Real-world example of wide to long
[In the task I present below, you have the option to work through the solution I have created or try doing it on your own before you look at solution]
The following dataset is drawn from Table 204.10 of the NCES Digest for Education Statistics.
- The table shows change over time in the number and percentage of K-12 students on free/reduced lunch for selected years.
- LINK to website with data
#Let's take a look at the data (we read in the data in the wide vs long section)
glimpse(nces_table)
#> Rows: 51
#> Columns: 13
#> $ state <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "…
#> $ tot_2000 <dbl> 728351, 105333, 877696, 449959, 6050753, 724349, 562179, 11…
#> $ tot_2010 <dbl> 730427, 132104, 1067210, 482114, 6169427, 842864, 552919, 1…
#> $ tot_2011 <dbl> 731556, 131166, 1024454, 483114, 6202862, 853610, 543883, 1…
#> $ tot_2012 <dbl> 740475, 131483, 990378, 486157, 6178788, 863121, 549295, 12…
#> $ frl_2000 <dbl> 335143, 32468, 274277, 205058, 2820611, 195148, 143030, 377…
#> $ frl_2010 <dbl> 402386, 50701, 482044, 291608, 3335885, 336426, 190554, 615…
#> $ frl_2011 <dbl> 420447.00, 53238.00, 511885.00, 294324.00, 3353963.72, 3488…
#> $ frl_2012 <dbl> 429604, 53082, 514193, 298573, 3478407, 358876, 201085, 664…
#> $ p_frl_2000 <dbl> 46.01394, 30.82415, 31.24966, 45.57260, 46.61587, 26.94116,…
#> $ p_frl_2010 <dbl> 55.08915, 38.37961, 45.16862, 60.48528, 54.07123, 39.91462,…
#> $ p_frl_2011 <dbl> 57.47298, 40.58826, 49.96662, 60.92227, 54.07123, 40.87300,…
#> $ p_frl_2012 <dbl> 58.01735, 40.37176, 51.91886, 61.41493, 56.29594, 41.57887,…
#Create smaller version of data frame for purpose of student exercise
<- nces_table %>%
totalselect(state,p_frl_2000,p_frl_2010,p_frl_2011,p_frl_2012)
head(total)
#> # A tibble: 6 × 5
#> state p_frl_2000 p_frl_2010 p_frl_2011 p_frl_2012
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Alabama 46.0 55.1 57.5 58.0
#> 2 Alaska 30.8 38.4 40.6 40.4
#> 3 Arizona 31.2 45.2 50.0 51.9
#> 4 Arkansas 45.6 60.5 60.9 61.4
#> 5 California 46.6 54.1 54.1 56.3
#> 6 Colorado 26.9 39.9 40.9 41.6
Task (using the data frame total
):
- Diagnose the problem with the data frame
total
- Sketch out what the tidy data should look like
- Transform untidy to tidy. hint: use
names_prefix
Solution to student exercise:
- Diagnose the problem with the data frame
total
- Column names
p_frl_2000
,p_frl_2010
, etc. are not variables; rather they refer to values of the variableyear
- Currently each observation represents a state with separate number of students on FRL variables for each year.
- Each observation should be a state-year, with only one variable for FRL
- Column names
- Sketch out what the tidy data should look like
- Transform untidy to tidy
- names of the set of columns that represent values, not variables in your untidy data
p_frl_2000,p_frl_2010,p_frl_2011,p_frl_2012
- names_to : variable name you will assign to columns you are gathering from the untidy data
- This var doesn’t yet exist in untidy data, but will be a variable name in the tidy data
- In this example, it’s year
- values_to: The name of the variable that will contain values in the tidy dataset you create and whose values are spread across multiple columns in the untidy dataset
- This var doesn’t yet exist in untidy data, but will be a variable name in the tidy data
- in this example, the value variable is frl_students
- names of the set of columns that represent values, not variables in your untidy data
%>%
total pivot_longer(
cols = starts_with("p_frl_"),
names_to = "year",
names_prefix = ("p_frl_"),
values_to = "pct_frl"
)#> # A tibble: 204 × 3
#> state year pct_frl
#> <chr> <chr> <dbl>
#> 1 Alabama 2000 46.0
#> 2 Alabama 2010 55.1
#> 3 Alabama 2011 57.5
#> 4 Alabama 2012 58.0
#> 5 Alaska 2000 30.8
#> 6 Alaska 2010 38.4
#> 7 Alaska 2011 40.6
#> 8 Alaska 2012 40.4
#> 9 Arizona 2000 31.2
#> 10 Arizona 2010 45.2
#> # ℹ 194 more rows
6.2 Student exercise: real-world example of reshaping long to wide
[In the task I present below, you have the option to work through the solution I have created or try doing it on your own before you look at solution]
The Integrated Postsecondary Education Data System (IPEDS) collects data on colleges and universities
- Below we load IPEDS data on 12-month enrollment headcount for 2015-16 academic year
#load these libraries if you haven't already
#library(haven)
#library(labelled)
# ipeds_table <- read_dta(file="https://github.com/ozanj/rclass/raw/master/data/ipeds/effy/ey15-16_hc.dta", encoding=NULL)
Create smaller version of dataset
#ipeds_hc <- ipeds_table %>% select(instnm,unitid,lstudy,efytotlt,efytotlm,efytotlw)
<- ipeds_table %>% select(instnm,unitid,lstudy,efytotlt) ipeds_hc
Get to know data
head(ipeds_hc)
#> # A tibble: 6 × 4
#> instnm unitid lstudy efytotlt
#> <chr> <dbl> <dbl+lbl> <dbl>
#> 1 Alabama A & M University 100654 999 [Generated total] 6157
#> 2 Alabama A & M University 100654 1 [Undergraduate] 4865
#> 3 Alabama A & M University 100654 3 [Graduate] 1292
#> 4 University of Alabama at Birmingham 100663 999 [Generated total] 21554
#> 5 University of Alabama at Birmingham 100663 1 [Undergraduate] 13440
#> 6 University of Alabama at Birmingham 100663 3 [Graduate] 8114
str(ipeds_hc)
#> tibble [15,726 × 4] (S3: tbl_df/tbl/data.frame)
#> $ instnm : chr [1:15726] "Alabama A & M University" "Alabama A & M University" "Alabama A & M University" "University of Alabama at Birmingham" ...
#> ..- attr(*, "label")= chr "Institution (entity) name"
#> ..- attr(*, "format.stata")= chr "%91s"
#> $ unitid : num [1:15726] 100654 100654 100654 100663 100663 ...
#> ..- attr(*, "label")= chr "Unique identification number of the institution"
#> ..- attr(*, "format.stata")= chr "%12.0g"
#> $ lstudy : dbl+lbl [1:15726] 999, 1, 3, 999, 1, 3, 999, 1, 3, 999, ...
#> ..@ label : chr "Original level of study on survey form"
#> ..@ format.stata: chr "%15.0g"
#> ..@ labels : Named num [1:3] 1 3 999
#> .. ..- attr(*, "names")= chr [1:3] "Undergraduate" "Graduate" "Generated total"
#> $ efytotlt: num [1:15726] 6157 4865 1292 21554 13440 ...
#> ..- attr(*, "label")= chr "Grand total"
#> ..- attr(*, "format.stata")= chr "%12.0g"
#> - attr(*, "label")= chr "dct_s2015_is"
#Variable labels
%>% var_label()
ipeds_hc #> $instnm
#> [1] "Institution (entity) name"
#>
#> $unitid
#> [1] "Unique identification number of the institution"
#>
#> $lstudy
#> [1] "Original level of study on survey form"
#>
#> $efytotlt
#> [1] "Grand total"
#Only the variable lstudy has value labels
%>% select(lstudy) %>% val_labels()
ipeds_hc #> $lstudy
#> Undergraduate Graduate Generated total
#> 1 3 999
Student Task:
- Diagnose the problem with the data frame
ipeds_hc
(why is it untidy?) - Sketch out what the tidy data should look like
- Transform untidy to tidy
Solution to student task:
1. Diagnose the problem with the data frame - First, let’s investigate the data
<- ipeds_hc %>% arrange(unitid, lstudy)
ipeds_hc head(ipeds_hc, n=20)
#> # A tibble: 20 × 4
#> instnm unitid lstudy efytotlt
#> <chr> <dbl> <dbl+lbl> <dbl>
#> 1 Alabama A & M University 100654 1 [Undergraduate] 4865
#> 2 Alabama A & M University 100654 3 [Graduate] 1292
#> 3 Alabama A & M University 100654 999 [Generated total] 6157
#> 4 University of Alabama at Birmingham 100663 1 [Undergraduate] 13440
#> 5 University of Alabama at Birmingham 100663 3 [Graduate] 8114
#> 6 University of Alabama at Birmingham 100663 999 [Generated total] 21554
#> 7 Amridge University 100690 1 [Undergraduate] 415
#> 8 Amridge University 100690 3 [Graduate] 415
#> 9 Amridge University 100690 999 [Generated total] 830
#> 10 University of Alabama in Huntsville 100706 1 [Undergraduate] 6994
#> 11 University of Alabama in Huntsville 100706 3 [Graduate] 2212
#> 12 University of Alabama in Huntsville 100706 999 [Generated total] 9206
#> 13 Alabama State University 100724 1 [Undergraduate] 5373
#> 14 Alabama State University 100724 3 [Graduate] 728
#> 15 Alabama State University 100724 999 [Generated total] 6101
#> 16 The University of Alabama 100751 1 [Undergraduate] 35199
#> 17 The University of Alabama 100751 3 [Graduate] 6308
#> 18 The University of Alabama 100751 999 [Generated total] 41507
#> 19 Central Alabama Community College 100760 1 [Undergraduate] 2379
#> 20 Central Alabama Community College 100760 999 [Generated total] 2379
#code to investigate what each observation represents
#I'll break this code down next week when we talk about joining data frames
%>% group_by(unitid,lstudy) %>% # group_by our candidate
ipeds_hc mutate(n_per_id=n()) %>% # calculate number of obs per group
ungroup() %>% # ungroup the data
count(n_per_id==1) # count "true that only one obs per group"
#> # A tibble: 1 × 2
#> `n_per_id == 1` n
#> <lgl> <int>
#> 1 TRUE 15726
Summary of problems with the data frame:
- In the untidy data frame, each row represents college-level_of_study
- there are separate rows for each value of level of study (
undergraduate
,graduate
,generated total
) - so three rows for each college
- there are separate rows for each value of level of study (
- the values of the column
lstudy
represent different attributes (undergraduate
,graduate
,generated total
)- each of these attributes should be its own variable
2. Sketch out what the tidy data should look like (sketch out on your own)
- What tidy data should look like:
- Each observation (row) should be a college
- There should be separate variables for each level of study, with each variable containing enrollment for that level of study
3. Transform untidy to tidy
- names_from. Column name in the untidy data whose values will become variable names in the tidy data that contains variable names
- this variable name exists in the untidy data
- in
ipeds_hc
the key column islstudy
; the values oflstudy
, will become variable names in the tidy data
- values_from. Column name in untidy data that contains values for the new variables that will be created in the tidy data
- this is a varname that exists in the untidy data
- in
ipeds_hc
the value column isefytotlt
; the values ofefytotlt
will become the values of the new variables in the tidy data
%>%
ipeds_hc pivot_wider(names_from = lstudy, values_from = efytotlt) #notice it uses the underlying data not labels
#> # A tibble: 6,951 × 5
#> instnm unitid `1` `3` `999`
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Alabama A & M University 100654 4865 1292 6157
#> 2 University of Alabama at Birmingham 100663 13440 8114 21554
#> 3 Amridge University 100690 415 415 830
#> 4 University of Alabama in Huntsville 100706 6994 2212 9206
#> 5 Alabama State University 100724 5373 728 6101
#> 6 The University of Alabama 100751 35199 6308 41507
#> 7 Central Alabama Community College 100760 2379 NA 2379
#> 8 Athens State University 100812 4124 NA 4124
#> 9 Auburn University at Montgomery 100830 5072 924 5996
#> 10 Auburn University 100858 22953 6284 29237
#> # ℹ 6,941 more rows
Alternative solution:
Helpful to create a character version of variable lstudy prior to spreading
%>% select(lstudy) %>% val_labels()
ipeds_hc #> $lstudy
#> Undergraduate Graduate Generated total
#> 1 3 999
str(ipeds_hc$lstudy)
#> dbl+lbl [1:15726] 1, 3, 999, 1, 3, 999, 1, 3, 999, 1, 3, 9...
#> @ label : chr "Original level of study on survey form"
#> @ format.stata: chr "%15.0g"
#> @ labels : Named num [1:3] 1 3 999
#> ..- attr(*, "names")= chr [1:3] "Undergraduate" "Graduate" "Generated total"
<- ipeds_hc %>%
ipeds_hcv2 mutate(level = recode(as.integer(lstudy),
`1` = "ug",
`3` = "grad",
`999` = "all")
%>% select(-lstudy) # drop variable lstudy
)
head(ipeds_hcv2)
#> # A tibble: 6 × 4
#> instnm unitid efytotlt level
#> <chr> <dbl> <dbl> <chr>
#> 1 Alabama A & M University 100654 4865 ug
#> 2 Alabama A & M University 100654 1292 grad
#> 3 Alabama A & M University 100654 6157 all
#> 4 University of Alabama at Birmingham 100663 13440 ug
#> 5 University of Alabama at Birmingham 100663 8114 grad
#> 6 University of Alabama at Birmingham 100663 21554 all
%>% select(instnm,unitid,level,efytotlt) %>%
ipeds_hcv2 pivot_wider(names_from = level, values_from = efytotlt) #nicer!
#> # A tibble: 6,951 × 5
#> instnm unitid ug grad all
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Alabama A & M University 100654 4865 1292 6157
#> 2 University of Alabama at Birmingham 100663 13440 8114 21554
#> 3 Amridge University 100690 415 415 830
#> 4 University of Alabama in Huntsville 100706 6994 2212 9206
#> 5 Alabama State University 100724 5373 728 6101
#> 6 The University of Alabama 100751 35199 6308 41507
#> 7 Central Alabama Community College 100760 2379 NA 2379
#> 8 Athens State University 100812 4124 NA 4124
#> 9 Auburn University at Montgomery 100830 5072 924 5996
#> 10 Auburn University 100858 22953 6284 29237
#> # ℹ 6,941 more rows
6.3 Missing variables example using IPEDS
This section deals with missing variables and tidying data.
But first, it is helpful to create a new version of the IPEDS enrollment dataset as follows:
keeps observations for for-profit colleges
keeps the following enrollment variables:
- total enrollment
- enrollment of students who identify as “Black or African American”
<- ipeds_table %>% filter(sector %in% c(3,6,9)) %>% #keep only for-profit colleges ipeds_hc_na select(instnm,unitid,lstudy,efytotlt,efybkaam) %>% mutate(level = recode(as.integer(lstudy), # create recoded version of lstudy `1` = "ug", `3` = "grad", `999` = "all") %>% select(instnm,unitid,level,efytotlt,efybkaam) %>% ) arrange(unitid,desc(level))
Now, let’s print some rows
- There is one row for each college-level_of_study
- Some colleges have three rows of data (ug, grad, all)
- Colleges that don’t have any undergraduates or don’t have any graduate students only have two rows of data
ipeds_hc_na#> # A tibble: 6,265 × 5
#> instnm unitid level efytotlt efybkaam
#> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 South University-Montgomery 101116 ug 777 122
#> 2 South University-Montgomery 101116 grad 218 40
#> 3 South University-Montgomery 101116 all 995 162
#> 4 New Beginning College of Cosmetology 101277 ug 132 0
#> 5 New Beginning College of Cosmetology 101277 all 132 0
#> 6 Herzing University-Birmingham 101365 ug 675 73
#> 7 Herzing University-Birmingham 101365 grad 15 5
#> 8 Herzing University-Birmingham 101365 all 690 78
#> 9 Prince Institute-Southeast 101958 ug 34 0
#> 10 Prince Institute-Southeast 101958 all 34 0
#> # ℹ 6,255 more rows
Now let’s create new versions of the enrollment variables, that replace 0
with NA
<- ipeds_hc_na %>%
ipeds_hc_na mutate(
efytotltv2 = ifelse(efytotlt == 0, NA, efytotlt),
efybkaamv2 = ifelse(efybkaam == 0, NA, efybkaam)
%>% select(instnm,unitid,level,efytotlt,efytotltv2,efybkaam,efybkaamv2)
)
%>% select(unitid,level,efytotlt,efytotltv2,efybkaam,efybkaamv2)
ipeds_hc_na #> # A tibble: 6,265 × 6
#> unitid level efytotlt efytotltv2 efybkaam efybkaamv2
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 101116 ug 777 777 122 122
#> 2 101116 grad 218 218 40 40
#> 3 101116 all 995 995 162 162
#> 4 101277 ug 132 132 0 NA
#> 5 101277 all 132 132 0 NA
#> 6 101365 ug 675 675 73 73
#> 7 101365 grad 15 15 5 5
#> 8 101365 all 690 690 78 78
#> 9 101958 ug 34 34 0 NA
#> 10 101958 all 34 34 0 NA
#> # ℹ 6,255 more rows
Create dataset that drops the original enrollment variables, keeps enrollment vars that replace 0
with NA
<- ipeds_hc_na %>% select(-efytotlt,-efybkaam) ipeds_hc_nav2
Now we can introduce the concepts of explicit and implicit missing values
There are two types of missing values:
- Explicit missing values: variable has the value
NA
for a particular row - Implicit missing values: the row is simply not present in the data
Let’s print data for the first two colleges
%>% head( n=5)
ipeds_hc_nav2 #> # A tibble: 5 × 5
#> instnm unitid level efytotltv2 efybkaamv2
#> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 South University-Montgomery 101116 ug 777 122
#> 2 South University-Montgomery 101116 grad 218 40
#> 3 South University-Montgomery 101116 all 995 162
#> 4 New Beginning College of Cosmetology 101277 ug 132 NA
#> 5 New Beginning College of Cosmetology 101277 all 132 NA
South University-Montgomery
has three rows:
- variable
efytotltv2
has0
explicit missing values and0
implicit missing values - variable
efybkaamv2
has0
explicit missing values and0
implicit missing values
New Beginning College of Cosmetology
has two rows (because they have no graduate students):
- variable
efytotltv2
has0
explicit missing values and1
implicit missing values (no row for grad students) - variable
efybkaamv2
has2
explicit missing values and1
implicit missing values (no row for grad students)
6.4 Applying complete() to IPEDS dataset
Let’s apply complete()
to our IPEDS dataset
ipeds_hc_nav2#> # A tibble: 6,265 × 5
#> instnm unitid level efytotltv2 efybkaamv2
#> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 South University-Montgomery 101116 ug 777 122
#> 2 South University-Montgomery 101116 grad 218 40
#> 3 South University-Montgomery 101116 all 995 162
#> 4 New Beginning College of Cosmetology 101277 ug 132 NA
#> 5 New Beginning College of Cosmetology 101277 all 132 NA
#> 6 Herzing University-Birmingham 101365 ug 675 73
#> 7 Herzing University-Birmingham 101365 grad 15 5
#> 8 Herzing University-Birmingham 101365 all 690 78
#> 9 Prince Institute-Southeast 101958 ug 34 NA
#> 10 Prince Institute-Southeast 101958 all 34 NA
#> # ℹ 6,255 more rows
<- ipeds_hc_nav2 %>% select(unitid,level,efytotltv2,efybkaamv2) %>%
ipeds_complete complete(unitid, level)
ipeds_complete#> # A tibble: 9,063 × 4
#> unitid level efytotltv2 efybkaamv2
#> <dbl> <chr> <dbl> <dbl>
#> 1 101116 all 995 162
#> 2 101116 grad 218 40
#> 3 101116 ug 777 122
#> 4 101277 all 132 NA
#> 5 101277 grad NA NA
#> 6 101277 ug 132 NA
#> 7 101365 all 690 78
#> 8 101365 grad 15 5
#> 9 101365 ug 675 73
#> 10 101958 all 34 NA
#> # ℹ 9,053 more rows
#Confirm that the "complete" dataset always has three observations per unitid
%>% group_by(unitid) %>% summarise(n=n()) %>% count(n)
ipeds_complete #> Storing counts in `nn`, as `n` already present in input
#> ℹ Use `name = "new_name"` to pick a new name.
#> # A tibble: 1 × 2
#> n nn
#> <int> <int>
#> 1 3 3021
#Note that previous dataset did not
%>% group_by(unitid) %>% summarise(n=n()) %>% count(n)
ipeds_hc_nav2 #> Storing counts in `nn`, as `n` already present in input
#> ℹ Use `name = "new_name"` to pick a new name.
#> # A tibble: 2 × 2
#> n nn
#> <int> <int>
#> 1 2 2798
#> 2 3 223