library(tidyverse)
library(haven)
library(labelled)
0.0.1 What we will do today
1 Introduction
1.0.1 What is exploratory data analysis (EDA)?
The Towards Data Science website has a nice definition of EDA:
“Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns,to spot anomalies,to test hypothesis and to check assumptions with the help of summary statistics”
This course focuses on “data management”:
- investigating and cleaning data for the purpose of creating analysis variables
- Basically, everything that happens before you conduct analyses
I think about “exploratory data analysis for data quality”
- Investigating values and patterns of variables from “input data”
- Identifying and cleaning errors or values that need to be changed
- Creating analysis variables
- Checking values of analysis variables agains values of input variables
1.0.2 How we will teach exploratory data analysis
Will teach exploratory data analysis (EDA) in two sub-sections:
- Provide “Guidelines for EDA”
- Less about coding, more about practices you should follow and mentality necessary to ensure high data quality
- Introduce “Tools of EDA”:
- Demonstrate code to investigate variables and relatioship between variables
- Most of these tools are just the application of programming skills you have already learned
1.0.3 Libraries
“Load” the package we will use today (output omitted)
- you must run this code chunk after installing these packages
If package not yet installed, then must install before you load. Install in “console” rather than .Rmd file
- Generic syntax:
install.packages("package_name")
- Install “tidyverse”:
install.packages("tidyverse")
Note: when we load package, name of package is not in quotes; but when we install package, name of package is in quotes:
install.packages("tidyverse")
library(tidyverse)
1.0.4 Data we will use
Use read_dta()
function from haven
to import Stata dataset into R
<- read_dta(file="https://github.com/ozanj/rclass/raw/master/data/hsls/hsls_stu_small.dta") hsls
Let’s examine the data [you must run this code chunk]
%>% names()
hsls %>% names() %>% str()
hsls %>% names() %>% tolower() %>% str()
hsls
names(hsls) <- tolower(names(hsls)) # convert names to lowercase
names(hsls)
str(hsls) # ugh
str(hsls$s3classes)
attributes(hsls$s3classes)
typeof(hsls$s3classes)
class(hsls$s3classes)
Download the HSLS Codebook: https://nces.ed.gov/pubs2014/2014361_AppendixI.pdf
1.0.5 Data we will use
First, Let’s create a smaller version of the HSLS:09 dataset
#hsls %>% var_label()
<- hsls %>%
hsls_small select(stu_id,x3univ1,x3sqstat,x4univ1,x4sqstat,s3classes,
s3work,s3focus,s3clgft,s3workft,s3clgid,s3clgcntrl,
s3clglvl,s3clgsel,s3clgstate,s3proglevel,x4evrappclg,
x4evratndclg,x4atndclg16fb,x4ps1sector,x4ps1level,
x4ps1ctrl,x4ps1select,x4refsector,x4reflevel,x4refctrl, x4refselect, x2sex,x2race,x2paredu,x2txmtscor,x4x2ses,x4x2sesq5)
names(hsls_small)
%>% var_label() hsls_small
2 Guidelines for exploratory data analysis (EDA)
2.0.1 Rule #1 for data quality: DATA BETTER BE RIGHT
-
- Sometimes social justice is creating a GPA variable from course-level data
- Very politically charged issue; would’ve been bad if we didn’t get the data right
-
- They didn’t get the data right; I took them to task
-
- I didn’t get the data right; I got taken to task
Salazar, K. G., Jaquette, O., & Han, C. (2021). Coming Soon to a Neighborhood Near You? Off-Campus Recruiting by Public Research Universities. American Educational Research Journal, 58(6), 1270-1314
- Karina/Crystal/Patricia spent thousands of hours getting the data right
2.0.2 Rule #1: DATA BETTER BE RIGHT (for grad students)
Researchers who develop a reputation for always geting the data right are the ones who always get research funding
- Grad students working on research projects are usually the front-line of getting the data right
The virtuous circle of getting the data right, delivering on deliverables, and inter-generational grad student opportunity
- you don’t pay it forward later, you pay it forward now with the quality of work you do now
- sometimes you pay it forward even when you didn’t get the opportunities you deserve
Challenges that arise
- Principal investigators who don’t have respect for getting the data right
- Not enough time or resources to get the data right
2.0.3 Guidelines for “EDA for data quality”
Assme that your goal in “EDA for data quality” is to investigate “input” data sources and create “analysis variables”
- Usually, your analysis dataset will incorporate multiple sources of input data, including data you collect (primary data) and/or data collected by others (secondary data)
While this is not a linear process, these are the broad steps I follow
- Understand how input data sources were created
- e.g., when working with survey data, have survey questionnaire and codebooks on hand
- watch out for skip patterns!!!
- For each input data source, identify the “unit of analysis” and which combination of variables uniquely identify observations
- Investigate patterns in input variables
- Create analysis variable from input variable(s)
- Verify that analysis variable is created correctly through descriptive statistics that compare values of input variable(s) against values of the analysis variable
Always be aware of missing values
- They will not always be coded as
NA
in input variables
2.0.4 “Unit of analysis” and which variables uniquely identify observations
“Unit of analysis” refers to “what does each observation represent” in an input data source
- If each obs represents a student, you have “student level data”
- If each obs represents a student-course, you have “student-course level data”
- If each obs represents a school, you have “school-level data”
- If each obs represents a school-year, you have “school-year level data”
How to identify unit of analysis
- data documentation
- investigating the data set
We will go over syntax for identifying unit of analysis in subsequent weeks
2.0.5 Rules for variable creation
Rules I follow for variable creation
- Never modify “input variable”; instead create new variable based on input variable(s)
- Always keep input variables used to create new variables
- Investigate input variable(s) and relationship between input variables
- Developing a plan for creation of analysis variable
- e.g., for each possible value of input variables, what should value of analysis variable be?
- Write code to create analysis variable
- Run descriptive checks to verify new variables are constructed correctly
- Can “comment out” these checks, but don’t delete them
- Document new variables with notes and labels
2.0.6 Rules for variable creation
Task:
- Create analysis for variable ses qunitile called
sesq5
based onx4x2sesq5
that converts negative values toNAs
[some of the code here will be covered in the next section]
#investigate input variable
%>% select(x4x2sesq5) %>% var_label()
hsls_small %>% select(x4x2sesq5) %>% val_labels()
hsls_small %>% select(x4x2sesq5) %>% count(x4x2sesq5)
hsls_small %>% select(x4x2sesq5) %>% count(x4x2sesq5) %>% as_factor()
hsls_small
#create analysis variable
<- hsls_small %>%
hsls_small_temp mutate(sesq5=ifelse(x4x2sesq5==-8,NA,x4x2sesq5)) # approach 1
<- hsls_small %>%
hsls_small_temp mutate(sesq5=ifelse(x4x2sesq5<0,NA,x4x2sesq5)) # approach 2
#verify
%>% group_by(x4x2sesq5) %>% count(sesq5) hsls_small_temp
2.0.7 Skip patterns in survey data
Pretty easy to create an analysis variable based on a single input variable
Harder to create analysis variables based on multiple input variables
- When working with survey data, even seemingly simple analysis variables require multiple input variables due to “skip patterns”
What are “skip patterns”?
- Response on a particular survey item determines whether respondent answers some set of subsequent questions
- What are some examples of this?
Key to working with skip patterns
- Have the survey questionnaire on hand
- Sometimes it appears that analysis variable requires only one input variable, but really depends on several input variables because of skip patterns
- Don’t just blindly turn “missing” and “skips” from survey data to
NAs
in your analysis variable - Rather, trace why these “missing” and “skips” appear and decide how they should be coded in your analysis variable
- Don’t just blindly turn “missing” and “skips” from survey data to
See appendix of this lecture for more detail on working with skip patterns
2.0.8 Problem set due next week
Assignment:
- create GPA from postsecondary transcript student-course level data
- last week’s problem set created the input var: numgrade_v2
- this wek you are responsible for developing plan to create GPA vars and for executing plan (rather than us giving you step-by-step quations)
Why this assignment?
- Give you more practice investigating data, cleaning data, creating variables that require processing across rows
- Real world example of “simple” task with complex data management needs
Data source: National Longitudinal Study of 1972 (NLS72)
- Follows 12th graders from 1972
- Base year: 1972
- Follow-up surveys in: 1973, 1974, 1976, 1979, 1986
- Postsecondary transcripts collected in 1984
- Why use such an old survey for this assignment?
- NLS72 predates data privacy agreements; transcript data publicly available
Recommendation for making decisions in presence of “dirty data”
- Spend some reasonable amount of time investigating data patterns
- Make decisions about what to when variable values seem problematic (e.g., a course is worth 30 credits)
- DON’T make (subjective) decisions one observation at a time; rather, decide on rules that will apply across all observations
- you can explain rationale for your decisions in comments of your code
3 Tools for exploratory data analysis (EDA)
3.0.1 Tools of EDA
To do EDA for data quality, must master the following tools:
- *Select, sort, filter, and print** in order to see data patterns, anomolies
- Select and sort particular values of particular variables
- Print particular values of particular variables
- One-way descriptive analyses (i.e,. focus on one variable)
- Descriptive analyses for continuous variables
- Descriptive analyses for discreet/categorical variables
- Two-way descriptive analyses (relationship between two variables)
- Categorical by categorical
- Categorical by continuous
- Continuous by continuous
Whenever using any of these tools, pay close attention to missing values and how they are coded
- Often, the “input” variables don’t code missing values as
NA
- Especially when working with survey data, missing values coded as a negative number (e.g.,
-9
,-8
,-4
) with different negative values representing different reasons for data being missing - sometimes missing values coded as very high positive numbers
- Therefore, important to investigate input vars prior to creating analysis vars
3.0.2 Tools of EDA, caveat (using graphs to investigate variables)
Caveat on tools of EDA introduced in this lecture
- Usually you can investigate categorical variables pretty effectively by creating table of frequency counts
- More difficult to investigate continuous variables or relationship between continuous variable and another variable
- too many different values for frequency counts
- summary statistics (e.g., min, max) sometimes not very effective
- Graphs can be a great way to investigate continuous variables
- also effective for investigating relationship between continuous variable and some other variable
- but we do not introduce graphs until beginning of EDUC260B (Rclass2)
3.0.3 Caveat (using base R ifelse()
rather than Tidyverse if_else()
)
The below slides use the base R ifelse()
function rather than Tidyverse if_else()
function to create variables
For example, below code uses
ifelse()
to create a version of the variables3classes
that replaces values less than zero withNA
- Note that the new variable we create
s3classes_na
is a numeric class variable, rather than labelled class, so it won’t have value labels
%>% mutate(s3classes_na=ifelse(s3classes<0,NA,s3classes)) %>% hsls_small select(s3classes,s3classes_na) %>% glimpse()
- Note that the new variable we create
Why not use Tidyverse if_else()
?
below code won’t work because
if_else(<condition>,<true>,<false>)
expects values assigned when condition isTRUE
to have the same class as values assigned when condition isFALSE
- but there is no variation of
NA
that has the labelled class, so R throws an error
%>% mutate(s3classes_na=if_else(s3classes<0,NA,s3classes)) hsls_small %>% mutate(s3classes_na=if_else(s3classes<0,NA_real_,s3classes)) hsls_small
3.0.4 Caveat (using base R
ifelse()
rather than Tidyverseif_else()
)- but there is no variation of
This “stack overflow” (LINK) post explains some solutions to get around the problem
- We don’t implement any of these solutions because we want to keep the code as simple as possible
The downside of using ifelse
is that the new variable has numeric class rather than labelled class; we lose value labels (and variable labels)
- One way around this, is to assign the
labelled()
function from “haven” package to assign “labels” attribute from the input variable
<- hsls_small %>% mutate(s3classes_na=ifelse(s3classes<0,NA,s3classes))
hsls_small_temp
attributes(hsls_small_temp$s3classes_na) # null
attributes(hsls_small_temp$s3classes) # labelled class
$s3classes_na <- labelled(
hsls_small_temp$s3classes_na,
hsls_small_templabels = attr(x=hsls_small_temp$s3classes, which = "labels"),
label = attr(x=hsls_small_temp$s3classes, which = "label")
)%>% select(s3classes,s3classes_na) %>% glimpse()
hsls_small_temp %>% count(s3classes_na)
hsls_small_temp rm(hsls_small_temp)
3.0.5 Tools of EDA: select, sort, filter, and print
We’ve already know select()
, arrange()
, filter()
Select, sort, and print specific vars
#sort and print
%>% arrange(desc(stu_id)) %>%
hsls_small select(stu_id,x3univ1,x3sqstat,s3classes,s3clglvl)
#investigate variable attributes
%>% arrange(desc(stu_id)) %>%
hsls_small select(stu_id,x3univ1,x3sqstat,s3classes,s3clglvl) %>% str()
#print observations with value labels rather than variable values
%>% arrange(desc(stu_id)) %>%
hsls_small select(stu_id,x3univ1,x3sqstat,s3classes,s3clglvl) %>% as_factor()
Sometimes helpful to increase the number of observations printed
class(hsls_small) #it's a tibble, which is the "tidyverse" version of a data frame
options(tibble.print_min=50)
# execute this in console
%>% arrange(desc(stu_id)) %>%
hsls_small select(stu_id,x3univ1,x3sqstat,s3classes,s3clglvl)
options(tibble.print_min=10) # set default printing back to 10 lines
3.0.6 One-way descriptive stats for continuous vars, Base R approach [SKIP]
mean(hsls_small$x2txmtscor)
sd(hsls_small$x2txmtscor)
#Careful: summary stats include value of -8!
min(hsls_small$x2txmtscor)
max(hsls_small$x2txmtscor)
Be careful with NA
values
#Create variable replacing -8 with NA
<- hsls_small %>%
hsls_small_temp mutate(x2txmtscorv2=ifelse(x2txmtscor==-8,NA,x2txmtscor))
%>% filter(is.na(x2txmtscorv2)) %>% count(x2txmtscorv2)
hsls_small_temp
mean(hsls_small_temp$x2txmtscorv2)
mean(hsls_small_temp$x2txmtscorv2, na.rm=TRUE)
rm(hsls_small_temp)
3.0.7 One-way descriptive stats for continuous vars, Tidyverse approach
Use summarise_at()
, a variation of summarise()
, to make descriptive stats
.args=list(na.rm=TRUE)
= a named list of additional arguments to be added to all function calls
Task:
- calculate descriptive stats for
x2txmtscor
, math test score
#?summarise_at
%>% select(x2txmtscor) %>% var_label()
hsls_small #> $x2txmtscor
#> [1] "X2 Mathematics standardized theta score"
%>%
hsls_small summarise_at(
.vars = vars(x2txmtscor),
.funs = funs(mean, sd, min, max, .args=list(na.rm=TRUE))
)#> # A tibble: 1 × 4
#> mean sd min max
#> <dbl> <dbl> <dbl> <dbl>
#> 1 44.1 21.8 -8 84.9
3.0.8 One-way descriptive stats for continuous vars, Tidyverse approach
Can calculate descriptive stats for more than one variable at a time
Task:
- calculate descriptive stats for
x2txmtscor
, math test score, andx4x2ses
, socioeconomic index score
%>% select(x2txmtscor,x4x2ses) %>% var_label()
hsls_small #> $x2txmtscor
#> [1] "X2 Mathematics standardized theta score"
#>
#> $x4x2ses
#> [1] "X4 Revised X2 Socio-economic status composite"
%>%
hsls_small summarise_at(
.vars = vars(x2txmtscor,x4x2ses),
.funs = funs(mean, sd, min, max, .args=list(na.rm=TRUE))
)#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> Warning: Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> # A tibble: 1 × 8
#> x2txmtscor_mean x4x2ses_mean x2txmtscor_sd x4x2ses_sd x2txmtscor_min
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 44.1 -0.802 21.8 2.63 -8
#> # ℹ 3 more variables: x4x2ses_min <dbl>, x2txmtscor_max <dbl>,
#> # x4x2ses_max <dbl>
3.0.9 One-way descriptive stats for continuous vars, Tidyverse approach
“Input vars” in survey data often have negative values for missing/skips
%>% filter(x2txmtscor<0) %>% count(x2txmtscor) hsls_small
R includes those negative values when calculating stats; you don’t want this
- Solution: create version of variable that replaces negative values with
NA
%>% mutate(x2txmtscor_na=ifelse(x2txmtscor<0,NA,x2txmtscor)) %>%
hsls_small summarise_at(
.vars = vars(x2txmtscor_na),
.funs = funs(mean, sd, min, max, .args=list(na.rm=TRUE))
)#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> Warning: Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> # A tibble: 1 × 4
#> mean sd min max
#> <dbl> <dbl> <dbl> <dbl>
#> 1 51.5 10.2 22.2 84.9
What if you didn’t include .args=list(na.rm=TRUE)
?
%>% mutate(x2txmtscor_na=ifelse(x2txmtscor<0,NA,x2txmtscor)) %>%
hsls_small summarise_at(
.vars = vars(x2txmtscor_na),
.funs = funs(mean, sd, min, max))
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> Warning: Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> # A tibble: 1 × 4
#> mean sd min max
#> <dbl> <dbl> <dbl> <dbl>
#> 1 NA NA NA NA
3.0.10 One-way descriptive stats for continuous vars, Tidyverse approach
How to identify these missing/skip values if you don’t have a codebook?
count()
combined withfilter()
helpful for finding extreme values of continuous vars, which are often associated with missing or skip
#variable x2txmtscor
%>% filter(x2txmtscor<0) %>%
hsls_small count(x2txmtscor)
#> # A tibble: 1 × 2
#> x2txmtscor n
#> <dbl> <int>
#> 1 -8 2909
#variable s3clglvl
%>% select(s3clglvl) %>% var_label()
hsls_small #> $s3clglvl
#> [1] "S3 Enrolled college IPEDS level"
%>% filter(s3clglvl<0) %>%
hsls_small count(s3clglvl)
#> # A tibble: 3 × 2
#> s3clglvl n
#> <dbl+lbl> <int>
#> 1 -9 [Missing] 487
#> 2 -8 [Unit non-response] 4945
#> 3 -7 [Item legitimate skip/NA] 5022
3.0.11 One-way descriptive stats student exercise
- Using the object
hsls
, identify variable type, variable class, and check the variable values and value labels ofx4ps1start
- variable
x4ps1start
identifies month and year student first started postsecondary education - Note: This variable is a bit counterintuitive.
- e.g., the value
201105
refers to May 2011
- e.g., the value
- variable
- Get a frequency count of the variable
x4ps1start
- Get a frequency count of the variable, but this time only observations that have negative values hint: use filter()
- Create a new version of the variable
x4ps1start_na
that replaces negative values with NAs and usesummarise_at()
to get the min and max value.
3.0.12 One-way descriptive stats student exercise solutions
- Using the object
hsls
, identify variable type, variable class, and check the variable vakyes and value labels ofx4ps1start
typeof(hsls$x4ps1start)
#> [1] "double"
class(hsls$x4ps1start)
#> [1] "haven_labelled" "vctrs_vctr" "double"
%>% select(x4ps1start) %>% var_label()
hsls #> $x4ps1start
#> [1] "X4 Month and year of enrollment at first postsecondary institution"
%>% select(x4ps1start) %>% val_labels()
hsls #> $x4ps1start
#> Missing
#> -9
#> Unit non-response
#> -8
#> Item legitimate skip/NA
#> -7
#> Component not applicable
#> -6
#> Item not administered: abbreviated interview
#> -4
#> Carry through missing
#> -3
#> Don't know
#> -1
3.0.13 One-way descriptive stats student exercise solutions
- Get a frequency count of the variable
x4ps1start
%>%
hsls count(x4ps1start)
#> # A tibble: 9 × 2
#> x4ps1start n
#> <dbl+lbl> <int>
#> 1 -9 [Missing] 107
#> 2 -8 [Unit non-response] 6168
#> 3 -7 [Item legitimate skip/NA] 4281
#> 4 201100 57
#> 5 201200 206
#> 6 201300 10800
#> 7 201400 1295
#> 8 201500 471
#> 9 201600 118
3.0.14 One-way descriptive stats student exercise solutions
- Get a frequency count of the variable, but this time only observations that have negative values hint: use filter()
%>%
hsls filter(x4ps1start<0) %>%
count(x4ps1start)
#> # A tibble: 3 × 2
#> x4ps1start n
#> <dbl+lbl> <int>
#> 1 -9 [Missing] 107
#> 2 -8 [Unit non-response] 6168
#> 3 -7 [Item legitimate skip/NA] 4281
3.0.15 One-way descriptive stats student exercise solutions
- Create a new version
x4ps1start_na
of the variablex4ps1start
that replaces negative values with NAs and usesummarise_at()
to get the min and max value.
%>% mutate(x4ps1start_na=ifelse(x4ps1start<0,NA,x4ps1start)) %>%
hsls summarise_at(
.vars = vars(x4ps1start_na),
.funs = funs(min, max, .args=list(na.rm=TRUE))
)#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> Warning: Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> # A tibble: 1 × 2
#> min max
#> <dbl> <dbl>
#> 1 201100 201600
3.0.16 One-way descriptive stats for discrete/categorical vars, Tidyverse approach
Use count()
to investigate values of discrete or categorical variables
For variables where class==labelled
class(hsls_small$s3classes)
attributes(hsls_small$s3classes)
#show counts of variable values
%>% count(s3classes) #print in console to show both
hsls_small #show counts of value labels
%>% count(s3classes) %>% as_factor() hsls_small
- I like
count()
because the default setting is to showNA
values too!
%>% mutate(s3classes_na=ifelse(s3classes<0,NA,s3classes)) %>%
hsls_small count(s3classes_na)
Simultaneously show both values and value labels on count tables for class==labelled
if entered into console
- This requires some concepts/functions we haven’t introduced [SKIP]
%>% count(s3classes)
hsls_small <- hsls_small %>% count(s3classes) %>% as_factor()
y bind_cols(x[,1], y) #wont show in updated R
3.0.17 Relationship between variables, categorical by categorical
Two-way frequency table, called “cross tabulation”, important for data quality
- When you create categorical analysis var from single categorical “input” var
- Two-way tables show us whether we did this correctly
- Two-way tables show us whether we did this correctly
- Two-way tables helpful for understanding skip patterns in surveys
key to syntax
df_name %>% group_by(var1) %>% count(var2)
ORdf_name %>% count(var1,var2)
- play around with which variable is
var1
and which variable isvar2
3.0.18 Relationship between variables, categorical by categorical
Task: Create a two-way table between s3classes
and s3clglvl
- Investigate variables
%>% select(s3classes,s3clglvl) %>% var_label()
hsls_small %>% select(s3classes,s3clglvl) %>% val_labels() hsls_small
- Create two-way table
%>% group_by(s3classes) %>% count(s3clglvl) # show values
hsls_small %>% count(s3classes,s3clglvl)
hsls_small #hsls_small %>% group_by(s3classes) %>% count(s3clglvl) %>% as_factor() # show value labels
- Are these objects the same?
%>% group_by(s3classes) %>% count(s3clglvl) %>% glimpse()
hsls_small #> Rows: 8
#> Columns: 3
#> Groups: s3classes [5]
#> $ s3classes <dbl+lbl> -9, -8, 1, 1, 1, 1, 2, 3
#> $ s3clglvl <dbl+lbl> -9, -8, -9, 1, 2, 3, -7, -7
#> $ n <int> 59, 4945, 428, 8894, 3929, 226, 3401, 1621
%>% count(s3classes,s3clglvl) %>% glimpse()
hsls_small #> Rows: 8
#> Columns: 3
#> $ s3classes <dbl+lbl> -9, -8, 1, 1, 1, 1, 2, 3
#> $ s3clglvl <dbl+lbl> -9, -8, -9, 1, 2, 3, -7, -7
#> $ n <int> 59, 4945, 428, 8894, 3929, 226, 3401, 1621
3.0.19 Relationship between variables, categorical by categorical
Two-way frequency table, also called “cross tabulation”
Task:
- Create a version of
s3classes
calleds3classes_na
that changes negative values toNA
- Create a two-way table between
s3classes_na
ands3clglvl
%>%
hsls_small mutate(s3classes_na=ifelse(s3classes<0,NA,s3classes)) %>%
group_by(s3classes_na) %>% count(s3clglvl)
%>%
hsls_small mutate(s3classes_na=ifelse(s3classes<0,NA,s3classes)) %>%
count(s3classes_na, s3clglvl)
#example where we create some NA obs in the second variable
%>%
hsls_small mutate(s3classes_na=ifelse(s3classes<0,NA,s3classes),
s3clglvl_na=ifelse(s3clglvl==-7,NA,s3clglvl)) %>%
group_by(s3classes_na) %>% count(s3clglvl_na)
%>%
hsls_small mutate(s3classes_na=ifelse(s3classes<0,NA,s3classes),
s3clglvl_na=ifelse(s3clglvl==-7,NA,s3clglvl)) %>%
count(s3classes_na, s3clglvl_na)
3.0.20 Relationship between variables, categorical by categorical [SKIP]
Tables above are pretty ugly
Use the spread()
function from tidyr
package to create table with one variable as columns and the other variable as rows
- The variable you place in
spread()
will be columns - We learn
spread()
function in a subsequent lecture
%>% group_by(s3classes) %>% count(s3clglvl) %>%
hsls_small spread(s3classes, n)
%>% group_by(s3classes) %>% count(s3clglvl) %>%
hsls_small as_factor() %>% spread(s3classes, n)
%>% group_by(s3classes) %>% count(s3clglvl) %>%
hsls_small as_factor() %>% spread(s3clglvl, n)
3.0.21 Relationship between variables, categorical by continuous
Investigating relationship between multiple variables is a little tougher when at least one of the variables is continuous
Conditional mean (like regression with continuous Y and one categorical X):
- Shows average values of continous variables within groups
- Groups are defined by your categorical variable(s)
key to syntax
group_by(categorical_var) %>% summarise_at(.vars = vars(continuous_var)
3.0.22 Relationship between variables, categorical by continuous
Task
- Calculate mean math score,
x2txmtscor
, for each value of parental education,x2paredu
#first, investigate parental education [print in console]
%>% count(x2paredu) hsls_small
# using dplyr to get average math score by parental education level [print in console]
%>% group_by(x2paredu) %>%
hsls_small summarise_at(.vars = vars(x2txmtscor),
.funs = funs(mean, .args = list(na.rm = TRUE)))
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> Warning: Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> # A tibble: 8 × 2
#> x2paredu x2txmtscor
#> <dbl+lbl> <dbl>
#> 1 -8 [Unit non-response] -8
#> 2 1 [Less than high school] 44.3
#> 3 2 [High school diploma or GED or alterntive HS credential] 47.2
#> 4 3 [Certificate/diploma from school providing occupational trainin… 46.4
#> 5 4 [Associate's degree] 48.9
#> 6 5 [Bachelor's degree] 53.3
#> 7 6 [Master's degree] 55.6
#> 8 7 [Ph.D/M.D/Law/other high lvl prof degree] 58.9
3.0.23 Relationship between variables, categorical by continuous
Task
- Calculate mean math score,
x2txmtscor
, for each value ofx2paredu
For checking data quality, helpful to calculate other stats besides mean
%>% group_by(x2paredu) %>% #[print in console]
hsls_small summarise_at(.vars = vars(x2txmtscor),
.funs = funs(mean, min, max, .args = list(na.rm = TRUE)))
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> Warning: Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
Always Investigate presence of missing/skip values
%>% filter(x2paredu<0) %>% count(x2paredu)
hsls_small %>% filter(x2txmtscor<0) %>% count(x2txmtscor) hsls_small
Replace -8
with NA
and re-calculate conditional stats
%>%
hsls_small mutate(x2paredu_na=ifelse(x2paredu<0,NA,x2paredu),
x2txmtscor_na=ifelse(x2txmtscor<0,NA,x2txmtscor)) %>%
group_by(x2paredu_na) %>%
summarise_at(.vars = vars(x2txmtscor_na),
.funs = funs(mean, min, max, .args = list(na.rm = TRUE))) %>%
as_factor()
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> Warning: Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> Warning: There were 2 warnings in `summarise()`.
#> The first warning was:
#> ℹ In argument: `min = min(x2txmtscor_na, na.rm = TRUE)`.
#> ℹ In group 8: `x2paredu_na = NA`.
#> Caused by warning in `min()`:
#> ! no non-missing arguments to min; returning Inf
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
%>% count(s3classes,s3clglvl) %>% as_factor hsls_small
3.0.24 Student exercise
Can use same approach to calculate conditional mean by multiple group_by()
variables
- Just add additional variables within
group_by()
- Calculate mean math test score (
x2txmtscor
), for each combination of parental education (x2paredu
) and sex (x2sex
).
3.0.25 Student exercise solution
- Calculate mean math test score (
x2txmtscor
), for each combination of parental education (x2paredu
) and sex (x2sex
)
#hsls_small %>% count(x2sex)
%>%
hsls_small group_by(x2paredu,x2sex) %>%
summarise_at(.vars = vars(x2txmtscor),
.funs = funs(mean, .args = list(na.rm = TRUE))) %>%
as_factor()
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> Warning: Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
4 Appendix: Skip patterns in survey data
4.0.1 What are skip patterns
Pretty easy to create an analysis variable based on a single input variable
Harder to create analysis variables based on multiple input variables
- When working with survey data, even seemingly simple analysis variables require multiple input variables due to “skip patterns”
What are “skip patterns”?
- Response on a particular survey item determines whether respondent answers some set of subsequent questions
- What are some examples of this?
Key to working with skip patterns
- Have the survey questionnaire on hand
- Sometimes it appears that analysis variable requires only one input variable, but really depends on several input variables because of skip patterns
- Don’t just blindly turn “missing” and “skips” from survey data to
NAs
in your analysis variable - Rather, trace why these “missing” and “skips” appear and decide how they should be coded in your analysis variable
- Don’t just blindly turn “missing” and “skips” from survey data to
4.0.2 Creating analysis variables in the presence of skip patterns
Task: Create a measure of “level” of postsecondary institution attended in 2013 from HSLS:09 survey data
- “level” is highest award-level of the postsecondary institution
- e.g., if highest award is associate’s degree (a two-year degree), then `level==2’
- The measure,
pselev2013
, should have following [non-missing] values:- Not attending postsecondary education institution
- Attending a 2-year or less-than-2-year institution
- Attending 4-year or greater-than-4year institution
Background info:
- In “2013 Update” of HSLS:09, students asked about college attendance
- Variables from student responses to “2013 Update” have prefix
s3
- Variables from student responses to “2013 Update” have prefix
- Survey questionnaire for 2013 update can be found HERE
- The “online codebook” website HERE has info about specific variables
- Measure has 3 input variables [usually must figure this out yourself]:
x3sqstat
: “X3 Student questionnaire status”s3classes
: “S3 B01A Taking postsecondary classes as of Nov 1 2013”s3clglvl
: “S3 Enrolled college IPEDS level”
%>% select(x3sqstat,s3classes,s3clglvl) %>% var_label() hsls_small
You won’t have time to complete this task, but develop a plan for the task and get as far as you can
4.0.3 Creating analysis variables in the presence of skip patterns
Step 1a: Investigate each input variable separately
#variable labels
%>% select(x3sqstat,s3classes,s3clglvl) %>% var_label()
hsls_small
%>% count(x3sqstat)
hsls_small %>% count(x3sqstat) %>% as_factor()
hsls_small
%>% count(s3classes)
hsls_small %>% count(s3classes) %>% as_factor()
hsls_small
%>% count(s3clglvl)
hsls_small %>% count(s3clglvl) %>% as_factor() hsls_small
4.0.4 Creating analysis variables in the presence of skip patterns
Step 1b: Investigate relationship between input variables
#x3sqstate and s3classes
%>% group_by(x3sqstat) %>% count(s3classes)
hsls_small %>% group_by(x3sqstat) %>% count(s3classes) %>% as_factor()
hsls_small
%>% filter(x3sqstat==8) %>% count(s3classes)
hsls_small %>% filter(x3sqstat==8) %>% count(s3classes==-8)
hsls_small %>% filter(x3sqstat !=8) %>% count(s3classes)
hsls_small
#x3sqstate, s3classes and s3clglvl
%>% group_by(s3classes) %>% count(s3clglvl)
hsls_small %>% group_by(s3classes) %>% count(s3clglvl) %>% as_factor()
hsls_small
#add filter for whether student did not respond to X3 questionnaire
%>% filter(x3sqstat==8) %>% group_by(s3classes) %>% count(s3clglvl)
hsls_small %>% filter(x3sqstat !=8) %>% group_by(s3classes) %>% count(s3clglvl)
hsls_small
#continued on the next page
4.0.5 Creating analysis variables in the presence of skip patterns
Step 1b: Investigate relationship between input variables continued…
#add filter for s3classes is "missing" [-9]
%>% filter(x3sqstat !=8,s3classes==-9) %>% group_by(s3classes) %>%
hsls_small count(s3clglvl)
#> # A tibble: 1 × 3
#> # Groups: s3classes [1]
#> s3classes s3clglvl n
#> <dbl+lbl> <dbl+lbl> <int>
#> 1 -9 [Missing] -9 [Missing] 59
%>% filter(x3sqstat !=8,s3classes!=-9) %>% group_by(s3classes) %>%
hsls_small count(s3clglvl)
#> # A tibble: 6 × 3
#> # Groups: s3classes [3]
#> s3classes s3clglvl n
#> <dbl+lbl> <dbl+lbl> <int>
#> 1 1 [Yes] -9 [Missing] 428
#> 2 1 [Yes] 1 [4 or more years] 8894
#> 3 1 [Yes] 2 [At least 2 but less than 4 years] 3929
#> 4 1 [Yes] 3 [Less than 2 years (below associate)] 226
#> 5 2 [No] -7 [Item legitimate skip/NA] 3401
#> 6 3 [Don't know] -7 [Item legitimate skip/NA] 1621
#add filter for s3classes equal to "no" or "don't know"
%>% filter(x3sqstat !=8,s3classes!=-9, s3classes %in% c(2,3)) %>%
hsls_small group_by(s3classes) %>% count(s3clglvl)
#> # A tibble: 2 × 3
#> # Groups: s3classes [2]
#> s3classes s3clglvl n
#> <dbl+lbl> <dbl+lbl> <int>
#> 1 2 [No] -7 [Item legitimate skip/NA] 3401
#> 2 3 [Don't know] -7 [Item legitimate skip/NA] 1621
%>% filter(x3sqstat !=8,s3classes!=-9, s3classes %in% c(2,3)) %>%
hsls_small group_by(s3classes) %>% count(s3clglvl) %>% as_factor()
#> # A tibble: 2 × 3
#> # Groups: s3classes [2]
#> s3classes s3clglvl n
#> <fct> <fct> <int>
#> 1 No Item legitimate skip/NA 3401
#> 2 Don't know Item legitimate skip/NA 1621
%>% filter(x3sqstat !=8,s3classes!=-9, s3classes==1) %>%
hsls_small group_by(s3classes) %>% count(s3clglvl)
#> # A tibble: 4 × 3
#> # Groups: s3classes [1]
#> s3classes s3clglvl n
#> <dbl+lbl> <dbl+lbl> <int>
#> 1 1 [Yes] -9 [Missing] 428
#> 2 1 [Yes] 1 [4 or more years] 8894
#> 3 1 [Yes] 2 [At least 2 but less than 4 years] 3929
#> 4 1 [Yes] 3 [Less than 2 years (below associate)] 226
%>% filter(x3sqstat !=8,s3classes!=-9, s3classes==1) %>%
hsls_small group_by(s3classes) %>% count(s3clglvl) %>% as_factor()
#> # A tibble: 4 × 3
#> # Groups: s3classes [1]
#> s3classes s3clglvl n
#> <fct> <fct> <int>
#> 1 Yes Missing 428
#> 2 Yes 4 or more years 8894
#> 3 Yes At least 2 but less than 4 years 3929
#> 4 Yes Less than 2 years (below associate) 226