1 Introduction

Lecture overview

Introduction to College Scorecard data on debt and earnings from degree programs
Brief review of statistics (selected concepts)

1.1 Libraries we will use

#install.packages('tidyverse') # if you haven't installed already
#install.packages('labelled') # if you haven't installed already

library(tidyverse) # load tidyverse package
library(labelled) # load labelled package

2 College Scorecard data

The College Scorecard is a tool created by the US Department of Education that seeks to help prospective students investigate/compare postsecondary institutions and degree programs

We will use data from the College Scorecard in lecture and potentially for some assignments
Relevant links

Recently, the US Department of Education released new College Scorecard data on debt and earnings associated with postsecondary degree programs

(In theory) for each postsecondary degree program offered by a college/university, College Scorecard identifies average student debt associated with that degree program (for federal student loan programs) and average earnings of graduates
Many debt and earnings variables have the value 'PrivacySuppressed' because the number of graduates is small enough that individual students could potentially be identified
Generally, academic programs with “large” enrollment are not suppressed
We will focus on debt and earnings associated with master’s (MA) programs
- Hopefully, this information will be useful for students considering an MA program down the road

In the following sub-sections, we “load” the data, create modified datasets, investigate the data, and run some basic descriptive statistics

You are not responsible for knowing the code below
- You will only be responsible for knowing code that we explicitly teach you during the quarter
But try to follow the general logic of what the code is doing
And try running the below “code chunks” on your own computer

2.1 Load and inspect data

Load data frame (i.e., dataset)

load(file = url('https://github.com/anyone-can-cook/educ152/raw/main/data/college_scorecard/output_data/df_debt_earn_panel_labelled.RData'))

Create subset data frame:

df_scorecard <- df_debt_earn_panel_labelled %>%
    # keep most recent year of data
    filter(field_ay == '2017-18') %>%
    # keep master's degrees
    filter(credlev == 5) %>%
    # carnegie categories to keep: 15 = Doctoral Universities: Very High Research Activity; 16 = Doctoral Universities: High Research Activity
    filter(ccbasic %in% c(15,16)) %>%
    # drop "parent plus" loan variables and other vars we won't use in this lecture
    select(-contains('_pp'),-contains('_any'),-field_ay,-st_fips,-zip,-longitude,-latitude,-locale2,-highdeg,-accredagency,-relaffil,-hbcu,-annhi,-tribal,-aanapii,-hsi,-nanti,-main,-numbranch) %>%
    # create variable for broad field of degree (e.g., education, business)
    mutate(cipdig2 = str_sub(string = cipcode, start = 1, end = 2)) %>%
    # shorten variable cipdesc to make it more suitable for printing
    mutate(cipdesc = str_sub(string = cipdesc, start = 1, end = 50)) %>%
    # re-order variables
    relocate(opeid6,unitid,instnm,control,ccbasic,stabbr,city,cipdig2)

# "glimpse" data frame
df_scorecard %>% glimpse()
#> Rows: 15,336
#> Columns: 24
#> $ opeid6                        <chr> "001009", "001009", "001009", "001009...
#> $ unitid                        <dbl> 100858, 100858, 100858, 100858, 10085...
#> $ instnm                        <chr> "Auburn University", "Auburn Universi...
#> $ control                       <chr+lbl> Public, Public, Public, Public, P...
#> $ ccbasic                       <dbl+lbl> 15, 15, 15, 15, 15, 15, 15, 15, 1...
#> $ stabbr                        <chr> "AL", "AL", "AL", "AL", "AL", "AL", "...
#> $ city                          <chr> "Auburn", "Auburn", "Auburn", "Auburn...
#> $ cipdig2                       <chr> "01", "01", "01", "01", "01", "03", "...
#> $ cipcode                       <chr> "0101", "0103", "0109", "0111", "0181...
#> $ cipdesc                       <chr> "Agricultural Business and Management...
#> $ credlev                       <dbl+lbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
#> $ creddesc                      <chr> "Master's Degree", "Master's Degree",...
#> $ ipedscount1                   <chr> "7", "15", "17", "24", "NULL", "7", "...
#> $ ipedscount2                   <chr> "3", "14", "11", "38", "NULL", "4", "...
#> $ debt_all_stgp_eval_n          <chr> "PrivacySuppressed", "PrivacySuppress...
#> $ debt_all_stgp_eval_mean       <chr> "PrivacySuppressed", "PrivacySuppress...
#> $ debt_all_stgp_eval_mdn        <chr> "PrivacySuppressed", "PrivacySuppress...
#> $ debt_all_stgp_eval_mdn10yrpay <chr> "PrivacySuppressed", "PrivacySuppress...
#> $ earn_count_wne_hi_1yr         <chr> "PrivacySuppressed", "15", "12", "11"...
#> $ earn_mdn_hi_1yr               <chr> "PrivacySuppressed", "46478", "43426"...
#> $ earn_count_wne_hi_2yr         <chr> "PrivacySuppressed", "13", "PrivacySu...
#> $ earn_mdn_hi_2yr               <chr> "PrivacySuppressed", "44942", "Privac...
#> $ region                        <dbl+lbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
#> $ locale                        <dbl+lbl> 13, 13, 13, 13, 13, 13, 13, 13, 1...
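
The variable cipdig2 above is built by taking the first two characters of cipcode with str_sub(). A minimal sketch of that step on a made-up vector (the object names `cipcode_toy` and `cipdig2_toy` are hypothetical and not part of the Scorecard data):

```r
library(stringr)

# hypothetical 4-digit CIP codes, stored as character strings
cipcode_toy <- c("0101", "1301", "5122")

# keep characters 1 through 2, i.e., the broad 2-digit field code
cipdig2_toy <- str_sub(string = cipcode_toy, start = 1, end = 2)
cipdig2_toy
#> [1] "01" "13" "51"
```

Because the codes are stored as character rather than numeric, leading zeros such as "01" are preserved.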

# For debt and earnings variables, convert from character to numeric variables (which replaces "PrivacySuppressed" values with NA values)
df_scorecard <- df_scorecard %>%
  mutate(
    debt_all_stgp_eval_n = as.numeric(debt_all_stgp_eval_n),
    debt_all_stgp_eval_mean = as.numeric(debt_all_stgp_eval_mean),
    debt_all_stgp_eval_mdn = as.numeric(debt_all_stgp_eval_mdn),
    debt_all_stgp_eval_mdn10yrpay = as.numeric(debt_all_stgp_eval_mdn10yrpay),
    earn_count_wne_hi_1yr = as.numeric(earn_count_wne_hi_1yr),
    earn_mdn_hi_1yr = as.numeric(earn_mdn_hi_1yr),
    earn_count_wne_hi_2yr = as.numeric(earn_count_wne_hi_2yr),
    earn_mdn_hi_2yr = as.numeric(earn_mdn_hi_2yr)
  ) 

# add variable label to variable cipdig2
  attr(df_scorecard[['cipdig2']], which = 'label') <- 'broad degree field code = 2-digit classification of instructional programs (CIP) degree code'

# add variable label attribute back to debt and earnings variables
  for(v in c('debt_all_stgp_eval_n','debt_all_stgp_eval_mean','debt_all_stgp_eval_mdn','debt_all_stgp_eval_mdn10yrpay','earn_count_wne_hi_1yr','earn_mdn_hi_1yr','earn_count_wne_hi_2yr','earn_mdn_hi_2yr','cipdesc')) {
    
    #writeLines(str_c('object v=', v))
    #writeLines(attr(df_debt_earn_panel_labelled[[v]], which = 'label'))
    
    attr(df_scorecard[[v]], which = 'label') <- attr(df_debt_earn_panel_labelled[[v]], which = 'label')
  }

df_scorecard %>% glimpse()
#> Rows: 15,336
#> Columns: 24
#> $ opeid6                        <chr> "001009", "001009", "001009", "001009...
#> $ unitid                        <dbl> 100858, 100858, 100858, 100858, 10085...
#> $ instnm                        <chr> "Auburn University", "Auburn Universi...
#> $ control                       <chr+lbl> Public, Public, Public, Public, P...
#> $ ccbasic                       <dbl+lbl> 15, 15, 15, 15, 15, 15, 15, 15, 1...
#> $ stabbr                        <chr> "AL", "AL", "AL", "AL", "AL", "AL", "...
#> $ city                          <chr> "Auburn", "Auburn", "Auburn", "Auburn...
#> $ cipdig2                       <chr> "01", "01", "01", "01", "01", "03", "...
#> $ cipcode                       <chr> "0101", "0103", "0109", "0111", "0181...
#> $ cipdesc                       <chr> "Agricultural Business and Management...
#> $ credlev                       <dbl+lbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
#> $ creddesc                      <chr> "Master's Degree", "Master's Degree",...
#> $ ipedscount1                   <chr> "7", "15", "17", "24", "NULL", "7", "...
#> $ ipedscount2                   <chr> "3", "14", "11", "38", "NULL", "4", "...
#> $ debt_all_stgp_eval_n          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1...
#> $ debt_all_stgp_eval_mean       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 6...
#> $ debt_all_stgp_eval_mdn        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ debt_all_stgp_eval_mdn10yrpay <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ earn_count_wne_hi_1yr         <dbl> NA, 15, 12, 11, NA, NA, NA, NA, NA, 2...
#> $ earn_mdn_hi_1yr               <dbl> NA, 46478, 43426, 43314, NA, NA, NA, ...
#> $ earn_count_wne_hi_2yr         <dbl> NA, 13, NA, NA, NA, NA, NA, 11, NA, 2...
#> $ earn_mdn_hi_2yr               <dbl> NA, 44942, NA, NA, NA, NA, NA, 42712,...
#> $ region                        <dbl+lbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
#> $ locale                        <dbl+lbl> 13, 13, 13, 13, 13, 13, 13, 13, 1...
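
To see why the conversion above turns 'PrivacySuppressed' into NA, and why the variable labels had to be added back afterwards, here is a minimal sketch on a made-up vector (`x_toy` and its label are hypothetical):

```r
# hypothetical character vector that mimics an earnings variable
x_toy <- c("PrivacySuppressed", "46478", "43426")
attr(x_toy, which = "label") <- "toy earnings variable"

# as.numeric() cannot parse the text value, so it returns NA for it
# (R also issues a warning, which we silence here)
x_num <- suppressWarnings(as.numeric(x_toy))
x_num
#> [1]    NA 46478 43426

# as.numeric() drops attributes, so the variable label is gone
is.null(attr(x_num, which = "label"))
#> [1] TRUE
```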

Investigate dataset, show variable labels

df_scorecard %>% var_label()
#> $opeid6
#> [1] "6-digit OPE ID for institution"
#> 
#> $unitid
#> [1] "Unit ID for institution"
#> 
#> $instnm
#> [1] "Institution name"
#> 
#> $control
#> [1] "Control of institution"
#> 
#> $ccbasic
#> [1] "Carnegie Classification -- basic"
#> 
#> $stabbr
#> [1] "State postcode"
#> 
#> $city
#> [1] "City"
#> 
#> $cipdig2
#> [1] "broad degree field code = 2-digit classification of instructional programs (CIP) degree code"
#> 
#> $cipcode
#> [1] "Classification of Instructional Programs (CIP) code for the field of study"
#> 
#> $cipdesc
#> [1] "Text description of the field of study CIP Code"
#> 
#> $credlev
#> [1] "Level of credential"
#> 
#> $creddesc
#> [1] "Text description of the level of credential"
#> 
#> $ipedscount1
#> [1] "Number of awards to all students in year 1 of the pooled debt cohort"
#> 
#> $ipedscount2
#> [1] "Number of awards to all students in year 2 of the pooled debt cohort"
#> 
#> $debt_all_stgp_eval_n
#> [1] "Borrower count for average/median Stafford and Grad PLUS loan debt disbursed at this institution"
#> 
#> $debt_all_stgp_eval_mean
#> [1] "Average Stafford and Grad PLUS loan debt disbursed at this institution"
#> 
#> $debt_all_stgp_eval_mdn
#> [1] "Median Stafford and Grad PLUS loan debt disbursed at this institution"
#> 
#> $debt_all_stgp_eval_mdn10yrpay
#> [1] "Median estimated monthly payment for Stafford and Grad PLUS loan debt disbursed at this institution"
#> 
#> $earn_count_wne_hi_1yr
#> [1] "Number of graduates working and not enrolled 1 year after completing highest credential"
#> 
#> $earn_mdn_hi_1yr
#> [1] "Median earnings of graduates working and not enrolled 1 year after completing highest credential"
#> 
#> $earn_count_wne_hi_2yr
#> [1] "Number of graduates working and not enrolled 2 years after completing highest credential"
#> 
#> $earn_mdn_hi_2yr
#> [1] "Median earnings of graduates working and not enrolled 2 years after completing highest credential"
#> 
#> $region
#> [1] "Region (IPEDS)"
#> 
#> $locale
#> [1] "Locale of institution"

Investigate dataset, show “value labels” for variables that have value labels

code not run

df_scorecard %>% val_labels()
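
Since the code above is not run, the following sketch shows what variable labels and value labels look like on a small made-up data frame. The data frame `df_toy`, its values, and its labels are hypothetical; only the labelled package functions are taken from the lecture.

```r
library(labelled)

# hypothetical toy data frame with one labelled variable
df_toy <- data.frame(id = 1:3)
df_toy$region <- labelled(
  c(5, 5, 8),
  labels = c("Southeast" = 5, "Far West" = 8), # value labels
  label = "Region (toy example)"               # variable label
)

# variable label: describes the variable as a whole
var_label(df_toy$region)

# value labels: describe what each underlying numeric value means
val_labels(df_toy$region)
```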

2.2 MA debt/earnings at fancy univs

Investigate debt and earnings from a few fancy universities

UC-Berkeley
- unitid == 110635
- opeid6 == 001312
UCLA
- unitid == 110662
- opeid6 == 001315
USC
- unitid == 123961
- opeid6 == 001328
Stanford
- unitid == 243744
- opeid6 == 001305
Columbia
- unitid == 190150
- opeid6 == 002707
Columbia, Teacher’s College
- unitid == 196468
- opeid6 == 003979
NYU
- unitid == 193900
- opeid6 == 002785
Harvard
- unitid == 166027
- opeid6 == 002155

Examine debt and earnings from master’s programs at some fancy universities

df_scorecard %>% 
  # opeid6: UC-Berkeley = 001312; UCLA = 001315; USC = 001328; Stanford = 001305; Columbia = 002707; Columbia, Teacher's College = 003979; NYU = 002785; Harvard = 002155
  filter(opeid6 %in% c('001312','001315','001328','001305','002707','003979','002785','002155')) %>%
  # filter observations where debt_all_stgp_eval_n is not missing (NA)
  filter(!is.na(debt_all_stgp_eval_n)) %>%
  select(instnm,cipdig2,cipcode,cipdesc,debt_all_stgp_eval_n,debt_all_stgp_eval_mean,earn_mdn_hi_2yr)
#> # A tibble: 220 x 7
#>    instnm cipdig2 cipcode cipdesc debt_all_stgp_e~ debt_all_stgp_e~
#>    <chr>  <chr>   <chr>   <chr>              <dbl>            <dbl>
#>  1 Stanf~ 05      0501    Area S~               14            41792
#>  2 Stanf~ 09      0901    Commun~               11            45456
#>  3 Stanf~ 11      1107    Comput~               22            43580
#>  4 Stanf~ 13      1301    Educat~              171            35264
#>  5 Stanf~ 14      1402    Aerosp~               13            63721
#>  6 Stanf~ 14      1408    Civil ~               51            47367
#>  7 Stanf~ 14      1410    Electr~               25            50608
#>  8 Stanf~ 14      1419    Mechan~               20            52503
#>  9 Stanf~ 15      1515    Engine~               20            56004
#> 10 Stanf~ 51      5122    Public~               11            60488
#> # ... with 210 more rows, and 1 more variable: earn_mdn_hi_2yr <dbl>
#%>% print(n=200)

Examine debt and earnings from master’s programs in education at some fancy universities

df_scorecard %>% 
  # opeid6: UC-Berkeley = 001312; UCLA = 001315; USC = 001328; Stanford = 001305; Columbia = 002707; Columbia, Teacher's College = 003979; NYU = 002785; Harvard = 002155
  filter(opeid6 %in% c('001312','001315','001328','001305','002707','003979','002785','002155')) %>%
  # filter observations where debt_all_stgp_eval_n is not missing (NA)
  filter(!is.na(debt_all_stgp_eval_n)) %>%
  # filter degree programs in education
  filter(cipdig2 == '13') %>%
  select(instnm,cipdig2,cipcode,cipdesc,debt_all_stgp_eval_n,debt_all_stgp_eval_mean,earn_mdn_hi_2yr)
#> # A tibble: 26 x 7
#>    instnm cipdig2 cipcode cipdesc debt_all_stgp_e~ debt_all_stgp_e~
#>    <chr>  <chr>   <chr>   <chr>              <dbl>            <dbl>
#>  1 Stanf~ 13      1301    "Educa~              171            35264
#>  2 Unive~ 13      1301    "Educa~              221            32803
#>  3 Unive~ 13      1304    "Educa~               59            67159
#>  4 Unive~ 13      1306    "Educa~               17            61611
#>  5 Unive~ 13      1311    "Stude~               48            84293
#>  6 Unive~ 13      1313    "Teach~              273            66099
#>  7 Unive~ 13      1314    "Teach~               71            61120
#>  8 Harva~ 13      1301    "Educa~              641            35492
#>  9 New Y~ 13      1304    "Educa~               17            73952
#> 10 New Y~ 13      1305    "Educa~               17            86476
#> # ... with 16 more rows, and 1 more variable: earn_mdn_hi_2yr <dbl>

2.3 MAs in education

Create a new data frame that contains data about Education MA programs in which debt variables are not suppressed

df_scorecard_edu <- df_scorecard %>% 
  # filter: degree programs in education; and debt_all_stgp_eval_n is not missing (NA)
  filter(cipdig2 == '13', !is.na(debt_all_stgp_eval_n))

Investigate new data frame df_scorecard_edu

#df_scorecard_edu %>% glimpse()

# investigate data structure
  # one observation per opeid6-cipcode
  df_scorecard_edu %>% group_by(opeid6,cipcode) %>% summarise(n_per_key=n()) %>% ungroup() %>% count(n_per_key)
#> # A tibble: 1 x 2
#>   n_per_key     n
#>       <int> <int>
#> 1         1   773
  #df_scorecard_edu %>% group_by(unitid,cipcode) %>% summarise(n_per_key=n()) %>% ungroup() %>% count(n_per_key)
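
The check above counts how many rows share each opeid6-cipcode combination. If every combination appears exactly once, the only value of n_per_key is 1. A minimal sketch of the same logic on a made-up data frame that deliberately contains a duplicate (`df_keys_toy` is hypothetical; `count(a, b)` is used here as shorthand for `group_by(a, b) %>% summarise(n = n()) %>% ungroup()`):

```r
library(dplyr)

# hypothetical data: institution "B" lists cipcode "1301" twice
df_keys_toy <- tibble(
  opeid6  = c("A", "A", "B", "B"),
  cipcode = c("1301", "1304", "1301", "1301")
)

key_check <- df_keys_toy %>%
  count(opeid6, cipcode, name = "n_per_key") %>% # rows per key
  count(n_per_key)                               # how many keys have 1 row, 2 rows, ...

key_check
```

Here two keys appear once and one key appears twice, so the result has a row with n_per_key equal to 2. That row signals that opeid6-cipcode does not uniquely identify observations in the toy data.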

# name of institutions
df_scorecard_edu %>% group_by(instnm) %>% slice(1) %>% ungroup() %>% select(instnm)
#> # A tibble: 216 x 1
#>    instnm                               
#>    <chr>                                
#>  1 American University                  
#>  2 Arizona State University-Tempe       
#>  3 Arkansas State University-Main Campus
#>  4 Auburn University                    
#>  5 Azusa Pacific University             
#>  6 Ball State University                
#>  7 Baylor University                    
#>  8 Binghamton University                
#>  9 Boise State University               
#> 10 Boston College                       
#> # ... with 206 more rows

# control (public, private)

df_scorecard_edu %>% group_by(opeid6) %>% slice(1) %>% ungroup %>% count(control) %>% as_factor()
#> # A tibble: 2 x 2
#>   control                n
#>   <fct>              <int>
#> 1 Private, nonprofit    51
#> 2 Public               165

# carnegie classification
df_scorecard_edu %>% group_by(opeid6) %>% slice(1) %>% ungroup %>% count(ccbasic) %>% as_factor()
#> # A tibble: 2 x 2
#>   ccbasic                                                n
#>   <fct>                                              <int>
#> 1 Doctoral Universities: Very High Research Activity   108
#> 2 Doctoral Universities: High Research Activity        108

# level of urbanization
df_scorecard_edu %>% group_by(opeid6) %>% slice(1) %>% ungroup %>% count(locale) %>% as_factor()
#> # A tibble: 9 x 2
#>   locale                                                                       n
#>   <fct>                                                                    <int>
#> 1 City: Large (population of 250,000 or more)                                 76
#> 2 City: Midsize (population of at least 100,000 but less than 250,000)        40
#> 3 City: Small (population less than 100,000)                                  39
#> 4 Suburb: Large (outside principal city, in urbanized area with populatio~    29
#> 5 Suburb: Midsize (outside principal city, in urbanized area with populat~     6
#> 6 Suburb: Small (outside principal city, in urbanized area with populatio~     7
#> 7 Town: Fringe (in urban cluster up to 10 miles from an urbanized area)        3
#> 8 Town: Distant (in urban cluster more than 10 miles and up to 35 miles f~     9
#> 9 Town: Remote (in urban cluster more than 35 miles from an urbanized are~     7

# which education degrees (slice(1) keeps one row per cipdesc, so n is always 1)
df_scorecard_edu %>% group_by(cipdesc) %>% slice(1) %>% ungroup %>% count(cipdesc)
#> # A tibble: 13 x 2
#>    cipdesc                                                  n
#>    <chr>                                                <int>
#>  1 "Bilingual, Multilingual, and Multicultural Educati"     1
#>  2 "Curriculum and Instruction."                            1
#>  3 "Education, General."                                    1
#>  4 "Education, Other."                                      1
#>  5 "Educational Administration and Supervision."            1
#>  6 "Educational Assessment, Evaluation, and Research."      1
#>  7 "Educational/Instructional Media Design."                1
#>  8 "International and Comparative Education."               1
#>  9 "Social and Philosophical Foundations of Education."     1
#> 10 "Special Education and Teaching."                        1
#> 11 "Student Counseling and Personnel Services."             1
#> 12 "Teacher Education and Professional Development, Sp"     1
#> 13 "Teaching English or French as a Second or Foreign "     1

Investigate data frame df_scorecard_edu, debt and earnings variables

# mean of mean debt from Stafford and Grad PLUS loans
  df_scorecard_edu %>% summarize(
    mean_debt = mean(debt_all_stgp_eval_mean, na.rm = TRUE)
  )
#> # A tibble: 1 x 1
#>   mean_debt
#>       <dbl>
#> 1    35415.
  
  # separate for public vs. private
  df_scorecard_edu %>% group_by(control) %>% summarize(
    mean_debt = mean(debt_all_stgp_eval_mean, na.rm = TRUE)
  )
#> # A tibble: 2 x 2
#>   control            mean_debt
#>   <chr+lbl>              <dbl>
#> 1 Private, nonprofit    47341.
#> 2 Public                32447.

# mean of median earnings, 2 years after graduation
  df_scorecard_edu %>% summarize(
    mean_earn = mean(earn_mdn_hi_2yr, na.rm = TRUE)
  )
#> # A tibble: 1 x 1
#>   mean_earn
#>       <dbl>
#> 1    48194.
  
  # separate for public vs. private
  df_scorecard_edu %>% group_by(control) %>% summarize(
    mean_earn = mean(earn_mdn_hi_2yr, na.rm = TRUE)
  )
#> # A tibble: 2 x 2
#>   control            mean_earn
#>   <chr+lbl>              <dbl>
#> 1 Private, nonprofit    52645.
#> 2 Public                47104.
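
The summaries above rely on na.rm = TRUE and give every program equal weight. A minimal sketch on made-up numbers shows what each of those choices does. All values are hypothetical, and the weighted version is an illustration of an alternative, not something computed in this lecture.

```r
# hypothetical program-level mean debt, one value missing
debt_toy <- c(30000, NA, 40000)

mean(debt_toy)               # NA, because one value is missing
mean_debt_toy <- mean(debt_toy, na.rm = TRUE) # drops the NA first
mean_debt_toy
#> [1] 35000

# alternative: weight each program by its (hypothetical) number of borrowers
debt_complete <- c(30000, 40000)
n_borrowers   <- c(10, 30)
wmean_debt_toy <- weighted.mean(debt_complete, w = n_borrowers)
wmean_debt_toy
#> [1] 37500
```

The unweighted "mean of means" treats a program with 10 borrowers the same as a program with 30 borrowers, which is why the two results differ.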

3 Review of Statistics

3.1 Statistical Inference

In general, the goal of statistical inference is to infer something about a population based on a sample from that population

Population Parameter: a measure of the population

e.g., mean household income for all U.S. households; proportion of all American voters approving of the president’s performance
We almost never know the value of a population parameter because we rarely have data on the entire population

Sample: a subset of the population

Estimator (or Statistic): a formula or procedure used to estimate the value of the population parameter using a sample of the population

e.g., formula to calculate sample mean ($\bar{Y}$) or sample proportion ($\hat{p}$)

Point Estimate: numeric value generated from calculating an estimator from a specific sample of data

e.g., suppose the sample mean household income in the 2017 American Community Survey is $81,000
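
The distinction between a population parameter, an estimator, and a point estimate can be sketched with simulated data. Everything below is made up for illustration: the "population" is 100,000 simulated household incomes, not real data, and in practice we would not be able to compute the population mean at all.

```r
set.seed(152) # arbitrary seed so the simulation is reproducible

# hypothetical population of household incomes
population <- rnorm(n = 100000, mean = 80000, sd = 20000)

# population parameter: the population mean (normally unknown)
mu <- mean(population)

# draw a random sample of 1,000 households from the population
sample_y <- sample(population, size = 1000)

# estimator: the sample mean formula; point estimate: the number it returns
y_bar <- mean(sample_y)

mu
y_bar
```

The point estimate y_bar will be close to mu but not identical to it, and a different random sample would give a different point estimate.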

3.2 Parameters and Point Estimates

Introductory statistics class

Parameter: Population mean ($\mu_Y$)
Estimator: Sample mean ($\bar{Y}$)
RQs: What is the mean GPA of UCLA students living in residence halls?

Multivariate regression class

Parameter: Population regression coefficient ($\beta$)
Estimator: Sample regression coefficient ($\hat{\beta}$)
RQs: What is the effect of living in residence halls on GPA?

3.3 Notation for Parameters

Population Parameters

Denoted by Greek letters, usually lowercase
- $\mu$ : “mu” refers to population mean
- $\sigma$: “sigma” refers to population standard deviation
- $\beta$: “beta” refers to population regression coefficient
Subscripts usually denote population parameters of certain variables
- $\mu_Y$ : “mu Y” refers to population mean of the variable Y
- $\sigma_X$: “sigma X” refers to population standard deviation of the variable X

Estimators of Population Parameters (two general approaches)

Denoted using Greek letters with a “hat”
- $\hat{\mu}_Y$ : “mu Y hat” refers to the estimate of $\mu_Y$
- $\hat{\beta}_X$: “beta X hat” refers to the estimate of $\beta_X$
Denoted using Latin (Roman) letters
- $\bar{Y}$: “Y bar” refers to the estimate of $\mu_Y$
- $s_X$: “S of X” refers to the estimate of $\sigma_X$
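
As a concrete reference, the usual formulas behind these two estimators are $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ and $s_X = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$. The sketch below checks on a made-up vector that R's mean() and sd() functions match these formulas (the vector `y` is hypothetical):

```r
# hypothetical sample of four observations
y <- c(2, 4, 6, 8)
n <- length(y)

# sample mean "by hand" vs. the built-in function
y_bar <- sum(y) / n
y_bar
#> [1] 5
mean(y)
#> [1] 5

# sample standard deviation "by hand" (note the n - 1 denominator)
s_y <- sqrt(sum((y - y_bar)^2) / (n - 1))
all.equal(s_y, sd(y))
#> [1] TRUE
```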

3.4 Variables and variation

3.4.1 Variables

In statistics, we are usually working with a dataset (called a ‘data frame’ in R) that contains variables (columns) and observations (rows)

df_scorecard_edu %>% glimpse() # a dataset
#> Rows: 773
#> Columns: 24
#> $ opeid6                        <chr> "001009", "001009", "001009", "001009...
#> $ unitid                        <dbl> 100858, 100858, 100858, 100858, 10085...
#> $ instnm                        <chr> "Auburn University", "Auburn Universi...
#> $ control                       <chr+lbl> Public, Public, Public, Public, P...
#> $ ccbasic                       <dbl+lbl> 15, 15, 15, 15, 15, 15, 15, 15, 1...
#> $ stabbr                        <chr> "AL", "AL", "AL", "AL", "AL", "AL", "...
#> $ city                          <chr> "Auburn", "Auburn", "Auburn", "Auburn...
#> $ cipdig2                       <chr> "13", "13", "13", "13", "13", "13", "...
#> $ cipcode                       <chr> "1304", "1310", "1311", "1312", "1313...
#> $ cipdesc                       <chr> "Educational Administration and Super...
#> $ credlev                       <dbl+lbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
#> $ creddesc                      <chr> "Master's Degree", "Master's Degree",...
#> $ ipedscount1                   <chr> "53", "34", "27", "59", "61", "36", "...
#> $ ipedscount2                   <chr> "30", "39", "36", "64", "56", "22", "...
#> $ debt_all_stgp_eval_n          <dbl> 34, 38, 32, 52, 43, 45, 20, 40, 25, 2...
#> $ debt_all_stgp_eval_mean       <dbl> 29768, 43336, 41572, 31462, 28726, 35...
#> $ debt_all_stgp_eval_mdn        <dbl> 30942, 41000, 43500, 30000, 20500, 32...
#> $ debt_all_stgp_eval_mdn10yrpay <dbl> 318, 421, 447, 308, 210, 335, 295, 21...
#> $ earn_count_wne_hi_1yr         <dbl> 46, 34, 26, 56, 82, 51, 32, 52, 21, 3...
#> $ earn_mdn_hi_1yr               <dbl> 46321, 42905, 37568, 44495, 49179, 52...
#> $ earn_count_wne_hi_2yr         <dbl> 36, 24, 30, 40, 83, 40, 34, 36, 20, 4...
#> $ earn_mdn_hi_2yr               <dbl> 41844, 44021, 34603, 48041, 45249, 53...
#> $ region                        <dbl+lbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
#> $ locale                        <dbl+lbl> 13, 13, 13, 13, 13, 12, 12, 12, 1...

In general, a variable is something that varies

A variable contains all observations for a particular characteristic/measure

# the variable 'cipdesc' from the dataset df_scorecard_edu
df_scorecard_edu %>% select(cipdesc)
#> # A tibble: 773 x 1
#>    cipdesc                                           
#>    <chr>                                             
#>  1 Educational Administration and Supervision.       
#>  2 Special Education and Teaching.                   
#>  3 Student Counseling and Personnel Services.        
#>  4 Teacher Education and Professional Development, Sp
#>  5 Teacher Education and Professional Development, Sp
#>  6 Educational Administration and Supervision.       
#>  7 Special Education and Teaching.                   
#>  8 Teacher Education and Professional Development, Sp
#>  9 Teacher Education and Professional Development, Sp
#> 10 Educational Administration and Supervision.       
#> # ... with 763 more rows

# a different approach to showing the same variable
df_scorecard_edu$cipdesc[1:10]
#>  [1] "Educational Administration and Supervision."       
#>  [2] "Special Education and Teaching."                   
#>  [3] "Student Counseling and Personnel Services."        
#>  [4] "Teacher Education and Professional Development, Sp"
#>  [5] "Teacher Education and Professional Development, Sp"
#>  [6] "Educational Administration and Supervision."       
#>  [7] "Special Education and Teaching."                   
#>  [8] "Teacher Education and Professional Development, Sp"
#>  [9] "Teacher Education and Professional Development, Sp"
#> [10] "Educational Administration and Supervision."

Continuous Variables

Variables that take on a continuum of possible values where the distance between one value and another is meaningful
Examples: Age, income, GPA, test scores

Variable ipedscount2 = number of degrees awarded in most recent academic year reported

df_scorecard_edu %>% select(ipedscount2) %>% var_label() # variable label
#> $ipedscount2
#> [1] "Number of awards to all students in year 2 of the pooled debt cohort"

df_scorecard_edu %>% select(instnm,opeid6,cipdesc,ipedscount2)
#> # A tibble: 773 x 4
#>    instnm                   opeid6 cipdesc                           ipedscount2
#>    <chr>                    <chr>  <chr>                             <chr>      
#>  1 Auburn University        001009 Educational Administration and S~ 30         
#>  2 Auburn University        001009 Special Education and Teaching.   39         
#>  3 Auburn University        001009 Student Counseling and Personnel~ 36         
#>  4 Auburn University        001009 Teacher Education and Profession~ 64         
#>  5 Auburn University        001009 Teacher Education and Profession~ 56         
#>  6 The University of Alaba~ 001051 Educational Administration and S~ 22         
#>  7 The University of Alaba~ 001051 Special Education and Teaching.   28         
#>  8 The University of Alaba~ 001051 Teacher Education and Profession~ 65         
#>  9 The University of Alaba~ 001051 Teacher Education and Profession~ 38         
#> 10 University of Alabama a~ 001052 Educational Administration and S~ 30         
#> # ... with 763 more rows

# showing just the first ten values of variable ipedscount2
df_scorecard_edu$ipedscount2[1:10]
#>  [1] "30" "39" "36" "64" "56" "22" "28" "65" "38" "30"
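
Note that ipedscount2 is stored as a character variable (`<chr>`) in the output above, so it must be converted to numeric before calculating statistics such as the mean. A minimal sketch using the ten values printed above (the object names are hypothetical):

```r
# the first ten values of ipedscount2, copied from the output above
ipedscount2_chr <- c("30", "39", "36", "64", "56", "22", "28", "65", "38", "30")

# convert from character to numeric
ipedscount2_num <- as.numeric(ipedscount2_chr)

# now arithmetic works
mean(ipedscount2_num)
#> [1] 40.8
```

In the full dataset, text values such as "NULL" cannot be parsed as numbers, so as.numeric() would convert them to NA and issue a warning.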

Discrete Variables (or categorical variables)

Variables that can only take on specific (discrete) integer values
Can be nominal (e.g., school type), ordinal (e.g., likert scale), and quantitative (e.g., number of children)
- nominal: variable values indicate which category an observation belongs to but should not be interpreted as high vs. low (e.g., marital status, religion)
- ordinal: variable values indicate whether one observation is higher or lower than another but distance between variable values is not meaningful (e.g., 1=very satisfied, 2=satisfied)
- quantitative: distance between one value and another is meaningful; like a continuous variable but takes on a small number of different values
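
One way to represent these three kinds of discrete variables in base R is sketched below. The vectors are made up for illustration; the lecture data instead stores categorical variables as labelled numeric variables.

```r
# nominal: unordered factor; categories have no high/low interpretation
school_type <- factor(c("public", "private", "public"))
is.ordered(school_type)
#> [1] FALSE

# ordinal: ordered factor; comparisons are meaningful, distances are not
satisfaction <- factor(
  c("satisfied", "very satisfied", "dissatisfied"),
  levels = c("dissatisfied", "satisfied", "very satisfied"),
  ordered = TRUE
)
satisfaction[2] > satisfaction[1] # "very satisfied" ranks above "satisfied"
#> [1] TRUE

# quantitative: integer counts; distances and means are meaningful
n_children <- c(0L, 2L, 1L)
mean(n_children)
#> [1] 1
```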

Variable region = geographic region (as defined by IPEDS) the university is located in

this is a “nominal” categorical variable
each region is associated with a numeric value, but the numeric value isn’t meaningful in itself
each value of region has a “value label” to make it easier to interpret observations


# <dbl+lbl> in the glimpse() output indicates this variable has "value labels"
df_scorecard_edu %>% select(region) %>% glimpse()
#> Rows: 773
#> Columns: 1
#> $ region <dbl+lbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
df_scorecard_edu$region %>% class()
#> [1] "haven_labelled"

df_scorecard_edu %>% select(region) %>% var_label() # variable label
#> $region
#> [1] "Region (IPEDS)"

df_scorecard_edu %>% select(region) %>% val_labels() # label assigned to variable values
#> $region
#>                                       U.S. Service Schools 
#>                                                          0 
#>                       New England (CT, ME, MA, NH, RI, VT) 
#>                                                          1 
#>                          Mid East (DE, DC, MD, NJ, NY, PA) 
#>                                                          2 
#>                           Great Lakes (IL, IN, MI, OH, WI) 
#>                                                          3 
#>                        Plains (IA, KS, MN, MO, NE, ND, SD) 
#>                                                          4 
#> Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV) 
#>                                                          5 
#>                                 Southwest (AZ, NM, OK, TX) 
#>                                                          6 
#>                       Rocky Mountains (CO, ID, MT, UT, WY) 
#>                                                          7 
#>                          Far West (AK, CA, HI, NV, OR, WA) 
#>                                                          8 
#>            Outlying Areas (AS, FM, GU, MH, MP, PR, PW, VI) 
#>                                                          9

# print a few observations
df_scorecard_edu %>% select(instnm,opeid6,cipdesc,region)
#> # A tibble: 773 x 4
#>    instnm            opeid6 cipdesc                                       region
#>    <chr>             <chr>  <chr>                                      <dbl+lbl>
#>  1 Auburn University 001009 Educational Administrat~ 5 [Southeast (AL, AR, FL, ~
#>  2 Auburn University 001009 Special Education and T~ 5 [Southeast (AL, AR, FL, ~
#>  3 Auburn University 001009 Student Counseling and ~ 5 [Southeast (AL, AR, FL, ~
#>  4 Auburn University 001009 Teacher Education and P~ 5 [Southeast (AL, AR, FL, ~
#>  5 Auburn University 001009 Teacher Education and P~ 5 [Southeast (AL, AR, FL, ~
#>  6 The University o~ 001051 Educational Administrat~ 5 [Southeast (AL, AR, FL, ~
#>  7 The University o~ 001051 Special Education and T~ 5 [Southeast (AL, AR, FL, ~
#>  8 The University o~ 001051 Teacher Education and P~ 5 [Southeast (AL, AR, FL, ~
#>  9 The University o~ 001051 Teacher Education and P~ 5 [Southeast (AL, AR, FL, ~
#> 10 University of Al~ 001052 Educational Administrat~ 5 [Southeast (AL, AR, FL, ~
#> # ... with 763 more rows

# frequency count of the variable
df_scorecard_edu %>% count(region)
#> # A tibble: 8 x 2
#>                                                           region     n
#>                                                        <dbl+lbl> <int>
#> 1 1 [New England (CT, ME, MA, NH, RI, VT)]                          36
#> 2 2 [Mid East (DE, DC, MD, NJ, NY, PA)]                            117
#> 3 3 [Great Lakes (IL, IN, MI, OH, WI)]                             138
#> 4 4 [Plains (IA, KS, MN, MO, NE, ND, SD)]                           50
#> 5 5 [Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)]   232
#> 6 6 [Southwest (AZ, NM, OK, TX)]                                    93
#> 7 7 [Rocky Mountains (CO, ID, MT, UT, WY)]                          37
#> 8 8 [Far West (AK, CA, HI, NV, OR, WA)]                             70

# frequency count of the variable, but show value labels rather than underlying variable value
df_scorecard_edu %>% count(region) %>% as_factor()
#> # A tibble: 8 x 2
#>   region                                                         n
#>   <fct>                                                      <int>
#> 1 New England (CT, ME, MA, NH, RI, VT)                          36
#> 2 Mid East (DE, DC, MD, NJ, NY, PA)                            117
#> 3 Great Lakes (IL, IN, MI, OH, WI)                             138
#> 4 Plains (IA, KS, MN, MO, NE, ND, SD)                           50
#> 5 Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)   232
#> 6 Southwest (AZ, NM, OK, TX)                                    93
#> 7 Rocky Mountains (CO, ID, MT, UT, WY)                          37
#> 8 Far West (AK, CA, HI, NV, OR, WA)                             70

3.4.2 Types of variation

Several types of “variation” exist

Cross-sectional data: data on different “observations” (e.g., students, classrooms, universities) at a single point in time [focus of this course]

e.g., California Test Score data where “observations” are districts

Time-Series Data: data on a single “observation” collected at multiple time points

e.g., US inflation and unemployment rate data 1959-2000
e.g., stock price for a single stock each minute of the day for 7 days

Longitudinal Data (or panel data): data on multiple “observations” at multiple time points

e.g., annual earnings for a person, measured every year for 10 years
usually, the dataset has one observation for each person-year; so if we have annual earnings data over 10 years for 10 people, the dataset has 10 x 10 = 100 observations
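A minimal sketch of the person-year structure, and of how cross-sectional and time-series data relate to it. The variable names (`person_id`, `year`) and the years 2011-2020 are hypothetical, chosen only for illustration.

```r
# hypothetical panel: 10 people, each observed annually for 10 years
person_id <- 1:10
year <- 2011:2020

# expand.grid() creates one row for every person-year combination
panel <- expand.grid(person_id = person_id, year = year)
nrow(panel) # number of person-year observations
#> [1] 100

# one year of the panel is cross-sectional data (many people, one time point)
cross_section <- panel[panel$year == 2020, ]
nrow(cross_section)
#> [1] 10

# one person from the panel is time-series data (one person, many time points)
time_series <- panel[panel$person_id == 1, ]
nrow(time_series)
#> [1] 10
```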

3.4.3 Experimental vs. observational data

Experimental Data: obtained from experiments designed to assess the causal effect of a “treatment” on an outcome

each unit of analysis randomly assigned to “treatment” or “control” group
e.g., students are randomly assigned to enroll in an MA program at a private university (treatment) or a public university (control); we are interested in measuring the effect of private vs. public enrollment on the outcomes of debt and earnings after program participation
randomized controlled trials (experiments) are the gold standard of program evaluation [will learn more about this]

Observational Data: obtained from sources such as surveys and administrative records [focus of this course]

researchers have no control over the “treatment”
- units (people, organizations) choose what they want to participate in
e.g., students decide for themselves whether to enroll in an MA program at a private university (treatment) or a public university (control); we are interested in measuring the effect of private vs. public enrollment on the outcomes of debt and earnings after program participation
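A minimal sketch of what “randomly assigned” means in practice, assuming a hypothetical pool of 20 students. In observational data the `group` column would instead record whatever each student chose, so the researcher would not control it.

```r
set.seed(123) # arbitrary seed so the random assignment is reproducible

# hypothetical pool of 20 students
students <- data.frame(student_id = 1:20)

# experimental design: shuffle 10 "treatment" and 10 "control" labels,
# so chance alone decides which student gets which label
students$group <- sample(rep(c("treatment", "control"), each = 10))

# each group has exactly 10 students
table(students$group)
```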

3.5 Descriptive stats: central tendency vs. dispersion

Descriptive statistics describe data

Variables have measures of “central tendency” (e.g., mean, median, mode)
Variables have measures of “dispersion,” which identify how individual observations differ from one another (e.g., standard deviation, range)
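To see why both kinds of measures are needed, here is a small sketch with two hypothetical variables, `a` and `b`, that have the same mean but very different dispersion.

```r
# two hypothetical variables with the same central tendency
a <- c(9, 10, 10, 10, 11)
b <- c(0, 5, 10, 15, 20)

mean(a)
#> [1] 10
mean(b)
#> [1] 10

# but different dispersion: values of b are much more spread out
range(a) # minimum and maximum
#> [1]  9 11
range(b)
#> [1]  0 20

sd(a) # standard deviation
#> [1] 0.7071068
sd(b)
#> [1] 7.905694
```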

3.5.1 Measures of Central Tendency: Sample Mean

Sample mean of Y, denoted $\bar{Y}$

$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$
- where subscript i refers to observation i and n is the number of observations

Example: Variable Y has the following six observations ($Y_1, \ldots, Y_6$)

Obs: $Y_1 = 5, Y_2 = 2, Y_3 = 13, Y_4 = 11, Y_5 = 18, Y_6 = 22$
Calculate $\bar{Y}$

Calculate sample mean in R

x <- c(5,2,13,11,18,22)
length(x)
#> [1] 6

mean(x)
#> [1] 11.83333

# "by hand"
sum(x) # sum of x
#> [1] 71
length(x) # number of observations
#> [1] 6

sum(x)/length(x)
#> [1] 11.83333

Calculate the mean of (program-level mean) debt from Stafford and Grad PLUS loans, using the first 10 observations

df_scorecard_edu %>% select(debt_all_stgp_eval_mean) %>% var_label()
#> $debt_all_stgp_eval_mean
#> [1] "Average Stafford and Grad PLUS loan debt disbursed at this institution"

df_scorecard_edu$debt_all_stgp_eval_mean[1:10]
#>  [1] 29768 43336 41572 31462 28726 35316 28032 32904 39299 25015

debt_i <- df_scorecard_edu$debt_all_stgp_eval_mean[1:10]
debt_i
#>  [1] 29768 43336 41572 31462 28726 35316 28032 32904 39299 25015

# calculate mean
mean(debt_i)
#> [1] 33543

# calculate by hand
sum(debt_i) # sum of values
#> [1] 335430
length(debt_i) # number of observations
#> [1] 10

sum(debt_i)/length(debt_i)
#> [1] 33543

Other measures of central tendency:

Median
Mode

median(debt_i)
#> [1] 32183
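A sketch of the median “by hand” and of one way to find the mode. The ten debt values are copied from the output above so the chunk runs on its own. Base R has no built-in function for the statistical mode (`mode()` returns the storage type of an object), so `stat_mode()` below is a hypothetical helper, not part of any package.

```r
# first 10 debt values, copied from the printed output above
debt_i <- c(29768, 43336, 41572, 31462, 28726, 35316, 28032, 32904, 39299, 25015)

# median by hand: sort the values; with an even number of observations,
# average the two middle values
sorted <- sort(debt_i)
n <- length(sorted)
(sorted[n/2] + sorted[n/2 + 1]) / 2
#> [1] 32183

# hypothetical helper: return the most frequent value(s) of a numeric vector
stat_mode <- function(v) {
  counts <- table(v) # frequency count of each distinct value
  as.numeric(names(counts)[counts == max(counts)])
}

y <- c(2, 3, 3, 5, 3, 2) # hypothetical values; 3 appears most often
stat_mode(y)
#> [1] 3
```

All ten debt values are distinct, so the mode is not informative for `debt_i`; it is more useful for variables with repeated values.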

3.5.2 Measures of Dispersion: Standard Deviation

Sample standard deviation of Y, denoted $\hat\sigma_Y$

Standard deviation measures, roughly, how far a typical observation $Y_i$ is from the sample mean $\bar{Y}$
$\hat\sigma_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}}$
- “square root of the sum of squared deviations divided by n-1”
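A sketch of the formula applied step by step to the same six example values used for the sample mean, compared against R's built-in `sd()`. The intermediate object names (`deviations`, `sum_sq`, `sd_by_hand`) are chosen for illustration.

```r
x <- c(5, 2, 13, 11, 18, 22) # same six observations as the sample mean example

n <- length(x)            # number of observations
deviations <- x - mean(x) # each observation minus the sample mean
sum_sq <- sum(deviations^2) # sum of squared deviations

# square root of the sum of squared deviations divided by n-1
sd_by_hand <- sqrt(sum_sq / (n - 1))
sd_by_hand

# built-in function; also divides by n-1
sd(x)

# the two approaches agree
all.equal(sd_by_hand, sd(x))
#> [1] TRUE
```

For these values the standard deviation is about 7.57, so a typical observation is roughly 7.6 units away from the mean of 11.83.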