Grade: /45

Overview

In this problem set, you will work with data from the College Scorecard and IPEDS. The College Scorecard is an initiative from the U.S. Department of Education to provide students and families with important information about colleges and universities in the U.S., such as cost, debt, and earnings. We will also use IPEDS data, which gathers information about every college and university in the U.S. that receives federal financial aid. In this problem set, we will explore the relationship between cost of attendance (\(X\)) and earnings two years after graduating (\(Y\)) for graduates of MA programs in Education Administration and Supervision. In addition, we will use the categorical \(X\) variable “Carnegie” to run a regression that examines the relationship between institution classification (\(X\)) and earnings two years after graduating (\(Y\)).

The problem set is divided into three parts:

  • In part I, you will answer questions about model fit (\(R^2\) & SER)
  • In part II, you will answer questions about categorical \(X\) variables
  • In part III, you will answer questions about confidence intervals

If you have any questions about the problem set, please also post them on the #problemsets slack channel.

Click here for tips on notation


Some questions will ask you to write out notation and/or equations.

You can write out notation/equations in one of two ways: (1) using “inline equations,” which begin with a dollar sign $ and end with a dollar sign $; OR (2) in plain text, without dollar signs. We encourage you to try inline equations, but it is fine if you do not.

Tips on writing notation/equations using “inline equations”:

  • Make sure there are no spaces after the dollar sign $ that begins the equation and no spaces before the dollar sign that ends the equation.
    • For example, you would write out the notation for treated potential outcome like this: $Y_i(1)$, which renders as \(Y_i(1)\)
    • But this wouldn’t work: $ Y_i(1)$
    • And this wouldn’t work: $Y_i(1) $
  • Special characters – like greek letters – within inline equations are referred to using special symbols that start with a backslash
    • e.g., “Beta” is \beta: \(\beta\)
    • “Mu” (symbol for population mean) is \mu: \(\mu\)
  • Subscripts after a character or symbol are specified like this:
    • e.g., “Beta subscript 1” is \beta_1: \(\beta_1\)
    • e.g., “Mu subscript Y” (referring to population mean of variable Y) is \mu_Y: \(\mu_Y\)
  • “hats” are specified by wrapping the character/symbol within curly brackets \hat{} like this:
    • e.g., “Beta hat” is \hat{\beta}: \(\hat{\beta}\)
    • e.g., “Beta hat subscript 1” is \hat{\beta}_1 (note that the subscript is not within the “hat”): \(\hat{\beta}_1\)
  • “bars” are specified by wrapping the character/symbol within curly brackets \bar{} like this:
    • e.g., “sample mean of Y” is \bar{Y}: \(\bar{Y}\)
  • Don’t worry about getting it perfect and don’t spend too much time trying; if you are trying, that is a great start! It is also fine to use inline equations for some notation/equations and plain text for others that you can’t figure out.


Tips on writing notation/equations in plain text

  • Instead of writing \(Y_i(1)\), you could write this: Y_i(1)
  • Instead of writing \(Y_i = \beta_0 + \beta_1X_i + u_i\), you could write this: Y_i = beta_0 + beta_1*X_i + u_i
  • Instead of writing \(\hat{Y_i} = \hat{\beta}_0 + \hat{\beta}_1X_i\), you could write something like this: Y_hat_i = beta_hat_0 + beta_hat_1*X_i
  • don’t worry if it doesn’t look pretty!

Load libraries and data

Please run the code in the following chunk, which does the following:

  • Loads libraries
  • Loads and creates data frame from IPEDS/College Scorecard Masters degrees in Education

Note: code chunk omitted from html document using include = FALSE

Part I: Measures of model fit


/3

1. Define \(R^2\) in words. Write out the mathematical formula for \(R^2\) using both ways discussed in lecture.

  • YOUR ANSWER HERE: \(R^2\) is the fraction of the variance in \(Y\) that is explained by \(X\) (and that is not already explained by the sample mean, \(\bar{Y}\)).

\(R^2 = \frac{\text{variance in Y that is explained by X}}{\text{total variance in Y}} = \frac{ESS}{TSS}\)

\(R^2 = 1 - \frac{\text{variance in Y not explained by X}}{\text{total variance in Y}} = 1 - \frac{SSR}{TSS}\)

/3

2. Using the code from below to guide you, write out the formula for TSS and explain what it means in words. Do the same for ESS and SSR.

Consider the research question, what is the relationship between cost of attendance (\(X\)) and earnings (\(Y\)) (2 years after graduation)?

Below we use the lm() function and create an object named mod1 that contains results from the bivariate regression of the relationship between cost of attendance (\(X\)) and earnings 2 years after graduation (\(Y\)). We run the anova() function to get the values of ESS, SSR, TSS.

  • X= coa_grad_res
  • Y= earn_mdn_hi_2yr
mod1 <- lm(formula = earn_mdn_hi_2yr ~ coa_grad_res, data = df_edu)

summary(mod1)
#> 
#> Call:
#> lm(formula = earn_mdn_hi_2yr ~ coa_grad_res, data = df_edu)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -30748  -7469  -2021   4669  54245 
#> 
#> Coefficients:
#>                 Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)  47019.11832  2113.15268  22.251 < 0.0000000000000002 ***
#> coa_grad_res     0.31536     0.07279   4.332            0.0000195 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 12380 on 338 degrees of freedom
#>   (55 observations deleted due to missingness)
#> Multiple R-squared:  0.05261,    Adjusted R-squared:  0.0498 
#> F-statistic: 18.77 on 1 and 338 DF,  p-value: 0.00001948

anova(mod1)
#> Analysis of Variance Table
#> 
#> Response: earn_mdn_hi_2yr
#>               Df      Sum Sq    Mean Sq F value     Pr(>F)    
#> coa_grad_res   1  2876756208 2876756208  18.768 0.00001948 ***
#> Residuals    338 51807412707  153276369                       
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explained sum of squares (ESS) = 2,876,756,208
Sum of Squared Residuals (SSR) = 51,807,412,707
Total Sum of Squares (TSS) = ESS + SSR = 54,684,168,915

  • YOUR ANSWER HERE: TSS:

\(TSS = \sum_{i=1}^{n} (Y_i-\bar{Y})^2\)

Total sum of squares measures the total variation in \(Y\), that is, the sum of squared deviations of each observation \(Y_i\) from the sample mean \(\bar{Y}\).

  • YOUR ANSWER HERE: ESS:

\(ESS = \sum_{i=1}^{n} (\hat{Y_i}-\bar{Y})^2\)

Explained sum of squares measures the amount of variation in Y explained by X.

  • YOUR ANSWER HERE: SSR:

\(SSR = \sum_{i=1}^{n} (Y_i-\hat{Y_i})^2\)

The sum of squared residuals measures the amount of variation in Y not explained by X.
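The three formulas can be checked by hand with a short sketch. It uses R's built-in mtcars data as a stand-in (an assumption, because df_edu is loaded in the omitted chunk); mpg and wt are placeholders for \(Y\) and \(X\), and mod_demo is a hypothetical object name.

```r
# Stand-in example: mtcars ships with R; mpg plays the role of Y, wt of X
mod_demo <- lm(mpg ~ wt, data = mtcars)

y     <- mtcars$mpg
y_hat <- fitted(mod_demo)   # predicted values, Y-hat_i
y_bar <- mean(y)            # sample mean, Y-bar

tss <- sum((y - y_bar)^2)       # total sum of squares
ess <- sum((y_hat - y_bar)^2)   # explained sum of squares
ssr <- sum((y - y_hat)^2)       # sum of squared residuals

# TSS = ESS + SSR, and both R^2 formulas agree with summary()
all.equal(tss, ess + ssr)
all.equal(ess / tss, summary(mod_demo)$r.squared)
all.equal(1 - ssr / tss, summary(mod_demo)$r.squared)
```

Each all.equal() call should return TRUE for an OLS model that includes an intercept.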

/2

3. Using the values for ESS, SSR, and TSS from above, calculate \(R^2\) in both ways discussed in lecture (can do it by hand below or in a code chunk). Interpret the value of \(R^2\) in words.

  • YOUR ANSWER HERE: The model explains 5.3% of the variation in Y (that is not already explained by sample mean, \(\bar{Y}\))
#ESS
anova(mod1)$"Sum Sq"[1]
#> [1] 2876756208

#SSR
anova(mod1)$"Sum Sq"[2]
#> [1] 51807412707

#TSS
anova(mod1)$"Sum Sq"[1] + anova(mod1)$"Sum Sq"[2]
#> [1] 54684168915

#R2, ESS/TSS
anova(mod1)$"Sum Sq"[1] / (anova(mod1)$"Sum Sq"[1] + anova(mod1)$"Sum Sq"[2])
#> [1] 0.05260675

#R2, 1 - SSR/TSS
1 - anova(mod1)$"Sum Sq"[2] / (anova(mod1)$"Sum Sq"[1] + anova(mod1)$"Sum Sq"[2])
#> [1] 0.05260675

/4

4. Explain what sample standard deviation of a variable means in words and write out the formula. Explain what standard error of the regression (SER) means in words and write out the formula for SER (in terms of SSR).

  • YOUR ANSWER HERE:

Sample standard deviation = The sample standard deviation of \(Y\), \(\hat{\sigma}_Y\), measures the average distance between a random observation \(Y_i\) and the sample mean \(\bar{Y}\).

\(\hat{\sigma}_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i-\bar{Y})^2}{n -1}}\)

Standard error of the regression (SER) = The standard error of the regression is an estimate of how far away, on average, the actual observed value \(Y_i\) is from the predicted value \(\hat{Y_i}\) for a random observation \(i\).

\(SER = \sqrt{\frac{\sum_{i=1}^{n} (Y_i-\hat{Y_i})^2}{n -2}} = \sqrt{\frac{\sum_{i=1}^{n} \hat{u}_i^2}{n -2}} = \sqrt{\frac{SSR}{n -2}}\)
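As a quick check of the SER formula, the sketch below plugs in the SSR reported by anova(mod1) earlier in this part. The value of n is inferred from the 338 residual degrees of freedom plus the two estimated coefficients; the object names are illustrative.

```r
# Values copied from the anova(mod1) output in question 2
ssr <- 51807412707   # sum of squared residuals
n   <- 340           # 338 residual degrees of freedom + 2 estimated coefficients

ser <- sqrt(ssr / (n - 2))
ser   # about 12380, in line with "Residual standard error" in summary(mod1)
```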

/2

5. For the analysis of the relationship between cost of attendance (\(X\)) and debt (\(Y\)), do the following: run the regression in R (using lm() and summary()) and assign it to the object mod2; report the SER; and use sd() to calculate the standard deviation of the dependent variable in the regression model, debt_all_stgp_eval_mean.

  • X= coa_grad_res
  • Y= debt_all_stgp_eval_mean
mod2 <- lm(formula = debt_all_stgp_eval_mean ~ coa_grad_res, data = df_edu)

summary(mod2)
#> 
#> Call:
#> lm(formula = debt_all_stgp_eval_mean ~ coa_grad_res, data = df_edu)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -25356  -5537  -1466   4930  29968 
#> 
#> Coefficients:
#>                Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept)  18258.3119  1347.8901  13.546 <0.0000000000000002 ***
#> coa_grad_res     0.4527     0.0465   9.737 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 8378 on 393 degrees of freedom
#> Multiple R-squared:  0.1943, Adjusted R-squared:  0.1923 
#> F-statistic:  94.8 on 1 and 393 DF,  p-value: < 0.00000000000000022

#SER
summary(mod2)$sigma
#> [1] 8377.512

#Standard deviation of debt
sd(df_edu$debt_all_stgp_eval_mean, na.rm = TRUE)
#> [1] 9321.583

/3

6. Interpret the SER from the above model in words. Interpret sample standard deviation from above model in words. Does our model make our prediction substantially better?

  • YOUR ANSWER HERE:

Interpretation of SER: On average, observed values of institution-level student debt (\(Y_i\)) are 8,377.51 dollars away from predicted values of institution-level student debt \(\hat{Y_i}\).

Interpretation of sample standard deviation: On average, observations of \(Y_i\) are 9,321.58 dollars away from the sample mean \(\bar{Y}\) of Y.

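  • Yes, the model makes our prediction better: the SER (8,377.51 dollars) is smaller than the sample standard deviation (9,321.58 dollars), so predictions from the regression line are, on average, about 944 dollars closer to the observed values than predictions based on the sample mean \(\bar{Y}\) alone.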
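To put a number on the improvement, the sketch below compares the SER with the sample standard deviation using the two values reported in question 5. It also uses the identity that adjusted \(R^2\) equals \(1 - SER^2 / \hat{\sigma}_Y^2\), which holds here because the standard deviation and the regression use the same observations; the object names are illustrative.

```r
ser  <- 8377.512   # summary(mod2)$sigma
sd_y <- 9321.583   # sd(df_edu$debt_all_stgp_eval_mean, na.rm = TRUE)

# Proportional reduction in the typical prediction error
reduction <- 1 - ser / sd_y
reduction   # roughly 0.10

# Adjusted R-squared expressed through SER and the sample standard deviation
adj_r2 <- 1 - ser^2 / sd_y^2
adj_r2      # roughly 0.192, in line with summary(mod2)
```

A reduction of about 10 percent in typical prediction error is a modest gain, which is consistent with the fairly low \(R^2\) of the model.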

Part II: Categorical X variables

In this section, we will explore the relationship between Carnegie classification (\(X\)) and earnings (\(Y\)) 2 years after graduating for graduates of MA programs in Education Administration and Supervision. The \(X\) variable in this model is carnegie, a factor variable that represents a framework for classifying higher education institutions in the U.S. See here for more info. The \(Y\) variable in this model is earn_mdn_hi_2yr, earnings two years after graduating.

  • \(X\) = carnegie
  • \(Y\) = earn_mdn_hi_2yr

/1

1. Explain what a reference group is in words.

  • YOUR ANSWER HERE: The reference group is the group in our model that all other groups will be compared to.

/1

2. In our analysis of the relationship between carnegie (\(X\)) and earnings (\(Y\)), which category of the carnegie variable will be the reference group?

Below is a frequency count of our \(X\) (factor) variable carnegie.

df_edu_fac %>% count(carnegie)
#> # A tibble: 4 x 2
#>   carnegie       n
#>   <fct>      <int>
#> 1 research 1    75
#> 2 research 2   140
#> 3 masters 1    135
#> 4 masters 2     45
df_edu_fac %>% count(as.integer(carnegie))
#> # A tibble: 4 x 2
#>   `as.integer(carnegie)`     n
#>                    <int> <int>
#> 1                      1    75
#> 2                      2   140
#> 3                      3   135
#> 4                      4    45
  • YOUR ANSWER HERE: Research 1 institutions will be the reference group in our model because R automatically uses the first level of a factor variable (the category with the lowest underlying integer value, here 1) as the reference group.

/2

3. What is a factor variable and why is it important for running a regression?

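  • YOUR ANSWER HERE: A factor variable is a vector of integer values in which each integer is associated with a category label (a “level”). When running a regression with a categorical \(X\) variable, the variable should be of class factor so that values are stored as integers mapped to labels rather than as strings (e.g., “Married”, “Single”, etc.); lm() then creates a 0/1 indicator variable for each category other than the reference group.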
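A small sketch of how factor levels, integer codes, and the reference group relate. The vector below is hypothetical toy data, not the problem set's df_edu_fac, and the object names are illustrative.

```r
# Hypothetical toy data (not df_edu_fac)
type_chr <- c("masters 1", "research 1", "research 2", "research 1", "masters 2")

# Levels are listed in the order we want R to store them
type_fac <- factor(type_chr,
                   levels = c("research 1", "research 2", "masters 1", "masters 2"))

levels(type_fac)       # the first level is the reference group in lm()
as.integer(type_fac)   # underlying integer codes: 3 1 2 1 4

# relevel() changes the reference group without changing the data
type_fac2 <- relevel(type_fac, ref = "masters 1")
levels(type_fac2)[1]   # "masters 1"
```

Choosing the reference group only changes which comparisons the coefficients report; the predicted value for each category stays the same.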

/6

4. For your analysis of the relationship between classification of university (\(X\)) and institution-level student earnings (\(Y\)) for graduates of MA programs in Education Administration and Supervision, do the following: write out the population linear regression model (label the symbols and write out what the variables \(X\) and \(Y\) actually represent); write out the OLS prediction line (without estimate values); write out the OLS prediction line (with estimate values);

  • Some investigations of the categorical \(X\) variable carnegie

  • X= carnegie

  • Y= earn_mdn_hi_2yr

mod3 <- lm(formula = earn_mdn_hi_2yr ~ carnegie, data = df_edu_fac)
summary(mod3)
#> 
#> Call:
#> lm(formula = earn_mdn_hi_2yr ~ carnegie, data = df_edu_fac)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -32770  -7465  -1999   4321  62838 
#> 
#> Coefficients:
#>                    Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)           51988       1519  34.228 < 0.0000000000000002 ***
#> carnegieresearch 2     2745       1898   1.446             0.149064    
#> carnegiemasters 1      6346       1904   3.333             0.000955 ***
#> carnegiemasters 2      5450       2685   2.030             0.043189 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 12530 on 336 degrees of freedom
#>   (55 observations deleted due to missingness)
#> Multiple R-squared:  0.03607,    Adjusted R-squared:  0.02746 
#> F-statistic: 4.191 on 3 and 336 DF,  p-value: 0.006241
  • YOUR ANSWER HERE:

  • Population linear regression model: \(Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \beta_3X_{3i} + u_i\)

  • where:

    • subscript \(i\) refers to university \(i\)
    • \(Y_i\) = institution-level student earnings (in dollars) at university i (measured by variable earn_mdn_hi_2yr)
    • \(X_i\) = type of institution (measured by carnegie), which has the following four categories: Research 1 [reference group]; Research 2; Masters 1; Masters 2
      • \(X_{1i}\): 0/1 “research 2”
      • \(X_{2i}\): 0/1 “masters 1”
      • \(X_{3i}\): 0/1 “masters 2”
    • \(\beta_0\) = (“population intercept”), average value of \(Y\) when \(X\) is the reference group category (that is, \(X_{1i}\)=0, \(X_{2i}\)=0, \(X_{3i}\)=0)
    • \(\beta_1\) = population regression coefficient associated with being a “research 2” rather than reference group
    • \(\beta_2\) = population regression coefficient associated with being a “masters 1” rather than reference group
    • \(\beta_3\) = population regression coefficient associated with being a “masters 2” rather than reference group
  • OLS prediction line (without estimates): \(\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1}X_{1i} + \hat{\beta_2}X_{2i} + \hat{\beta_3}X_{3i}\)

  • where:

    • \(\hat{Y_i}\) = predicted value of institution-level student earnings (in dollars) at university i (measured by variable earn_mdn_hi_2yr)
    • \(\hat{\beta_0}\) = predicted value of \(Y\) when all independent variables in the model (\(X_1\),\(X_2\),\(X_3\)) are equal to 0, that is, for the reference group “research 1”.
    • \(\hat{\beta_1}\)= predicted difference in \(Y\) when \(X_1\) = 1 “research 2”, relative to the reference group.
    • \(\hat{\beta_2}\)= predicted difference in \(Y\) when \(X_2\) = 1 “masters 1”, relative to the reference group.
    • \(\hat{\beta_3}\)= predicted difference in \(Y\) when \(X_3\) = 1 “masters 2”, relative to the reference group.
  • OLS prediction line (with estimates): \(\hat{Y_i} =\) 51,988.34 + 2,745.27 \(\times X_{1i}\) + 6,346.29 \(\times X_{2i}\) + 5,449.51 \(\times X_{3i}\)

/3

5. Interpret the point estimate value(s) of \(\hat{\beta}_1\), \(\hat{\beta}_2\), \(\hat{\beta}_3\) in words.

  • YOUR ANSWER HERE:

  • Interpretation of \(\hat{\beta_1}=\) 2,745.27 (Research 2)

    • “Graduating from a Research 2 university as opposed to a Research 1 university is, on average, associated with a 2,745.27 dollar change in institution-level student earnings for MA graduates of education programs”
  • Interpretation of \(\hat{\beta_2}=\) 6,346.29 (Masters 1)

    • “Graduating from a Masters 1 university as opposed to a Research 1 university is, on average, associated with a 6,346.29 dollar change in institution-level student earnings for MA graduates of education programs”
  • Interpretation of \(\hat{\beta_3}=\) 5,449.51 (Masters 2)

    • “Graduating from a Masters 2 university as opposed to a Research 1 university is, on average, associated with a 5,449.51 dollar change in institution-level student earnings for MA graduates of education programs”

/1

6. Interpret the point estimate value of \(\hat{\beta}_0\) in words.

  • YOUR ANSWER HERE:

  • Interpret point estimate value of \(\hat{\beta}_0\):

    • \(\hat{\beta_0}=\) 51,988 is the predicted institution-level student earnings (in dollars) for MA graduates of education programs who attended a Research 1 university (the reference group).

/3

7. What is the predicted value of \(Y\) for each of the following university types? Show work (OLS prediction line w/ estimates; then result of calculation).

OLS line with estimates

  • Hint: Our OLS prediction line looks something like this:\(\hat{Y_i} =\) 51,988.34 + 2,745.27 \(\times X_{1i}\) + 6,346.29 \(\times X_{2i}\) + 5,449.51 \(\times X_{3i}\)

Calculation:

  • If \(X_1\) = 1 (Research 2) and the other indicator variables (\(X_2\), \(X_3\)) are 0, then our OLS prediction line looks like this:

    • \(\hat{Y_i} =\) 51,988.34 + 2,745.27 \(\times 1\) + 6,346.29 \(\times 0\) + 5,449.51 \(\times 0\)

    • \(\hat{Y_i} =\) 54733.6033058

Now do the following for each value of X

  • Non reference group 2 (\(X_2\)) = (Master’s 1)

  • Non reference group 3 (\(X_3\)) = (Master’s 2)

  • Reference group = Research 1

  • YOUR ANSWER HERE:

Non reference group 2 (\(X_2\) = 1, Master’s 1): \(\hat{Y_i} =\) 51,988.34 + 2,745.27 \(\times 0\) + 6,346.29 \(\times 1\) + 5,449.51 \(\times 0\)

\(\hat{Y_i} =\) 58334.6302521

Non reference group 3 (\(X_3\) = 1, Master’s 2): \(\hat{Y_i} =\) 51,988.34 + 2,745.27 \(\times 0\) + 6,346.29 \(\times 0\) + 5,449.51 \(\times 1\)

\(\hat{Y_i} =\) 57437.84375

Reference group (Research 1): \(\hat{Y_i} =\) 51,988.34 + 2,745.27 \(\times 0\) + 6,346.29 \(\times 0\) + 5,449.51 \(\times 0\)

\(\hat{Y_i} =\) 51988.3382353
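The same predictions can be sketched in R by plugging 0/1 indicator values into the OLS prediction line. The coefficients below are the rounded estimates reported in the answer, so the results can differ from the unrounded values by about a cent; the object names are illustrative.

```r
# Rounded coefficient estimates from summary(mod3)
b0 <- 51988.34   # intercept: Research 1 (reference group)
b1 <- 2745.27    # Research 2
b2 <- 6346.29    # Masters 1
b3 <- 5449.51    # Masters 2

# One row of 0/1 indicators (X1, X2, X3) per university type
x <- rbind("research 1" = c(0, 0, 0),
           "research 2" = c(1, 0, 0),
           "masters 1"  = c(0, 1, 0),
           "masters 2"  = c(0, 0, 1))

# Predicted earnings: intercept plus the coefficient of the category that is "on"
y_hat <- b0 + x %*% c(b1, b2, b3)
round(y_hat[, 1], 2)
```

With the fitted model available, predict(mod3, newdata = ...) would give the unrounded versions of these values.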

/3

8. For one of the non-reference group categories of \(X\) (e.g., Research 2, Master’s 1, Master’s 2) do the following:

  • State the null and alternative hypothesis

  • Solve for value of t using information from the regression output

  • Using the output from the model, interpret the p-value in words and make a conclusion.

  • YOUR ANSWER HERE:

Hypothesis

  • \(H_0: \beta_1 = 0\)
  • \(H_a: \beta_1 \ne 0\)

T-value

  • calculate t-statistic
    • \(\hat{\beta}_1\): 2745.2651
    • \(SE(\hat{\beta}_1)\): 1898.3124
    • \(t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}=\) (2745.2651)/(1898.3124) = 1.4462
  • p-value associated with \(t\): 0.1491
    • interpretation: Under the assumption that \(H_0: \beta_1 =0\) is true, there is a 14.9 percent chance of obtaining a point estimate \(\hat{\beta_1}\) at least as far away from the hypothesized value (\(\beta_1 =0\)) as the one we observed.
    • p-value of 0.1491 is greater than the alpha-level of 0.05, so we do not reject \(H_0\).
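The t-statistic and p-value can be reproduced from the regression output with a short sketch. The 336 residual degrees of freedom come from summary(mod3); the object names are illustrative.

```r
b1       <- 2745.2651   # estimate for "research 2"
se1      <- 1898.3124   # its standard error
df_resid <- 336         # residual degrees of freedom from summary(mod3)

t_stat <- (b1 - 0) / se1                       # hypothesized value under H0 is 0
p_val  <- 2 * pt(-abs(t_stat), df = df_resid)  # two-sided p-value

t_stat   # about 1.446
p_val    # about 0.149
```

Doubling the lower-tail probability gives the two-sided p-value that matches the alternative hypothesis \(H_a: \beta_1 \ne 0\).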

Part III: Confidence interval about \(\beta_k\)

/1

1. In words, what is a confidence interval?

  • YOUR ANSWER HERE: A confidence interval gives us a range of values that contain the true value of a parameter with a prespecified probability.

/1

2. What is the formula for a confidence interval about some population parameter?

  • YOUR ANSWER HERE: \(\bar{Y} \pm z*SE(\bar{Y})\)

/1

3. What is the formula for a 95% confidence interval about a population regression coefficient \(\beta_k\)?

  • YOUR ANSWER HERE: \(\hat{\beta}_k \pm 1.96*SE(\hat{\beta}_k)\)

/3

4. For a non-reference group category of variable \(X\) (a different category than the one you chose for previous set of questions), do the following:

  • Interpret \(\hat{\beta}_k\) in words
  • Calculate the 95% confidence interval
  • Interpret the 95% confidence interval in words


  • YOUR ANSWER HERE:

\(\hat{\beta}_1\) = “Graduating from a Research 2 university as opposed to a Research 1 university is, on average, associated with a 2,745.27 dollar change in institution-level student earnings for MA graduates of education programs”

  • Formula: \(\hat{\beta}_k \pm 1.96*SE(\hat{\beta}_k)\)
    • \(\hat{\beta_1}=\) 2745.2651
    • \(SE(\hat{\beta_1})=\) 1898.3124
    • lower bound: \(\hat{\beta}_1 - 1.96*SE(\hat{\beta}_1)=\) -975.43
    • upper bound: \(\hat{\beta}_1 + 1.96*SE(\hat{\beta}_1)=\) 6465.96
  • Interpretation:
    • We are 95% confident that the population parameter \(\beta_1\) lies somewhere between -975.43 and 6465.96
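The interval can be reproduced with a short sketch. The first calculation uses the large-sample critical value 1.96 from the formula; the second is an assumption about what an exact calculation would look like, using the t critical value with 336 residual degrees of freedom (the approach confint() takes for lm objects). The object names are illustrative.

```r
b1  <- 2745.2651   # estimate for "research 2"
se1 <- 1898.3124   # its standard error

# Large-sample critical value used in the answer
ci_z <- b1 + c(-1, 1) * 1.96 * se1
round(ci_z, 2)   # -975.43 6465.96

# Assumption: exact t critical value with 336 residual degrees of freedom
ci_t <- b1 + c(-1, 1) * qt(0.975, df = 336) * se1
ci_t             # slightly wider than ci_z
```

Both intervals contain 0, which agrees with the earlier decision not to reject \(H_0: \beta_1 = 0\) at the 0.05 level.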

Part IV: Post a comment/question

/2

  • Go to the class #problemsets channel and create a new post.
  • You can either:
    • Share something you learned or a question from this problem set. Make sure to mention the instructors (@ozanj, @Patricia Martín).
    • Respond to a post made by another student.

Knit to html and submit problem set

Knit to html by clicking the “Knit” button near the top of your RStudio window (icon with blue yarn ball) or drop down and select “Knit to HTML”

  • Go to the class website and under the “Readings & Assignments” >> “Week 7” tab, click on the “Problem set 3 submission link”
  • Submit both your html and .Rmd files
  • Use this naming convention “lastname_firstname_ps#” for your .Rmd (e.g. martin_patricia_ps3.Rmd)