Grade: /45
In this problem set, you will work with data from the College Scorecard and IPEDS. The College Scorecard is an initiative from the U.S. Department of Education to provide students and families with important information about colleges and universities in the U.S., such as cost, debt, and earnings. We will also use IPEDS data, which gathers information about every college and university in the U.S. that receives federal financial aid. In this problem set, we will explore the relationship between cost of attendance (\(X\)) and earnings two years after graduating (\(Y\)) for graduates of MA programs in Education Administration and Supervision. In addition, we will use the categorical \(X\) variable “Carnegie” to run a regression that examines the relationship between institution classification (\(X\)) and earnings two years after graduating (\(Y\)).
The problem set is divided into three parts:
If you have any questions about the problem set, please also post them on the #problemsets slack channel.
Some questions will ask you to write out notation and/or equations.
You can write out notation/equations in one of two ways: (1) using “inline equations,” which begin with a dollar sign $ and end with a dollar sign $; OR (2) in plain text, without dollar signs. We encourage you to try inline equations, but it is fine if you do not.
Tips on writing notation/equations using “inline equations”:
\beta: \(\beta\)
\mu: \(\mu\)
\beta_1: \(\beta_1\)
\mu_Y: \(\mu_Y\)
Use \hat{} for a hat, like this:
\hat{\beta}: \(\hat{\beta}\)
\hat{\beta}_1 (note that the subscript is not within the “hat”): \(\hat{\beta}_1\)
Use \bar{} for a bar, like this:
\bar{Y}: \(\bar{Y}\)
Tips on writing notation/equations in plain text
Please run the code in the following chunk, which does the following:
Note: code chunk omitted from html document using include = FALSE
/3
\(R^2 = \frac{\text{variance in Y that is explained by X}}{\text{total variance in Y}} = \frac{ESS}{TSS}\)
\(R^2 = 1 - \frac{\text{variance in Y not explained by X}}{\text{total variance in Y}} = 1 - \frac{SSR}{TSS}\)
/3
Consider the research question, what is the relationship between cost of attendance (\(X\)) and earnings (\(Y\)) (2 years after graduation)?
Below we use the lm() function and create an object named mod1 that contains results from the bivariate regression of the relationship between cost of attendance (\(X\)) and earnings 2 years after graduation (\(Y\)). We run the anova() function to get the values of ESS, SSR, TSS.
X = coa_grad_res
Y = earn_mdn_hi_2yr
mod1 <- lm(formula = earn_mdn_hi_2yr ~ coa_grad_res, data = df_edu)
summary(mod1)
#>
#> Call:
#> lm(formula = earn_mdn_hi_2yr ~ coa_grad_res, data = df_edu)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -30748 -7469 -2021 4669 54245
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 47019.11832 2113.15268 22.251 < 0.0000000000000002 ***
#> coa_grad_res 0.31536 0.07279 4.332 0.0000195 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 12380 on 338 degrees of freedom
#> (55 observations deleted due to missingness)
#> Multiple R-squared: 0.05261, Adjusted R-squared: 0.0498
#> F-statistic: 18.77 on 1 and 338 DF, p-value: 0.00001948
anova(mod1)
#> Analysis of Variance Table
#>
#> Response: earn_mdn_hi_2yr
#> Df Sum Sq Mean Sq F value Pr(>F)
#> coa_grad_res 1 2876756208 2876756208 18.768 0.00001948 ***
#> Residuals 338 51807412707 153276369
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Explained sum of squares (ESS) = 2,876,756,208
Sum of Squared Residuals (SSR) = 51,807,412,707
Total Sum of Squares (TSS) = ESS + SSR = 54,684,168,915
\(TSS = \sum_{i=1}^{n} (Y_i-\bar{Y})^2\)
Total sum of squares measures the total variation in Y around the sample mean \(\bar{Y}\).
\(ESS = \sum_{i=1}^{n} (\hat{Y_i}-\bar{Y})^2\)
Explained sum of squares measures the amount of variation in Y explained by X.
\(SSR = \sum_{i=1}^{n} (Y_i-\hat{Y_i})^2\)
The sum of squared residuals measures the amount of variation in Y not explained by X.
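The three definitions above can be checked by hand. The sketch below uses simulated data, since df_edu is not reproduced here; the variables x and y and the model fit are hypothetical stand-ins, not the problem set's data. It computes TSS, ESS, and SSR directly from their formulas and confirms that TSS = ESS + SSR and that both expressions for \(R^2\) match what summary() reports.

```r
# Simulated data: x and y are hypothetical stand-ins, not the df_edu variables
set.seed(1)
x <- rnorm(100, mean = 20000, sd = 5000)
y <- 45000 + 0.3 * x + rnorm(100, sd = 12000)
fit <- lm(y ~ x)

y_hat <- fitted(fit)               # predicted values, Y-hat_i
tss <- sum((y - mean(y))^2)        # total sum of squares
ess <- sum((y_hat - mean(y))^2)    # explained sum of squares
ssr <- sum((y - y_hat)^2)          # sum of squared residuals

# Each comparison should return TRUE (up to floating-point tolerance)
all.equal(tss, ess + ssr)
all.equal(ess / tss, summary(fit)$r.squared)
all.equal(1 - ssr / tss, summary(fit)$r.squared)
```

The decomposition TSS = ESS + SSR relies on the model including an intercept, which lm() does by default.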
/2
#ESS
anova(mod1)$"Sum Sq"[1]
#> [1] 2876756208
#SSR
anova(mod1)$"Sum Sq"[2]
#> [1] 51807412707
#TSS
anova(mod1)$"Sum Sq"[1] + anova(mod1)$"Sum Sq"[2]
#> [1] 54684168915
#R2, ESS/TSS
anova(mod1)$"Sum Sq"[1] / (anova(mod1)$"Sum Sq"[1] + anova(mod1)$"Sum Sq"[2])
#> [1] 0.05260675
#R2, 1 - SSR/TSS
1 - anova(mod1)$"Sum Sq"[2] / (anova(mod1)$"Sum Sq"[1] + anova(mod1)$"Sum Sq"[2])
#> [1] 0.05260675
/4
Sample standard deviation = The sample standard deviation of Y, \(\hat{\sigma}_Y\), measures the average distance between a random observation \(Y_i\) and the sample mean \(\bar{Y}\).
\(\hat{\sigma}_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i-\bar{Y})^2}{n -1}}\)
Standard error of the regression (SER) = The standard error of the regression is an estimate of how far away, on average, the actual observed value \(Y_i\) is from the predicted value \(\hat{Y_i}\) for a random observation, \(i\).
\(SER = \sqrt{\frac{\sum_{i=1}^{n} (Y_i-\hat{Y_i})^2}{n -2}} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{u}_i)^2}{n -2}}\)
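As a rough check on the two formulas above, the sketch below computes the sample standard deviation and the SER by hand and compares them with sd() and summary()$sigma. The data are simulated, and x, y, n, and fit are hypothetical names chosen for illustration, not objects from the problem set.

```r
# Simulated data: hypothetical stand-ins for cost of attendance (x) and debt (y)
set.seed(2)
n <- 80
x <- runif(n, min = 5000, max = 40000)
y <- 18000 + 0.45 * x + rnorm(n, sd = 8000)
fit <- lm(y ~ x)

# Sample standard deviation: distance from the sample mean, n - 1 in the denominator
sd_by_hand <- sqrt(sum((y - mean(y))^2) / (n - 1))

# SER: distance from the predicted value, n - 2 in the denominator
ser_by_hand <- sqrt(sum(resid(fit)^2) / (n - 2))

# Both comparisons should return TRUE
all.equal(sd_by_hand, sd(y))
all.equal(ser_by_hand, summary(fit)$sigma)
```

The only differences between the two quantities are the reference point (\(\bar{Y}\) versus \(\hat{Y_i}\)) and the degrees of freedom (n - 1 versus n - 2).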
/2
Run the regression (using lm() and summary()) and assign it to the object mod2; report the SER; and calculate the standard deviation, using sd(), of the dependent variable, debt (debt_all_stgp_eval_mean), in the regression model.
X = coa_grad_res
Y = debt_all_stgp_eval_mean
mod2 <- lm(formula = debt_all_stgp_eval_mean ~ coa_grad_res, data = df_edu)
summary(mod2)
#>
#> Call:
#> lm(formula = debt_all_stgp_eval_mean ~ coa_grad_res, data = df_edu)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -25356 -5537 -1466 4930 29968
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 18258.3119 1347.8901 13.546 <0.0000000000000002 ***
#> coa_grad_res 0.4527 0.0465 9.737 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 8378 on 393 degrees of freedom
#> Multiple R-squared: 0.1943, Adjusted R-squared: 0.1923
#> F-statistic: 94.8 on 1 and 393 DF, p-value: < 0.00000000000000022
#SER
summary(mod2)$sigma
#> [1] 8377.512
#Standard deviation of debt
sd(df_edu$debt_all_stgp_eval_mean, na.rm = TRUE)
#> [1] 9321.583
/3
Interpretation of SER: On average, observed values of institution-level student debt (\(Y_i\)) are 8,377.51 dollars away from predicted values of institution-level student debt \(\hat{Y_i}\).
Interpretation of sample standard deviation: On average, observations of \(Y_i\) are 9,321.58 dollars away from the sample mean \(\bar{Y}\) of Y.
In this section, we will explore the relationship between Carnegie classification (\(X\)) and earnings (\(Y\)) 2 years after graduating for graduates of MA programs in Education Administration and Supervision. The \(X\) variable in this model is carnegie, a factor variable that represents a framework for classifying higher education institutions in the U.S. See here for more info. The \(Y\) variable in this model is earn_mdn_hi_2yr, earnings two years after graduating.
carnegie
earn_mdn_hi_2yr
/1
/1
Which category of the carnegie variable will be the reference group?
Below is a frequency count of our \(X\) (factor) variable carnegie.
df_edu_fac %>% count(carnegie)
#> # A tibble: 4 x 2
#> carnegie n
#> <fct> <int>
#> 1 research 1 75
#> 2 research 2 140
#> 3 masters 1 135
#> 4 masters 2 45
df_edu_fac %>% count(as.integer(carnegie))
#> # A tibble: 4 x 2
#> `as.integer(carnegie)` n
#> <int> <int>
#> 1 1 75
#> 2 2 140
#> 3 3 135
#> 4 4 45
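With R's default treatment contrasts, lm() uses the first level of a factor as the reference group, which is why the integer coding above is informative. The sketch below illustrates this on a small made-up factor; carnegie_demo and releveled are hypothetical objects that only borrow the four category labels, not the df_edu_fac data.

```r
# Hypothetical factor with the same four labels as carnegie
carnegie_demo <- factor(
  c("masters 1", "research 2", "research 1", "masters 2", "research 2"),
  levels = c("research 1", "research 2", "masters 1", "masters 2")
)

# The first level is the reference group under default treatment contrasts
levels(carnegie_demo)[1]
#> [1] "research 1"

# relevel() moves a chosen level to the front, changing the reference group
releveled <- relevel(carnegie_demo, ref = "masters 1")
levels(releveled)[1]
#> [1] "masters 1"
```

Changing the reference group changes what each coefficient is compared against, but not the model's predicted values.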
/2
/6
Some investigations of the categorical \(X\) variable carnegie
X= carnegie
Y= earn_mdn_hi_2yr
mod3 <- lm(formula = earn_mdn_hi_2yr ~ carnegie, data = df_edu_fac)
summary(mod3)
#>
#> Call:
#> lm(formula = earn_mdn_hi_2yr ~ carnegie, data = df_edu_fac)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -32770 -7465 -1999 4321 62838
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 51988 1519 34.228 < 0.0000000000000002 ***
#> carnegieresearch 2 2745 1898 1.446 0.149064
#> carnegiemasters 1 6346 1904 3.333 0.000955 ***
#> carnegiemasters 2 5450 2685 2.030 0.043189 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 12530 on 336 degrees of freedom
#> (55 observations deleted due to missingness)
#> Multiple R-squared: 0.03607, Adjusted R-squared: 0.02746
#> F-statistic: 4.191 on 3 and 336 DF, p-value: 0.006241
YOUR ANSWER HERE:
Population linear regression model: \(Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \beta_3X_{3i} + u_i\)
where:
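\(Y_i\) = earnings two years after graduating (earn_mdn_hi_2yr)
\(X\) = Carnegie classification (carnegie), which has the following four categories: Research 1 [reference group]; Research 2; Masters 1; Masters 2
\(X_{1i}\) = 1 if Research 2, 0 otherwise; \(X_{2i}\) = 1 if Masters 1, 0 otherwise; \(X_{3i}\) = 1 if Masters 2, 0 otherwise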
OLS prediction line (without estimates): \(\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1}X_{1i} + \hat{\beta_2}X_{2i} + \hat{\beta_3}X_{3i}\)
where:
\(\hat{Y_i}\) = predicted earnings two years after graduating (earn_mdn_hi_2yr)
OLS prediction line (with estimates): \(\hat{Y_i} =\) 51,988.34 + 2,745.27 \(\times X_{1i}\) + 6,346.29 \(\times X_{2i}\) + 5,449.51 \(\times X_{3i}\)
/3
YOUR ANSWER HERE:
Interpretation of \(\hat{\beta_1}=\) 2,745.27 (Research 2)
Interpretation of \(\hat{\beta_2}=\) 6,346.29 (Masters 1)
Interpretation of \(\hat{\beta_3}=\) 5,449.51 (Masters 2)
/1
YOUR ANSWER HERE:
Interpret point estimate value of \(\hat{\beta}_0\):
/3
OLS line with estimates
Calculation:
If \(X_1 = 1\) (Research 2) and the other indicator variables (\(X_2\), \(X_3\)) are 0, then our OLS prediction line looks like this:
\(\hat{Y_i} =\) 51,988.34 + 2,745.27 \(\times X_{1i}\) + 6,346.29 \(\times 0\) + 5,449.51 \(\times 0\)
\(\hat{Y_i} =\) 54733.6033058
Now do the following for each value of X
Non reference group 2 (\(X_2\)) = (Master’s 1)
Non reference group 3 (\(X_3\)) = (Master’s 2)
Reference group = Research 1
YOUR ANSWER HERE:
Non reference group 2 (\(X_2\)) = \(\hat{Y_i} =\) 51,988.34 + 2,745.27 \(\times 0\) + 6,346.29 \(\times X_{2i}\) + 5,449.51 \(\times 0\)
= \(\hat{Y_i} =\) 58334.6302521
Non reference group 3 (\(X_3\)) = \(\hat{Y_i} =\) 51,988.34 + 2,745.27 \(\times 0\) + 6,346.29 \(\times 0\) + 5,449.51 \(\times X_{3i}\)
= \(\hat{Y_i} =\) 57437.84375
Reference group = \(\hat{Y_i} =\) 51,988.34 + 2,745.27 \(\times 0\) + 6,346.29 \(\times 0\) + 5,449.51 \(\times 0\)
= \(\hat{Y_i} =\) 51988.3382353
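The four calculations above follow one pattern: set the indicator for the group of interest to 1 and the others to 0. A minimal sketch of that arithmetic, using the rounded coefficient estimates reported in this section, is below. The names b, X, and pred are hypothetical, and the results differ from the values above only in the second decimal place because the coefficients are rounded.

```r
# Rounded estimates from the OLS prediction line above
b <- c(intercept = 51988.34, research2 = 2745.27,
       masters1 = 6346.29, masters2 = 5449.51)

# One row per group; columns are (1, X1, X2, X3)
X <- rbind(
  "research 1" = c(1, 0, 0, 0),   # reference group: all indicators are 0
  "research 2" = c(1, 1, 0, 0),
  "masters 1"  = c(1, 0, 1, 0),
  "masters 2"  = c(1, 0, 0, 1)
)

# Predicted earnings for each group: intercept plus that group's coefficient
pred <- drop(X %*% b)
round(pred, 2)
```

With a single categorical regressor, each predicted value is simply the mean of Y within that group.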
/3
State the null and alternative hypothesis
Solve for value of t using information from the regression output
Using the output from the model, interpret the p-value in words and make a conclusion.
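The steps above can be sketched in a few lines, assuming the test concerns \(\beta_1\) (Research 2) and a two-sided alternative at the 0.05 level. The estimate, standard error, and residual degrees of freedom are copied from the summary(mod3) output; the object names are hypothetical. Because the inputs are rounded, the results match the printed t value and p-value only approximately.

```r
# Values copied (rounded) from the summary(mod3) output for "research 2"
beta1_hat <- 2745   # coefficient estimate
se_beta1  <- 1898   # standard error
df_resid  <- 336    # residual degrees of freedom

# t statistic under H0: beta_1 = 0
t_stat <- (beta1_hat - 0) / se_beta1

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# Decision at the 0.05 significance level (TRUE would mean reject H0)
reject_h0 <- p_value < 0.05

round(c(t = t_stat, p = p_value), 3)
reject_h0
```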
YOUR ANSWER HERE:
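Hypothesis
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 \ne 0\)
T-value: \(t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \frac{2745}{1898} = 1.446\)
P-value: the p-value of 0.149 is greater than 0.05, so we do not reject \(H_0\).
/1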
/1
/1
/3
\(\hat{\beta}_1\) = “being a Research 2 institution as opposed to a Research 1 institution is, on average, associated with a 2,745.27 dollar change in institution-level student earnings for MA graduates of education programs”
/2
Knit to html by clicking the “Knit” button near the top of your RStudio window (icon with blue yarn ball) or drop down and select “Knit to HTML”