Grade: /45
In this problem set, you will work with data from the College Scorecard and IPEDS. The College Scorecard is an initiative from the U.S. Department of Education to provide students and families with important information about colleges and universities in the U.S., such as cost, debt, and earnings. We will also use IPEDS data, which gathers information about every college and university in the U.S. that receives federal financial aid. In this problem set, we will explore the relationship between cost of attendance (\(X\)) and earnings two years after graduating (\(Y\)) for graduates of MA programs in Education Administration and Supervision. In addition, we will use the categorical \(X\) variable “Carnegie” to run a regression that examines the relationship between institution classification type (\(X\)) and earnings two years after graduating (\(Y\)).
The problem set is divided into three parts:
If you have any questions about the problem set, please post them on the #problemsets Slack channel.
Some questions will ask you to write out notation and/or equations.
You can write out notation/equations in one of two ways: (1) using “inline equations,” which begin with a dollar sign $ and end with a dollar sign $; OR (2) in plain text, without dollar signs. We encourage you to try inline equations, but it is fine if you do not.
Tips on writing notation/equations using “inline equations”:
\beta: \(\beta\)
\mu: \(\mu\)
\beta_1: \(\beta_1\)
\mu_Y: \(\mu_Y\)
\hat{} like this:
\hat{\beta}: \(\hat{\beta}\)
\hat{\beta}_1 (note that the subscript is not within the “hat”): \(\hat{\beta}_1\)
\bar{} like this:
\bar{Y}: \(\bar{Y}\)
Tips on writing notation/equations in plain text
Please run the code in the following chunk, which does the following:
Note: code chunk omitted from html document using include = FALSE
/3
/3
Consider the research question: what is the relationship between cost of attendance (\(X\)) and earnings 2 years after graduation (\(Y\))?
Below we use the lm() function and create an object named mod1 that contains results from the bivariate regression of the relationship between cost of attendance (\(X\)) and earnings 2 years after graduation (\(Y\)). We run the anova() function to get the values of ESS, SSR, TSS.
X = coa_grad_res
Y = earn_mdn_hi_2yr
mod1 <- lm(formula = earn_mdn_hi_2yr ~ coa_grad_res, data = df_edu)
summary(mod1)
#>
#> Call:
#> lm(formula = earn_mdn_hi_2yr ~ coa_grad_res, data = df_edu)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -30748  -7469  -2021   4669  54245
#>
#> Coefficients:
#>                 Estimate  Std. Error t value             Pr(>|t|)
#> (Intercept)  47019.11832  2113.15268  22.251 < 0.0000000000000002 ***
#> coa_grad_res     0.31536     0.07279   4.332            0.0000195 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 12380 on 338 degrees of freedom
#> (55 observations deleted due to missingness)
#> Multiple R-squared: 0.05261, Adjusted R-squared: 0.0498
#> F-statistic: 18.77 on 1 and 338 DF, p-value: 0.00001948
anova(mod1)
#> Analysis of Variance Table
#>
#> Response: earn_mdn_hi_2yr
#>               Df      Sum Sq    Mean Sq F value     Pr(>F)
#> coa_grad_res   1  2876756208 2876756208  18.768 0.00001948 ***
#> Residuals    338 51807412707  153276369
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
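The anova() table reports the sums of squares in its "Sum Sq" column. As a minimal sketch of how those pieces relate, the chunk below uses the built-in mtcars data as a stand-in (it is not the problem set data, and the object names fit, aov_tab, ess, ssr, and tss are our own illustrative choices). It assumes the course convention that ESS is the explained sum of squares and SSR is the sum of squared residuals.

```r
# Illustrative only: mtcars ships with R and stands in for df_edu.
fit <- lm(mpg ~ wt, data = mtcars)
aov_tab <- anova(fit)

ess <- aov_tab["wt", "Sum Sq"]         # explained sum of squares (row for X)
ssr <- aov_tab["Residuals", "Sum Sq"]  # sum of squared residuals
tss <- ess + ssr                       # total sum of squares

# TSS can also be computed directly from the outcome variable
tss_direct <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# R-squared is the share of TSS explained by the model
r2 <- ess / tss
```

The same row and column lookups should work on anova(mod1), with the row name of the \(X\) variable swapped in.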
YOUR ANSWER HERE: TSS:
YOUR ANSWER HERE: ESS:
YOUR ANSWER HERE: SSR:
/2
/4
/2
Run the regression (using lm() and summary()) and assign it to the object mod2; report the SER; and calculate the standard deviation, using sd(), of the dependent variable in the regression model, debt (debt_all_stgp_eval_mean).
X = coa_grad_res
Y = debt_all_stgp_eval_mean
/3
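To illustrate how the SER compares with the standard deviation of the dependent variable, here is a hedged sketch using the built-in mtcars data rather than the problem set data. The object names are our own; sigma() and nobs() are standard functions in base R's stats package.

```r
# Illustrative only: mtcars stands in for df_edu.
fit <- lm(mpg ~ wt, data = mtcars)

# SER (reported by summary() as "Residual standard error")
ser <- sigma(fit)

# The same quantity by hand: sqrt(SSR / (n - 2)) for a bivariate regression
n <- nobs(fit)
ser_manual <- sqrt(sum(residuals(fit)^2) / (n - 2))

# Standard deviation of the dependent variable
sd_y <- sd(mtcars$mpg)
```

When a variable contains missing values, sd() returns NA unless you pass na.rm = TRUE.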
In this section, we will explore the relationship between Carnegie classification (\(X\)) and earnings (\(Y\)) 2 years after graduating for graduates of MA programs in Education Administration and Supervision. The \(X\) variable in this model is carnegie, a factor variable that represents a framework for classifying higher education institutions in the U.S. See here for more info. The \(Y\) variable in this model is earn_mdn_hi_2yr, earnings two years after graduating.
carnegie
earn_mdn_hi_2yr
/1
/1
Which category of the carnegie variable will be the reference group?
Below is a frequency count of our \(X\) (factor) variable carnegie.
df_edu_fac %>% count(carnegie)
#> # A tibble: 4 x 2
#>   carnegie       n
#>   <fct>      <int>
#> 1 research 1    75
#> 2 research 2   140
#> 3 masters 1    135
#> 4 masters 2     45
df_edu_fac %>% count(as.integer(carnegie))
#> # A tibble: 4 x 2
#>   `as.integer(carnegie)`     n
#>                    <int> <int>
#> 1                      1    75
#> 2                      2   140
#> 3                      3   135
#> 4                      4    45
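The integer codes above follow the order of the factor levels. As a rough sketch, assuming R's default treatment contrasts, lm() uses the first level as the reference group. The toy factor below is hypothetical (carnegie_toy and carnegie_alt are our own names); only the level labels mirror the frequency table.

```r
# Hypothetical toy factor; level labels mirror the frequency table above.
carnegie_toy <- factor(
  c("research 1", "masters 1", "research 2", "masters 2", "research 1"),
  levels = c("research 1", "research 2", "masters 1", "masters 2")
)

levels(carnegie_toy)      # the first level listed is the reference group in lm()
as.integer(carnegie_toy)  # underlying integer codes follow the level order

# relevel() returns a copy of the factor with a different reference group
carnegie_alt <- relevel(carnegie_toy, ref = "masters 1")
levels(carnegie_alt)
```

Changing the reference group changes which comparisons the coefficients represent, not the fitted values.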
/2
/6
Some investigations of the categorical \(X\) variable carnegie
X= carnegie
Y= earn_mdn_hi_2yr
YOUR ANSWER HERE:
/3
/1
/3
OLS line with estimates
Calculation:
If \(X_{1i} = 1\) and all other indicator variables equal 0 (\(X_{2i} = 0\), \(X_{3i} = 0\)), then our OLS prediction line looks like this:
\(\hat{Y}_i = 51,988.34 + 2,745 \times X_{1i} + 6,346 \times 0 + 5,450 \times 0\)
\(\hat{Y}_i = 51,988.34 + 2,745 \times 1\)
\(\hat{Y}_i = 54,733.34\)
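The calculation above can be sketched in R. The example below uses a small made-up data frame (toy, with groups a, b, and c, all hypothetical) rather than the problem set data. It suggests that, in a regression with a single factor \(X\), the intercept plus a group's coefficient reproduces that group's sample mean of \(Y\).

```r
# Hypothetical toy data: three groups, with "a" as the reference group.
toy <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c"), levels = c("a", "b", "c")),
  y     = c(10, 12, 15, 17, 20, 24)
)

fit <- lm(y ~ group, data = toy)
b <- coef(fit)

# Predicted value for each group: intercept plus that group's coefficient
pred_a <- b[["(Intercept)"]]                  # reference group: all dummies are 0
pred_b <- b[["(Intercept)"]] + b[["groupb"]]  # dummy for "b" is 1, others are 0
pred_c <- b[["(Intercept)"]] + b[["groupc"]]  # dummy for "c" is 1, others are 0

# Compare with the sample mean of y within each group
group_means <- tapply(toy$y, toy$group, mean)
```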
Now do the following for each value of X
Non reference group 2 (\(X_2\)) = (Master’s 1)
Non reference group 3 (\(X_3\)) = (Master’s 2)
Reference group = Research 1
YOUR ANSWER HERE:
/3
State the null and alternative hypotheses
Solve for value of t using information from the regression output
Using the output from the model, interpret the p-value in words and make a conclusion.
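As a hedged illustration of the steps above, the sketch below recomputes a t-statistic and a two-sided p-value from the estimate and standard error in a regression output, for the null hypothesis that the coefficient equals 0. It uses the built-in mtcars data, not the problem set data, and the object names are our own.

```r
# Illustrative only: mtcars stands in for the problem set data.
fit <- lm(mpg ~ wt, data = mtcars)
coefs <- summary(fit)$coefficients

est <- coefs["wt", "Estimate"]
se  <- coefs["wt", "Std. Error"]

# t = (estimate - hypothesized value) / standard error, with H0: beta = 0
t_stat <- (est - 0) / se

# Two-sided p-value from the t distribution with residual degrees of freedom
p_val <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)
```

Both values should match the "t value" and "Pr(>|t|)" columns that summary() prints for the same coefficient.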
YOUR ANSWER HERE:
/1
/1
/1
/3
/2
Knit to HTML by clicking the “Knit” button near the top of your RStudio window (icon with blue yarn ball), or open the drop-down menu and select “Knit to HTML”.