EDUC152, Problem Set #3

Grade: /45

Overview

In this problem set, you will work with data from the College Scorecard and IPEDS. The College Scorecard is an initiative from the U.S. Department of Education to provide students and families with important information about colleges and universities in the U.S., such as cost, debt, and earnings. We will also use IPEDS data, which gathers information about every college and university in the U.S. that receives federal financial aid. We will explore the relationship between cost of attendance ($X$) and earnings two years after graduating ($Y$) for graduates of MA programs in Education Administration and Supervision. In addition, we will use the categorical $X$ variable “Carnegie” to run a regression that examines the relationship between institution classification ($X$) and earnings two years after graduating ($Y$).

The problem set is divided into three parts:

In part I, you will answer questions about model fit ($R^2$ & SER)

In part II, you will answer questions about categorical $X$ variables

In part III, you will answer questions about confidence intervals

If you have any questions about the problem set, please post them on the #problemsets Slack channel.

Tips on notation

Some questions will ask you to write out notation and/or equations.

You can write out notation/equations in one of two ways: (1) using “inline equations,” which begin with a dollar sign $ and end with a dollar sign $; OR (2) in plain text, without dollar signs. We encourage you to try inline equations, but it is fine if you do not.

Tips on writing notation/equations using “inline equations”:

Make sure there are no spaces after the dollar sign $ that begins the equation and no spaces before the dollar sign that ends the equation.

For example, you would write out the notation for treated potential outcome like this: $Y_i(1)$

But this wouldn’t work: $ Y_i(1)$

And this wouldn’t work: $Y_i(1) $

Special characters, like Greek letters, are written within inline equations using commands that start with a backslash:

e.g., “Beta” is \beta: $\beta$

“Mu” (symbol for population mean) is \mu: $\mu$

Subscripts after a character or symbol are specified like this:

e.g., “Beta subscript 1” is \beta_1: $\beta_1$

e.g., “Mu subscript Y” (referring to population mean of variable Y) is \mu_Y: $\mu_Y$

“Hats” are specified by wrapping the character/symbol in the curly brackets of \hat{} like this:

e.g., “Beta hat” is \hat{\beta}: $\hat{\beta}$

e.g., “Beta hat subscript 1” is \hat{\beta}_1 (note that the subscript is not within the “hat”): $\hat{\beta}_1$

“Bars” are specified by wrapping the character/symbol in the curly brackets of \bar{} like this:

e.g., “sample mean of Y” is \bar{Y}: $\bar{Y}$

Don’t worry about getting it perfect, and don’t spend too much time trying; if you are trying, that is a great start! It is also fine to use inline equations for some notation/equations and plain text for others that you can’t figure out.

Tips on writing notation/equations in plain text

Instead of writing $Y_i(1)$, you could write this: Y_i(1)

Instead of writing $Y_i = \beta_0 + \beta_1X_i + u_i$, you could write this: Y_i = beta_0 + beta_1*X_i + u_i

Instead of writing $\hat{Y_i} = \hat{\beta}_0 + \hat{\beta}_1X_i$, you could write something like this: Y_hat_i = beta_hat_0 + beta_hat_1*X_i

Don’t worry if it doesn’t look pretty!

Load libraries and data

Please run the code in the following chunk, which does the following:

Loads libraries

Loads and creates data frame from IPEDS/College Scorecard Masters degrees in Education

Note: code chunk omitted from html document using include = FALSE

Part I: Measures of model fit

/3

1. Define $R^2$ in words. Write out the mathematical formula for $R^2$ using both ways discussed in lecture.

YOUR ANSWER HERE:

/3

2. Using the code below to guide you, write out the formula for TSS and explain what it means in words. Do the same for ESS and SSR.

Consider the research question: what is the relationship between cost of attendance ($X$) and earnings ($Y$) two years after graduation?

Below we use the lm() function and create an object named mod1 that contains results from the bivariate regression of the relationship between cost of attendance ($X$) and earnings 2 years after graduation ($Y$). We run the anova() function to get the values of ESS, SSR, TSS.

X= coa_grad_res

Y= earn_mdn_hi_2yr

mod1 <- lm(formula = earn_mdn_hi_2yr ~ coa_grad_res, data = df_edu)
summary(mod1)
#>
#> Call:
#> lm(formula = earn_mdn_hi_2yr ~ coa_grad_res, data = df_edu)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -30748  -7469  -2021   4669  54245
#>
#> Coefficients:
#>                 Estimate  Std. Error t value             Pr(>|t|)
#> (Intercept)  47019.11832  2113.15268  22.251 < 0.0000000000000002 ***
#> coa_grad_res     0.31536     0.07279   4.332            0.0000195 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 12380 on 338 degrees of freedom
#>   (55 observations deleted due to missingness)
#> Multiple R-squared:  0.05261, Adjusted R-squared:  0.0498
#> F-statistic: 18.77 on 1 and 338 DF,  p-value: 0.00001948
anova(mod1)
#> Analysis of Variance Table
#>
#> Response: earn_mdn_hi_2yr
#>               Df      Sum Sq    Mean Sq F value     Pr(>F)
#> coa_grad_res   1  2876756208 2876756208  18.768 0.00001948 ***
#> Residuals    338 51807412707  153276369
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
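To see where each sum of squares lives in the anova() table, here is a minimal sketch that uses R's built-in mtcars data as a stand-in (it is not the problem-set data frame, and the object names mod_demo, aov_demo, ess, ssr, and tss are illustrative assumptions). It reads ESS and SSR from the "Sum Sq" column and checks that they add up to TSS computed directly from the outcome.

```r
# Stand-in example: built-in mtcars data, NOT the problem-set data frame df_edu.
mod_demo <- lm(mpg ~ wt, data = mtcars)
aov_demo <- anova(mod_demo)

# anova() stores sums of squares in the "Sum Sq" column:
# row 1 is the regressor (ESS), row 2 is "Residuals" (SSR).
ess <- aov_demo[["Sum Sq"]][1]
ssr <- aov_demo[["Sum Sq"]][2]
tss <- ess + ssr  # TSS = ESS + SSR

# TSS computed directly from the outcome should match.
tss_direct <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# Two equivalent ways to get R-squared.
r2_a <- ess / tss
r2_b <- 1 - ssr / tss

print(c(ESS = ess, SSR = ssr, TSS = tss, TSS_direct = tss_direct))
print(c(r2_a = r2_a, r2_b = r2_b, summary_r2 = summary(mod_demo)$r.squared))
```

The same row and column positions apply to the anova(mod1) output above, since mod1 also has a single regressor.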

YOUR ANSWER HERE: TSS:

YOUR ANSWER HERE: ESS:

YOUR ANSWER HERE: SSR:

/2

3. Using the values for ESS, SSR, and TSS from above, calculate $R^2$ in both ways discussed in lecture (can do it by hand below or in a code chunk). Interpret the value of $R^2$ in words.

YOUR ANSWER HERE:

/4

4. Explain what sample standard deviation of a variable means in words and write out the formula. Explain what standard error of the regression (SER) means in words and write out the formula for SER (in terms of SSR).

YOUR ANSWER HERE:

/2

5. For the analysis of the relationship between cost of attendance ($X$) and debt ($Y$), do the following: run the regression in R (using lm() and summary()) and assign it to the object mod2; report the SER; and use sd() to calculate the standard deviation of the dependent variable, debt_all_stgp_eval_mean, in the regression model.

X= coa_grad_res

Y= debt_all_stgp_eval_mean
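The workflow in this question can be sketched on stand-in data. The example below uses the built-in mtcars data rather than df_edu, and the object names are illustrative assumptions. It computes the SER by hand as the square root of SSR divided by n minus 2 (the degrees of freedom for a bivariate regression), compares it with the value R reports, and puts it next to the sample standard deviation of the outcome.

```r
# Stand-in example: built-in mtcars data, NOT df_edu.
mod_demo <- lm(mpg ~ wt, data = mtcars)

n   <- nobs(mod_demo)              # observations used in the fit
ssr <- sum(residuals(mod_demo)^2)  # sum of squared residuals

# SER for a bivariate regression: sqrt(SSR / (n - 2)).
ser_manual <- sqrt(ssr / (n - 2))
ser_r      <- sigma(mod_demo)      # "Residual standard error" in summary()

# Sample standard deviation of the outcome, restricted to rows the model used.
# (With missing data, sd() on the raw column needs na.rm = TRUE.)
sd_y <- sd(model.frame(mod_demo)$mpg)

print(c(ser_manual = ser_manual, ser_r = ser_r, sd_y = sd_y))
```

Comparing the SER with the standard deviation of $Y$ shows how much the typical prediction error shrinks once $X$ is used.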

/3

6. Interpret the SER from the above model in words. Interpret the sample standard deviation of $Y$ from the above model in words. Does our model make our prediction substantially better?

YOUR ANSWER HERE:

Part II: Categorical X variables

In this section, we will explore the relationship between Carnegie classification ($X$) and earnings ($Y$) two years after graduating for graduates of MA programs in Education Administration and Supervision. The $X$ variable in this model is carnegie, a factor variable that represents a framework for classifying higher education institutions in the U.S. See here for more info. The $Y$ variable in this model is earn_mdn_hi_2yr, earnings two years after graduating.

$X$ = carnegie

$Y$ = earn_mdn_hi_2yr

/1

1. Explain what a reference group is in words.

YOUR ANSWER HERE:

/1

2. In our analysis of the relationship between carnegie ($X$) and earnings ($Y$), which category of the carnegie variable will be the reference group?

Below is a frequency count of our $X$ (factor) variable carnegie.

df_edu_fac %>% count(carnegie)
#> # A tibble: 4 x 2
#>   carnegie       n
#>   <fct>      <int>
#> 1 research 1    75
#> 2 research 2   140
#> 3 masters 1    135
#> 4 masters 2     45
df_edu_fac %>% count(as.integer(carnegie))
#> # A tibble: 4 x 2
#>   `as.integer(carnegie)`     n
#>                    <int> <int>
#> 1                      1    75
#> 2                      2   140
#> 3                      3   135
#> 4                      4    45
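As a hedged illustration of how R handles factor levels in lm(), the sketch below uses the built-in PlantGrowth data (a stand-in, not df_edu_fac), whose group variable is a factor with three levels. It assumes R's default treatment contrasts. The names mod_fac, pg2, and mod_fac2, and the choice of "trt1" as a new reference level, are hypothetical.

```r
# Stand-in example: built-in PlantGrowth data, NOT df_edu_fac.
# PlantGrowth$group is a factor with levels "ctrl", "trt1", "trt2".
print(levels(PlantGrowth$group))            # level order
print(table(as.integer(PlantGrowth$group))) # underlying integer codes

# With R's default treatment contrasts, lm() omits the first level
# and treats it as the reference group.
mod_fac <- lm(weight ~ group, data = PlantGrowth)
print(names(coef(mod_fac)))

# relevel() changes which level comes first (hypothetical choice of "trt1").
pg2 <- transform(PlantGrowth, group = relevel(group, ref = "trt1"))
mod_fac2 <- lm(weight ~ group, data = pg2)
print(names(coef(mod_fac2)))
```

The level that has no coefficient of its own in the output is the one absorbed into the intercept.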

YOUR ANSWER HERE:

/2

3. What is a factor variable and why is it important for running a regression?

YOUR ANSWER HERE:

/6

4. For your analysis of the relationship between classification of university ($X$) and institution-level student earnings ($Y$) for graduates of MA programs in Education Administration and Supervision, do the following: write out the population linear regression model (label the symbols and write out what the variables $X$ and $Y$ actually represent); write out the OLS prediction line (without estimate values); write out the OLS prediction line (with estimate values).

Some investigations of the categorical $X$ variable carnegie

X= carnegie

Y= earn_mdn_hi_2yr

YOUR ANSWER HERE:

/3

5. Interpret the point estimate value(s) of $\hat{\beta}_1$, $\hat{\beta}_2$, $\hat{\beta}_3$ in words.

YOUR ANSWER HERE:

/1

6. Interpret the point estimate value of $\hat{\beta}_0$ in words.

YOUR ANSWER HERE:

/3

7. What is the predicted value of $Y$ for each of the following university types? Show work (OLS prediction line w/ estimates; then result of calculation).

OLS line with estimates

Hint: Our OLS prediction line looks something like this: $\hat{Y_i} = 51,988.34 + 2745*X_{1i} + 6346*X_{2i} + 5450*X_{3i}$

Calculation:

If $X_1 = 1$ (Research 2) and the other indicator variables ($X_2$, $X_3$) are 0, then our OLS prediction line looks like this:

$\hat{Y_i} = 51,988.34 + 2745*1 + 6346*0 + 5450*0$

$\hat{Y_i} = 51,988.34 + 2745$

$\hat{Y_i} = 54,733.34$

Now do the same for each remaining value of $X$:

Non reference group 2 ($X_2$) = (Master’s 1)

Non reference group 3 ($X_3$) = (Master’s 2)

Reference group = Research 1
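The arithmetic in the hint can be checked in R. The sketch below is a stand-in example on the built-in PlantGrowth data (not df_edu_fac), with illustrative object names. It builds each group's prediction by hand from the coefficients and compares the result with predict() and with the raw group means.

```r
# Stand-in example: built-in PlantGrowth data, NOT df_edu_fac.
mod_fac <- lm(weight ~ group, data = PlantGrowth)
b <- coef(mod_fac)

# Manual predictions: intercept alone for the reference group ("ctrl"),
# intercept plus the group's coefficient for each non-reference group.
pred_manual <- c(
  ctrl = unname(b["(Intercept)"]),
  trt1 = unname(b["(Intercept)"] + b["grouptrt1"]),
  trt2 = unname(b["(Intercept)"] + b["grouptrt2"])
)

# Same predictions from predict().
new_groups <- data.frame(
  group = factor(c("ctrl", "trt1", "trt2"), levels = levels(PlantGrowth$group))
)
pred_r <- predict(mod_fac, newdata = new_groups)

# In a regression on a single factor, each prediction equals the group mean.
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)

print(pred_manual)
print(group_means)
```

Setting one indicator to 1 and the rest to 0 is what the manual sums above do for each non-reference group.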

YOUR ANSWER HERE:

/3

8. For one of the non-reference group categories of $X$ (e.g., Research 2, Master’s 1, Master’s 2) do the following:

State the null and alternative hypothesis

Solve for value of t using information from the regression output

Using the output from the model, interpret the p-value in words and make a conclusion.
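As a sketch of the mechanics (again on the built-in PlantGrowth stand-in data, not df_edu_fac), the example below recomputes the t statistic and two-sided p-value for one non-reference coefficient under the null hypothesis that the population coefficient is 0. The object names and the choice of "grouptrt1" are illustrative assumptions.

```r
# Stand-in example: built-in PlantGrowth data, NOT df_edu_fac.
mod_fac <- lm(weight ~ group, data = PlantGrowth)
coefs   <- summary(mod_fac)$coefficients  # Estimate, Std. Error, t value, Pr(>|t|)

est <- coefs["grouptrt1", "Estimate"]
se  <- coefs["grouptrt1", "Std. Error"]

# t = (estimate - hypothesized value) / standard error, with H0: beta_k = 0.
t_manual <- (est - 0) / se

# Two-sided p-value from the t distribution with the residual degrees of freedom.
p_manual <- 2 * pt(abs(t_manual), df = df.residual(mod_fac), lower.tail = FALSE)

print(c(t_manual = t_manual, t_reported = coefs["grouptrt1", "t value"]))
print(c(p_manual = p_manual, p_reported = coefs["grouptrt1", "Pr(>|t|)"]))
```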

YOUR ANSWER HERE:

Part III: Confidence interval about $\beta_k$

/1

1. In words, what is a confidence interval?

YOUR ANSWER HERE:

/1

2. What is the formula for a confidence interval about some population parameter?

YOUR ANSWER HERE:

/1

3. What is the formula for a 95% confidence interval about a population regression coefficient $\beta_k$?

YOUR ANSWER HERE:

/3

4. For a non-reference group category of variable $X$ (a different category than the one you chose for previous set of questions), do the following:

Interpret $\hat{\beta}_k$ in words

Calculate the 95% confidence interval

Interpret the 95% confidence interval in words
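The calculation step can be sketched on stand-in data. The example below uses the built-in PlantGrowth data (not df_edu_fac) and illustrative object names. It builds a 95% interval by hand as the estimate plus or minus a critical value times the standard error, then compares it with confint(). It takes the critical value from the t distribution, which is an assumption here; in large samples it is close to the 1.96 often used in hand calculations.

```r
# Stand-in example: built-in PlantGrowth data, NOT df_edu_fac.
mod_fac <- lm(weight ~ group, data = PlantGrowth)
coefs   <- summary(mod_fac)$coefficients

est <- coefs["grouptrt2", "Estimate"]
se  <- coefs["grouptrt2", "Std. Error"]

# 95% CI: estimate +/- critical value * standard error.
# The critical value comes from the t distribution with the residual df.
t_crit    <- qt(0.975, df = df.residual(mod_fac))
ci_manual <- c(lower = est - t_crit * se, upper = est + t_crit * se)

# confint() reports the same interval.
ci_r <- confint(mod_fac, "grouptrt2", level = 0.95)

print(ci_manual)
print(ci_r)
```

With the problem-set model, the estimate and standard error come from the row of the regression output for the chosen category.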

YOUR ANSWER HERE:

Part IV: Post a comment/question

/2

Go to the class #problemsets channel and create a new post.

You can either:

Share something you learned or a question from this problem set. Make sure to mention the instructors (@ozanj, @Patricia Martín).

Respond to a post made by another student.

Knit to html and submit problem set

Knit to html by clicking the “Knit” button near the top of your RStudio window (icon with blue yarn ball) or drop down and select “Knit to HTML”

Go to the class website and under the “Readings & Assignments” >> “Week 7” tab, click on the “Problem set 3 submission link”

Submit both your html and .Rmd files

Use this naming convention “lastname_firstname_ps#” for your .Rmd (e.g. martin_patricia_ps3.Rmd)
