Load libraries and data

Note: code chunk omitted from html document using include = FALSE

Grade: /80

Overview


Problem set 4 is the take-home final for EDUC152. As such, it is longer and more challenging than previous problem sets. Because this is a take-home final, we cannot answer questions except for providing clarification. If you need clarification on a question, please send us a DM and/or come to our office hours.

The first part of the problem set asks you to answer questions about the article by Cabrera, Milem, Jaquette, & Marx (2014):

  • Cabrera, N. L., Milem, J. F., Jaquette, O., & Marx, R. (2014). Missing the (student achievement) forest for all the (political) trees: Empiricism and the Mexican American Studies controversy in Tucson. American Educational Research Journal, 51(6), 1084-1118.

The remainder of the problem set asks questions about the following research question: “What is the effect of taking developmental math courses in college (\(X\)) on student success (\(Y\)) for students who start postsecondary education at a community college and who have a delayed entry (at least a one-year delay between completing high school and starting college)?”

  • We will consider two different outcome variables, \(Y\):
    • (continuous) total number of credits earned in postsecondary education (across all institutions attended)
    • (dichotomous) whether the student ever transferred to a four-year institution
  • Developmental math courses, \(X\), teach math skills that students are generally expected to learn in high school
    • These courses are sometimes referred to as “remedial” but I prefer the term developmental
    • Usually, students are assigned to developmental math courses based on their score in a placement exam or based on the recommendation of a community college guidance counselor
    • Generally, developmental math courses do not count for “college credit” but students must pass these courses before they can take college credit math courses
  • Many researchers and policymakers have been critical of placing too many community college students in developmental math courses and recommend that most students be placed immediately into college-level math courses.
  • Our research question will assess whether taking developmental math courses helps or hinders the success of students who have a delayed entry to college (at least a one-year delay between completing high school and starting college)
Tips on notation


Some questions will ask you to write out notation and/or equations.

You can write out notation/equations in one of two ways: (1) using “inline equations,” which begin with a dollar sign $ and end with a dollar sign $; OR (2) in plain text, without dollar signs. We encourage you to try inline equations, but it is fine if you do not.

Tips on writing notation/equations using “inline equations”:

  • Make sure there are no spaces after the dollar sign $ that begins the equation and no spaces before the dollar sign that ends the equation.
    • For example, you would write out the notation for the treated potential outcome like this: $Y_i(1)$, which renders as \(Y_i(1)\)
    • But this wouldn’t work: $ Y_i(1)$
    • And this wouldn’t work: $Y_i(1) $
  • Special characters – like Greek letters – within inline equations are referred to using special symbols that start with a backslash
    • e.g., “Beta” is \beta: \(\beta\)
    • “Mu” (symbol for population mean) is \mu: \(\mu\)
  • Subscripts after a character or symbol are specified like this:
    • e.g., “Beta subscript 1” is \beta_1: \(\beta_1\)
    • e.g., “Mu subscript Y” (referring to population mean of variable Y) is \mu_Y: \(\mu_Y\)
  • “Hats” are specified by wrapping the character/symbol in the curly brackets of \hat{}, like this:
    • e.g., “Beta hat” is \hat{\beta}: \(\hat{\beta}\)
    • e.g., “Beta hat subscript 1” is \hat{\beta}_1 (note that the subscript is not within the “hat”): \(\hat{\beta}_1\)
  • “Bars” are specified by wrapping the character/symbol in the curly brackets of \bar{}, like this:
    • e.g., “sample mean of Y” is \bar{Y}: \(\bar{Y}\)
  • Don’t worry about getting it perfect and don’t spend too much time trying to get it perfect; if you are trying, that is a great start! It is also fine to use inline equations for some notation/equations and plain text for others that you can’t figure out.
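
Putting these tips together, here is a minimal sketch of how a few inline equations might be typed in the .Rmd source. The regression equations are generic illustrations with one regressor, not the answer to any particular question:

```latex
$Y_i(1)$
$\bar{Y}$
$Y_i = \beta_0 + \beta_1X_i + u_i$
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1X_i$
```

Each line starts and ends with a dollar sign, with no space just inside either one, and the subscripts sit outside the \hat{} brackets.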


Tips on writing notation/equations in plain text

  • Instead of writing \(Y_i(1)\), you could write this: Y_i(1)
  • Instead of writing \(Y_i = \beta_0 + \beta_1X_i + u_i\), you could write this: Y_i = beta_0 + beta_1*X_i + u_i
  • Instead of writing \(\hat{Y_i} = \hat{\beta}_0 + \hat{\beta}_1X_i\), you could write something like this: Y_hat_i = beta_hat_0 + beta_hat_1*X_i
  • Don’t worry if it doesn’t look pretty!


Part I: Questions about Cabrera et al. (2014)

/2

1. In a couple sentences, what was the Mexican American Studies program and who was allowed to participate?

  • YOUR ANSWER HERE:

/2

2. In social science research, a “mechanism” is an explanation for why one variable (\(X\)) has a causal effect on another variable (\(Y\)). In your own words, what are 2-3 mechanisms for why participation in the MAS program would have a positive causal effect on the probability of graduating from high school?

  • YOUR ANSWER HERE:

/1

3. What does the conditional independence assumption mean, in your own words?

  • YOUR ANSWER HERE:

/2

4. What does omitted variable bias mean, in your own words? What are the two conditions that must be satisfied for an omitted variable \(Z\) to cause omitted variable bias?

  • YOUR ANSWER HERE:

/2

5. What is the connection between the conditional independence assumption and omitted variable bias?

  • YOUR ANSWER HERE:

/1

6. If students had been randomly assigned to participate in MAS (as in an experiment), why would we be unconcerned about omitted variable bias?

  • YOUR ANSWER HERE:

/3

8. In observational data (units not randomly assigned to values of \(X\)), when researchers use regression to estimate the causal effect of \(X\) on \(Y\), what is the primary strategy they use to eliminate (or at least reduce) omitted variable bias?

  • YOUR ANSWER HERE:

/2

9. The “linear probability model” basically means applying ordinary least squares (OLS) regression to a dichotomous (0/1) outcome variable as opposed to a continuous outcome variable. In the linear probability model, what is the “generic” interpretation of \(\hat{\beta}_1\) when \(X_1\) is a categorical variable?

  • YOUR ANSWER HERE:

/4

10. Below is a partial screenshot of Table 5 from Cabrera et al. (2014), where the dependent variable is a dichotomous measure of whether the student graduated from high school, and the independent variable of interest is a categorical measure of the number of MAS courses taken (reference group is zero MAS courses). Each column in Table 5 is a different regression model. Interpret the coefficients for the “All Cohorts” model in words.

  • YOUR ANSWER HERE:

Part II: Effect of developmental math on college credits

The code chunk below shows some descriptive statistics about the variables in the model.

  • The data frame we will use is df_els_cc_delay
    • each observation \(i\) is a student
    • consists of students who started postsecondary education at a community college and who had at least a one-year delay between finishing high school and starting college
    • all categorical variables are factor class
    • Note: the data frame df_els_cc is the same as df_els_cc_delay but also includes students who started at community college with no delay after finishing high school
  • The dependent variable, \(Y\), is a continuous measure of the total number of postsecondary credits earned, across all institutions
    • measured by f3tzpostern
    • variable label: Transcript: Postsecondary career: known credits earned
  • The independent variable, \(X\), is a three-category measure of the number of developmental math courses taken. We have several alternative versions of this variable:
    • label for the variable dev_math_cat3: three category indicator of whether student took any developmental math courses in postsecondary education (based on f3tzremmttot)

# dependent variable: number of postsecondary credits
df_els_cc_delay %>% summarize(
  mean_pse_cred = mean(f3tzpostern, na.rm = TRUE),
  sd_pse_cred = sd(f3tzpostern, na.rm = TRUE)
)
#> # A tibble: 1 x 2
#>   mean_pse_cred sd_pse_cred
#>           <dbl>       <dbl>
#> 1          45.1        47.2

# independent variable: taking developmental math courses

  # three category
  df_els_cc_delay %>% count(dev_math_cat3)
#> # A tibble: 3 x 2
#>   dev_math_cat3     n
#>   <fct>         <int>
#> 1 0 courses       318
#> 2 1 course        173
#> 3 2+ courses      195

  # four category
  #df_els_cc_delay %>% count(dev_math_cat4)
  
  # dichotomous
  #df_els_cc_delay %>% count(dev_math_01)

/2

1. Our goal is to estimate the causal effect of taking developmental math courses (\(X\)) on number of postsecondary credits earned (\(Y\)) for the population of students who start at a community college and who have at least a one-year delay between finishing high school and starting college. Recall that a “mechanism” is an explanation for why one variable (\(X\)) has a causal effect on another variable (\(Y\)). Give one explanation for why taking developmental math courses (\(X\)) might have a negative causal effect on number of postsecondary credits earned (\(Y\)) for our population of interest (one or two sentences is fine). Give one explanation for why taking developmental math courses (\(X\)) might have a positive causal effect on number of postsecondary credits earned (\(Y\)) for our population of interest (one or two sentences is fine).

  • YOUR ANSWER HERE:

/8

2. Run the model below of the relationship between developmental math (\(X\)) and college credits earned (\(Y\)) with no control variables.

Do the following:

  • Write out the population linear regression model (make sure to define variables)
  • Write out the OLS prediction line without estimate values
  • Write out the OLS prediction line with estimate values
  • Interpret the coefficients associated with developmental math in words
  • For each coefficient associated with developmental math do we reject the null hypothesis \(H_0: \beta_k = 0\) using an alpha-level (rejection region) of .05?
  • Interpret the coefficient \(\hat{\beta}_0\) in words
cred_mod1 <- lm(formula = f3tzpostern ~ dev_math_cat3, data = df_els_cc_delay)

summary(cred_mod1)
#> 
#> Call:
#> lm(formula = f3tzpostern ~ dev_math_cat3, data = df_els_cc_delay)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -57.02 -34.06 -16.97  21.02 184.94 
#> 
#> Coefficients:
#>                         Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)               38.969      2.614  14.906 < 0.0000000000000002 ***
#> dev_math_cat31 course      4.089      4.404   0.928                0.353    
#> dev_math_cat32+ courses   18.052      4.240   4.257            0.0000236 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 46.62 on 683 degrees of freedom
#> Multiple R-squared:  0.02649,    Adjusted R-squared:  0.02364 
#> F-statistic: 9.291 on 2 and 683 DF,  p-value: 0.0001044
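
As a reminder of how the columns of a coefficient table relate to one another, the sketch below rebuilds the t values and p-values by hand. It uses R's built-in mtcars data purely for illustration (not the ELS data), and the object names mod, coef_table, t_manual, p_manual, and reject_h0 are hypothetical:

```r
# Built-in mtcars data used purely for illustration (not the ELS data)
mod <- lm(mpg ~ factor(cyl), data = mtcars)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(mod))

# Each t value is the estimate divided by its standard error
t_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Two-sided p-values from the t distribution with the residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(mod), lower.tail = FALSE)

# TRUE where H0: beta_k = 0 is rejected at alpha = .05
reject_h0 <- p_manual < 0.05
print(reject_h0)
```

The same logic applies when reading any summary() output from lm(): a coefficient is statistically different from zero at the .05 level when its value in the Pr(>|t|) column is below .05.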
  • YOUR ANSWER HERE:

/3

3. Define \(R^2\) in words. Write out the mathematical formula for \(R^2\) and interpret the value of \(R^2\) from the above model in words.

  • YOUR ANSWER HERE:

/3

4. Explain what the standard error of the regression (SER) means in words, write out the formula for SER (in terms of SSR), and interpret the value of SER from the above model in words.

  • YOUR ANSWER HERE:

/3

5. Explain what the sample standard deviation of a variable means in words and write out the formula. Interpret the value of the standard deviation for the variable f3tzpostern (given below) in words. Using your judgment, does our regression model make our predictions about the value of f3tzpostern substantially better than just using the sample mean of f3tzpostern?

# standard deviation of Y
sd(df_els_cc_delay$f3tzpostern, na.rm = TRUE)
#> [1] 47.17989

# Standard error of the regression
summary(cred_mod1)$sigma
#> [1] 46.619
  • YOUR ANSWER HERE:

/4

6. Now, we will start adding “control” variables to our model. We will add bytxmstd, a continuous measure of high school standardized math score, and f2everdo, a dichotomous categorical measure of whether the student ever had a “dropout episode” in high school. Run the code below, which provides descriptive statistics for these variables. Separately for each of these variables, answer the following: Why might this variable satisfy the first condition of omitted variable bias? Why might this variable satisfy the second condition of omitted variable bias?

  • YOUR ANSWER HERE:

Descriptive statistics

# continuous control, high school math standardized test score
df_els_cc_delay %>% summarize(
  mean_hs_math = mean(bytxmstd, na.rm = TRUE),
  min_hs_math = min(bytxmstd, na.rm = TRUE),
  max_hs_math = max(bytxmstd, na.rm = TRUE),
  sd_hs_math = sd(bytxmstd, na.rm = TRUE)
)
#> # A tibble: 1 x 4
#>   mean_hs_math min_hs_math max_hs_math sd_hs_math
#>          <dbl>   <dbl+lbl>   <dbl+lbl>      <dbl>
#> 1         47.7        22.3        75.1       8.65

# dichotomous measure of whether student ever had a dropout episode
df_els_cc_delay %>% count(f2everdo)
#> # A tibble: 2 x 2
#>   f2everdo                                     n
#>   <fct>                                    <int>
#> 1 No available evidence of dropout episode   577
#> 2 Evidence of a dropout episode              109

# dichotomous measure of whether student ever had a dropout episode, showing integer values rather than factor levels
df_els_cc_delay %>% count(as.integer(f2everdo))
#> # A tibble: 2 x 2
#>   `as.integer(f2everdo)`     n
#>                    <int> <int>
#> 1                      1   577
#> 2                      2   109
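
The integer values shown above are the factor's internal level codes, which R numbers starting from 1 in the order of the levels. A minimal sketch with a toy two-level factor (hypothetical values, not the ELS data; the names dropout, codes, and indicator are made up for illustration):

```r
# Toy two-level factor (hypothetical values, not the ELS data)
dropout <- factor(c("No", "Yes", "No", "No"), levels = c("No", "Yes"))

# as.integer() returns the position of each level: "No" -> 1, "Yes" -> 2
codes <- as.integer(dropout)

# For a two-level factor, subtracting 1 gives a conventional 0/1 indicator
indicator <- as.integer(dropout) - 1L

print(codes)
print(indicator)
```

When a factor is entered directly in an lm() formula, as f2everdo is in this problem set, R builds the indicator variable for you, so this conversion is only needed when you want the numeric codes yourself.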

/10

7. Run the model below of the relationship between developmental math (\(X\)) and college credits earned (\(Y\)) with the inclusion of control variables.

Do the following:

  • Write out the population linear regression model (make sure to define variables)
  • Write out the OLS prediction line without estimate values
  • Write out the OLS prediction line with estimate values
  • Interpret the coefficients associated with developmental math in words
  • Interpret the coefficient associated with high school math score in words
  • For each coefficient associated with developmental math do we reject the null hypothesis \(H_0: \beta_k = 0\) using an alpha-level (rejection region) of .05?
  • Interpret the coefficient associated with “ever dropped out” in words
  • Interpret the coefficient \(\hat{\beta}_0\) in words
cred_mod2 <- lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, data = df_els_cc_delay)

summary(cred_mod2)
#> 
#> Call:
#> lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, 
#>     data = df_els_cc_delay)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -61.43 -32.76 -16.25  18.80 194.39 
#> 
#> Coefficients:
#>                                       Estimate Std. Error t value    Pr(>|t|)    
#> (Intercept)                             0.2327    10.6078   0.022      0.9825    
#> dev_math_cat31 course                   6.8860     4.3948   1.567      0.1176    
#> dev_math_cat32+ courses                21.2571     4.2554   4.995 0.000000747 ***
#> bytxmstd                                0.8147     0.2074   3.928 0.000094201 ***
#> f2everdoEvidence of a dropout episode -11.0109     4.8070  -2.291      0.0223 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 46 on 681 degrees of freedom
#> Multiple R-squared:  0.05485,    Adjusted R-squared:  0.0493 
#> F-statistic:  9.88 on 4 and 681 DF,  p-value: 0.00000008951

#summary(lm(formula = f3tzpostern ~ dev_math_cat4 + bytxmstd + f2everdo, data = df_els_cc_delay))
#summary(lm(formula = f3tzpostern ~ dev_math_cat4 + bytxmstd + f2everdo, data = df_els_cc))
  • YOUR ANSWER HERE:

/2

8. With respect to the above model, interpret the value of \(R^2\) in words (fine to use “R-squared” rather than “adjusted R-squared”) and interpret the value of SER in words.

  • YOUR ANSWER HERE:

/4

9. Choose two additional control variables to add to the model that you think satisfy both conditions of omitted variable bias. Separately for each of these two variables, answer the following: Why might this variable satisfy the first condition of omitted variable bias? Why might this variable satisfy the second condition of omitted variable bias? If the variable is categorical, which category will be used as the “reference category” when we include this variable in a regression model?

Note: to help you think of control variables you might add to your model, the below code shows variable labels associated with each variable (output omitted)
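
For a factor variable, one way to check the reference category is to look at the order of its levels: under R's default treatment contrasts, lm() uses the first level as the reference. The sketch below uses a toy factor with hypothetical categories (parent_ed is not an ELS variable name) to show the idea:

```r
# Toy factor (hypothetical categories, not an ELS variable)
parent_ed <- factor(
  c("HS or less", "Some college", "BA or higher", "HS or less"),
  levels = c("HS or less", "Some college", "BA or higher")
)

# The first level is the reference category under the default treatment contrasts
ref_default <- levels(parent_ed)[1]

# relevel() returns a copy of the factor with a different reference category
parent_ed_alt <- relevel(parent_ed, ref = "BA or higher")
ref_alt <- levels(parent_ed_alt)[1]

print(ref_default)
print(ref_alt)
```

Running levels() on a variable in df_els_cc_delay in the same way should show which category will serve as its reference group.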

  • YOUR ANSWER HERE:

/2

10. Run the regression model with the inclusion of the two new control variables you identified above. Create an object called cred_mod3 with the results of this regression model and then summarize the object cred_mod3.

Note: if you are having trouble adding variables to your model or the output looks odd, send a DM on Slack to Patricia Martín for help.
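
As a syntax reminder only, here is a minimal sketch of adding two regressors to an existing model. It uses R's built-in mtcars data rather than the ELS data, and the names base_mod and bigger_mod are hypothetical. Typing out the full lm() formula, as in cred_mod2, works equally well; update() is simply one alternative:

```r
# Built-in mtcars data used purely for illustration (not the ELS data)
base_mod <- lm(mpg ~ factor(cyl) + wt, data = mtcars)

# update() refits the model: ". ~ ." keeps the existing outcome and terms,
# and the terms after "+" are added as new regressors
bigger_mod <- update(base_mod, . ~ . + hp + factor(am))

summary(bigger_mod)
```

Here one continuous variable (hp) and one two-level factor (am) each add a single coefficient to the output.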

/6

11. Answer the following questions based on the regression model associated with cred_mod3:

  • Interpret the coefficients associated with developmental math in words
  • For each coefficient associated with developmental math do we reject the null hypothesis \(H_0: \beta_k = 0\) using an alpha-level (rejection region) of .05?
  • Interpret two coefficients associated with the new control variables you decided to add to your model
    • Note: If you added a categorical variable with six categories, this would be associated with five regression coefficients. You don’t have to interpret all five coefficients; you could interpret just two of them.
  • YOUR ANSWER HERE:

/3

12. Below we show the regression output associated with the model cred_mod2 (control variables are high school math score and ever dropped out of high school). cred_mod2 is based on the data frame df_els_cc_delay, which includes students who started at a community college and had at least a one-year delay between finishing high school and starting college. We also show the regression output from the model cred_mod2_allcc, which includes the same variables as cred_mod2 but is based on the data frame df_els_cc, which includes all students who started at a community college, regardless of whether they delayed entry between high school and college. After running the models below, answer the following questions:

Questions to answer:

  • Interpret the coefficients associated with developmental math from the model cred_mod2_allcc.
  • In a couple of sentences, why do you think these coefficients are different than the coefficients from the model cred_mod2?
cred_mod2 <- lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, data = df_els_cc_delay)

summary(cred_mod2)
#> 
#> Call:
#> lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, 
#>     data = df_els_cc_delay)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -61.43 -32.76 -16.25  18.80 194.39 
#> 
#> Coefficients:
#>                                       Estimate Std. Error t value    Pr(>|t|)    
#> (Intercept)                             0.2327    10.6078   0.022      0.9825    
#> dev_math_cat31 course                   6.8860     4.3948   1.567      0.1176    
#> dev_math_cat32+ courses                21.2571     4.2554   4.995 0.000000747 ***
#> bytxmstd                                0.8147     0.2074   3.928 0.000094201 ***
#> f2everdoEvidence of a dropout episode -11.0109     4.8070  -2.291      0.0223 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 46 on 681 degrees of freedom
#> Multiple R-squared:  0.05485,    Adjusted R-squared:  0.0493 
#> F-statistic:  9.88 on 4 and 681 DF,  p-value: 0.00000008951

cred_mod2_allcc <- lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, data = df_els_cc)

summary(cred_mod2_allcc)
#> 
#> Call:
#> lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, 
#>     data = df_els_cc)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -102.13  -48.44  -12.07   44.89  225.06 
#> 
#> Coefficients:
#>                                       Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)                            -5.6978     8.0425  -0.708                0.479    
#> dev_math_cat31 course                  -0.6772     3.1472  -0.215                0.830    
#> dev_math_cat32+ courses                12.4269     3.0045   4.136      0.0000365950466 ***
#> bytxmstd                                1.6166     0.1526  10.597 < 0.0000000000000002 ***
#> f2everdoEvidence of a dropout episode -32.4661     4.8269  -6.726      0.0000000000219 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 58.83 on 2291 degrees of freedom
#> Multiple R-squared:  0.06953,    Adjusted R-squared:  0.06791 
#> F-statistic:  42.8 on 4 and 2291 DF,  p-value: < 0.00000000000000022
  • YOUR ANSWER HERE:

Part III: Effect of developmental math on transfer to 4-yr

The code below shows descriptive statistics for the dichotomous variable f3tzever4yr, which identifies whether the student ever attended a 4-year institution, for our sample of students who started at a community college and had at least a one-year delay between finishing high school and starting college.

df_els_cc_delay %>% count(f3tzever4yr)
#> # A tibble: 2 x 2
#>   f3tzever4yr     n
#>   <fct>       <int>
#> 1 No            531
#> 2 Yes           155

The below code runs a model of the relationship between developmental math (\(X\)) and a dichotomous measure of whether the student ever transferred to a four-year institution (\(Y\)), controlling for high school math score and whether the student ever dropped out of high school.

transfer_mod1 <- lm(formula = as.integer(f3tzever4yr) ~ dev_math_cat3 + bytxmstd + f2everdo, data = df_els_cc_delay)

summary(transfer_mod1)
#> 
#> Call:
#> lm(formula = as.integer(f3tzever4yr) ~ dev_math_cat3 + bytxmstd + 
#>     f2everdo, data = df_els_cc_delay)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.3926 -0.2481 -0.2028 -0.1009  0.9678 
#> 
#> Coefficients:
#>                                        Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)                            3.950634   0.095750  41.260 < 0.0000000000000002 ***
#> dev_math_cat31 course                 -0.012711   0.039669  -0.320              0.74875    
#> dev_math_cat32+ courses                0.039814   0.038411   1.037              0.30032    
#> bytxmstd                               0.005881   0.001872   3.142              0.00175 ** 
#> f2everdoEvidence of a dropout episode -0.084423   0.043390  -1.946              0.05210 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4152 on 681 degrees of freedom
#> Multiple R-squared:  0.02134,    Adjusted R-squared:  0.01559 
#> F-statistic: 3.712 on 4 and 681 DF,  p-value: 0.005343
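
A note on reading this output: as.integer() applied to a factor returns the factor's internal level codes rather than 0/1 values, which is presumably why the intercept above falls far outside the 0 to 1 range. Adding a constant to the outcome shifts only the intercept and leaves the slope estimates unchanged. The sketch below illustrates this with R's built-in mtcars data (not the ELS data); the object names are hypothetical:

```r
# Built-in mtcars data used purely for illustration (not the ELS data);
# the variable am is already coded 0/1
mod_01 <- lm(am ~ wt, data = mtcars)

# Same outcome shifted up by 1 (coded 1/2), mimicking integer codes taken from a factor
mod_12 <- lm(I(am + 1) ~ wt, data = mtcars)

# The slope is identical in the two models
slope_01 <- unname(coef(mod_01)["wt"])
slope_12 <- unname(coef(mod_12)["wt"])

# The intercepts differ by exactly the constant that was added
intercept_shift <- unname(coef(mod_12)["(Intercept)"] - coef(mod_01)["(Intercept)"])

print(c(slope_01 = slope_01, slope_12 = slope_12, intercept_shift = intercept_shift))
```

So the developmental math coefficients in transfer_mod1 can still be read as differences in the predicted probability of transfer, even though the intercept is not itself a probability.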

/8

1. Based on the above model:

Do the following:

  • Write out the population linear regression model (make sure to define variables)

  • Write out the OLS prediction line without estimate values

  • Write out the OLS prediction line with estimate values

  • Interpret the coefficients associated with developmental math in words

  • For each coefficient associated with developmental math do we reject the null hypothesis \(H_0: \beta_k = 0\) using an alpha-level (rejection region) of .05?

  • YOUR ANSWER HERE:

Part IV: BONUS question (OPTIONAL)

/10

What are some things we should do (e.g., different dependent variable, additional control variables in our model, additional sample restrictions) to improve the analysis so we would feel comfortable making policy recommendations based on this analysis?

  • Write at least a paragraph or two to explain your reasoning.

Part V: Post a comment/question

/2

  • Go to the class #problemsets channel and create a new post.
  • You can either:
    • Share something you learned or a question from this problem set. Make sure to mention the instructors ((???), (???) Martín).
    • Respond to a post made by another student.

Knit to html and submit problem set

Knit to html by clicking the “Knit” button near the top of your RStudio window (icon with blue yarn ball) or drop down and select “Knit to HTML”.

  • Go to the class website and under the “Readings & Assignments” >> “Week 10” tab, click on the “Problem set 4 submission link”
  • Submit both your html and .Rmd files
  • Use this naming convention “lastname_firstname_ps#” for your .Rmd (e.g. martin_patricia_ps4.Rmd)

References

Cabrera, N. L., Milem, J. F., Jaquette, O., & Marx, R. (2014). Missing the (student achievement) forest for all the (political) trees: Empiricism and the Mexican American Studies controversy in Tucson. American Educational Research Journal, 51(6), 1084–1118.