EDUC152, Problem Set #4

Load libraries and data

Note: code chunk omitted from html document using include = FALSE

Grade: /80

Overview

Problem set 4 is the take-home final for EDUC152. As such, it is longer and more challenging than previous problem sets. Because this is a take-home final, we cannot answer questions except for providing clarification. If you need clarification on a question, please send us a DM and/or come to our office hours.

The first part of the problem set asks to answer questions about the article by Cabrera, Milem, Jaquette, & Marx (2014):

Cabrera, N. L., Milem, J. F., Jaquette, O., & Marx, R. (2014). Missing the (student achievement) forest for all the (political) trees: Empiricism and the Mexican American Studies controversy in Tucson. American Educational Research Journal, 51(6), 1084-1118.

The remainder of the problem set asks questions about the following research question, "What is the effect of taking developmental math courses in college ($X$) on student success ($Y$) for students who start postsecondary education at a community college and who have a delayed entry (at least a one-year delay between completing high school and starting college)?

We will consider two different outcome variables, $Y$

(continuous) total number of credits earned in postsecondary education (across all insitutions attended)

(dichotomous) whether the student ever transferred to a four-year institution

Developmental math courses, $X$, teach math skills that students are generally expected to learn in high school

These courses are sometimes referred to as “remedial” but I prefer the term developmental

Usually, students are assigned to developmental math courses based on their score in a placement exam or based on the recommendation of a community college guidance counselor

Generally, developmental math courses do not count for “college credit” but students must pass these courses before they can take college credit math courses

Many researchers and policymakers have been critical about placing too many community college students in developmental math courses and recommend that most students should be placed immediately into college-level math courses.

Our research question will assess whether taking developmental math courses helps or hinders the success of students who have a delayed entry to college (at least a one-year delay between completing high school and starting college)

Click here for tips on notation

Some questions will ask you to write out notation and/or equations.

You can write out notation/equations one of two ways: (1) using “inline equations,” which begin with a dollar sign $ and end with a dollar sign $; OR (2) you can write out notation/equation in plain text without. We encourage you to try inline equations, but fine if you do not.

Tips on writing notation/equations using "inline equations’:

Make sure there are no spaces after the dollar sign $ that begins the equation and no spaces before the dollar sign that ends the equation.

For example, you would write out the notation for treated potential outcome like this: $Y_i(1)$

But this wouldn’t work: $ Y_i(1)$

And this wouldn’t work: $Y_i(1) $

Special characters – like greek letters – within inline equations are referred to using special symbols that start with a backslash

e.g., “Beta” is \beta: $\beta$

“Mu” (symbol for population mean) is \mu: $\mu$

Subscripts after a character or symbol are specified like this:

e.g., “Beta subscript 1” is beta_1: $\beta_1$

e.g., “Mu subscript Y” (referring to population mean of variable Y) is \mu_Y: $\mu_Y$

“hats” are specified by wrapping the character/symbol within curly brackets \hat{} like this:

e.g., “Beta hat” is \hat{\beta}: $\hat{\beta}$

e.g., “Beta hat subscript 1” is \hat{\beta}_1 (note that the subscript is not within the “hat”): $\hat{\beta}_1$

“bars” are specified by wrapping the character/symbol within curly brackets \bar{} like this:

e.g., “sample mean of Y” is \bar{Y}: $\bar{Y}$

Don’t worry about getting it perfect and don’t spend too much time trying to get it perfect; if you are trying, that is a great start! and fine to use inline equations for some notation/equations and plain text for others that you can’t figure out.

Tips on writing notation/equations in plain text

Instead of writing $Y_i(1)$, you could write this: Y_i(1)

Instead of writing $Y_i = \beta_0 + \beta_1X_i + u_i$, you could write this: Y_i = beta_0 + beta_1*X_i + u_i

Instead of writing $\hat{Y_i} = \hat{\beta}_0 + \hat{\beta}_1X_i$, you could write something like this: Y_hat_i = beta_hat_0 + beta_hat_1*X_i

don’t worry if it doesn’t look pretty!

Part I: Questions about Cabrera et al. (2014)

/2

1. In a couple sentences, what was the Mexican American Studies program and who was allowed to participate?

YOUR ANSWER HERE:

/2

2. In social science research a “mechanism” is an explanation for why one variable ($X$) has a causal effect on another variable ($Y$). In your own words, what are 2-3 mechanisms for why participation in the MAS program would have a positive causal effect on the probability of graduating from high school

YOUR ANSWER HERE:

/1

3. What does the conditional independendence assumption mean, in your own words?

YOUR ANSWER HERE:

/2

4. What does omitted variable bias mean, in your own words? and what are the two conditions that must be satisfied for an omitted variable $Z$ to cause omitted variable bias?

YOUR ANSWER HERE:

/2

5. What is the connection between the conditional independence assumption and omitted variable bias?

YOUR ANSWER HERE:

/1

6. If students had been randomly assigned to participate in MAS (as in an experiment), why would we be unconcerned about omitted variable bias?

YOUR ANSWER HERE:

/3

7. In reality, students self-selected into MAS participation or non-participation. Think of one variable that might affect (positively or negatively) high school graduation and might be related (positively or negatively) to MAS participation? Why would this variable affect the probability of graduation and why might this variable have a relationship with participation in MAS? Why would we be concerned about our estimate of the effect of MAS based on a regression model that did not include this variable?

YOUR ANSWER HERE:

/1

8. In observational data (unit x not randomly assigned to values of $X$), when researchers use regression to estimate the causal effect of $X$ on $Y$, explain the primary strategy researchers use to eliminate (or at least reduce) omitted variable bias?

YOUR ANSWER HERE:

/2

9. The “linear probability model” basically means applying ordinary least squares (OLS) regression to a dichotomous (0/1) outcome variable as opposed to a continuous outcome variable. In the linear probability model, what is the “generic” interpretation of $\hat{\beta}_1$ when $X_1$ is a categorical variable?

YOUR ANSWER HERE:

/4

10. Below is a partial screenshot of Table 5 from Cabrera et al. (2014), where the dependent variable is a dichotomous measure of whether the student graduated from high school, and the independent variable of interest is a categorical measure of the number of MAS courses taken (reference group is zero MAS courses). Each column in Table 5 is a different regression model. Interpret the coefficients for the “All Cohorts” model in words

YOUR ANSWER HERE:

Part II: Effect of developmental math on college credits

The code chunk below shows some descriptive statistics about the variables in the model.

The data frame we will use is df_els_cc_delay

each observation $i$ is a student

consists of students who started postsecondary education at a community college and who had at least a one-year delay between finishing high school and starting college

all categorical variables are factor class

Note: the data frame df_els_cc is the same as df_els_cc_delay but includes students who started at community college with no delay after finishing hich school

The dependent variable, $Y$, is a continuous measure of the total number of postsecondary credits earned, across all institutions

measured by f3tzpostern

variable label: Transcript: Postsecondary career: known credits earned

The independent variable, $X$, is a three-categorical measure of number of developmental math courses taken. We have several alternative versions of this variable:

label for the variable dev_math_cat3: three category indicator of whether student took any developmental math courses in postsecondary education (based on f3tzremmttot)

# dependent variable: number of postsecondary credits df_els_cc_delay %>% summarize( mean_pse_cred = mean(f3tzpostern, na.rm = TRUE), sd_pse_cred = sd(f3tzpostern, na.rm = TRUE) ) #> # A tibble: 1 x 2 #> mean_pse_cred sd_pse_cred #> <dbl> <dbl> #> 1 45.1 47.2 # independent variable: taking developmental math courses # three category df_els_cc_delay %>% count(dev_math_cat3) #> # A tibble: 3 x 2 #> dev_math_cat3 n #> <fct> <int> #> 1 0 courses 318 #> 2 1 course 173 #> 3 2+ courses 195 # four category #df_els_cc_delay %>% count(dev_math_cat4) # dichotomous #df_els_cc_delay %>% count(dev_math_01)

/2

1. Our goal is to estimate the causal effect of taking development math courses ($X$) on number of postsecondary credits earned ($Y$) for the population of students who start at a community college and who have at least a one-year delay between finishing high school and starting college. Recall that a “mechanism” is an explanation for why one variable ($X$) has a causal effect on another variable ($Y$). Give one explanation for why taking developmental math courses ($X$) might have a negative causal effect on number of postsecondary credits earned ($Y$) for our population of interest (one or two sentences is fine). Give one explanation for why taking developmental math courses ($X$) might have a positive causal effect on number of postsecondary credits earned ($Y$) for our population of interest (one or two sentences is fine).

YOUR ANSWER HERE:

/8

2. Run the below model of the relationship between developmental math ($X$) and college credits earned ($Y$) with no control variables and then do the following.

Do the following:

Write out the population linear regression model (make sure to define variables)

Write out the OLS prediction line without estimate values

Write out the OLS prediction line with estimate values

Interpret the coefficients associated with developmental math in words

For each coefficient associated with developmental math do we reject the null hypothesis $H_0: \beta_k = 0$ using an alpha-level (rejection region) of .05?

Interpret the coefficient for $\hat{\beta_0}$ in words

cred_mod1 <- lm(formula = f3tzpostern ~ dev_math_cat3, data = df_els_cc_delay) summary(cred_mod1) #> #> Call: #> lm(formula = f3tzpostern ~ dev_math_cat3, data = df_els_cc_delay) #> #> Residuals: #> Min 1Q Median 3Q Max #> -57.02 -34.06 -16.97 21.02 184.94 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 38.969 2.614 14.906 < 0.0000000000000002 *** #> dev_math_cat31 course 4.089 4.404 0.928 0.353 #> dev_math_cat32+ courses 18.052 4.240 4.257 0.0000236 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 46.62 on 683 degrees of freedom #> Multiple R-squared: 0.02649, Adjusted R-squared: 0.02364 #> F-statistic: 9.291 on 2 and 683 DF, p-value: 0.0001044

YOUR ANSWER HERE:

/3

3. Define $R^2$ in words. Write out the mathematical formula for $R^2$ and interpret the value of $R^2$ from the above model in words

YOUR ANSWER HERE:

/3

4. Explain what standard error of the regression (SER) means in words, write out the formula for SER (in terms of SSR), and interpret the value of SER from the above model in words

YOUR ANSWER HERE:

/3

5. Explain what sample standard deviation of a variable means in words and write out the formula. Interpet the value of standard deviation for the variable f3tzpostern (given below) in words. Using your judgment, does our regression model make our predictions about the value of f3tzpostern substantially better than just using the sample mean of f3tzpostern?

# standard deviation of Y sd(df_els_cc_delay$f3tzpostern, na.rm = TRUE) #> [1] 47.17989 # Standard error of the regression summary(cred_mod1)$sigma #> [1] 46.619

YOUR ANSWER HERE:

/4

6. Now, we will start adding “control” variables to our model. We will add ,bytxmstd, a continuous measure of high school standardized math score, and f2everdo a dichotomous categorical measure of whether the student ever had a “dropout episode” in high school. Run the below code which provides descriptive statistics of these variables. Separately for each of these variables, answer the following: Why might this variable satisfy the fist condition of omitted variable bias? Why might this variable satisfy the second condition of omitted variable bias?

YOUR ANSWER HERE:

Descriptive statistics

# continuous control, high school math standardized test score df_els_cc_delay %>% summarize( mean_hs_math = mean(bytxmstd, na.rm = TRUE), min_hs_math = min(bytxmstd, na.rm = TRUE), max_hs_math = max(bytxmstd, na.rm = TRUE), sd_hs_math = sd(bytxmstd, na.rm = TRUE) ) #> # A tibble: 1 x 4 #> mean_hs_math min_hs_math max_hs_math sd_hs_math #> <dbl> <dbl+lbl> <dbl+lbl> <dbl> #> 1 47.7 22.3 75.1 8.65 # dichotomous measure of whether student ever had a dropout episode df_els_cc_delay %>% count(f2everdo) #> # A tibble: 2 x 2 #> f2everdo n #> <fct> <int> #> 1 No available evidence of dropout episode 577 #> 2 Evidence of a dropout episode 109 # dichotomous measure of whether student ever had a dropout episode, showing integer values rather than factor levels df_els_cc_delay %>% count(as.integer(f2everdo)) #> # A tibble: 2 x 2 #> `as.integer(f2everdo)` n #> <int> <int> #> 1 1 577 #> 2 2 109

/10

7. Run the below model of the relationship between developmental math ($X$) and college credits earned ($Y$) with the inclusion of control variables, and then do the following:

Do the following:

Write out the population linear regression model (make sure to define variables)

Write out the OLS prediction line without estimate values

Write out the OLS prediction line with estimate values

Interpret the coefficients associated with developmental math in words

Interpret the coefficient associated with high school math score in words

For each coefficient associated with developmental math do we reject the null hypothesis $H_0: \beta_k = 0$ using an alpha-level (rejection region) of .05?

Interpret the coefficient associated with “ever dropped out” in words

Interpret the coefficient for $\hat{\beta_0}$ in words

cred_mod2 <- lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, data = df_els_cc_delay) summary(cred_mod2) #> #> Call: #> lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, #> data = df_els_cc_delay) #> #> Residuals: #> Min 1Q Median 3Q Max #> -61.43 -32.76 -16.25 18.80 194.39 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 0.2327 10.6078 0.022 0.9825 #> dev_math_cat31 course 6.8860 4.3948 1.567 0.1176 #> dev_math_cat32+ courses 21.2571 4.2554 4.995 0.000000747 *** #> bytxmstd 0.8147 0.2074 3.928 0.000094201 *** #> f2everdoEvidence of a dropout episode -11.0109 4.8070 -2.291 0.0223 * #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 46 on 681 degrees of freedom #> Multiple R-squared: 0.05485, Adjusted R-squared: 0.0493 #> F-statistic: 9.88 on 4 and 681 DF, p-value: 0.00000008951 #summary(lm(formula = f3tzpostern ~ dev_math_cat4 + bytxmstd + f2everdo, data = df_els_cc_delay)) #summary(lm(formula = f3tzpostern ~ dev_math_cat4 + bytxmstd + f2everdo, data = df_els_cc))

YOUR ANSWER HERE:

/2

8. With respect to the above model, interpret the value of $R^2$ in words (fine to use “R-squared” rather than “adjusted R-squared”) and interpret the value of SER in words

YOUR ANSWER HERE:

/4

9. Choose two additional control variables that you think satisfy both conditions of omitted variable bias that you will add to the model. Separately for each of these two variables, answer the following: why might this variable satisfy the fist condition of omitted variable bias? Why might this variable satisfy the second condition of omitted variable bias? If the variable is categorical, which category will be used as the “reference category” when we include this variable in a regression model?

Note: to help you think of control variables you might add to your model, the below code shows variable labels associated with each variable (output omitted)

YOUR ANSWER HERE:

/2

10. Run the regression model with the inclusion of the two new control variables you identified above. Create an object called cred_mod3 with the results of this regression model and then summarize the object cred_mod3.

Note: if you are having trouble adding variables to your model or the output looks odd, send a DM on slack to Patricia Martín for help.

/6

11. Answer the following questions based on the regression model associated with cred_mod3:

Interpret the coefficients associated with developmental math in words

For each coefficient associated with developmental math do we reject the null hypothesis $H_0: \beta_k = 0$ using an alpha-level (rejection region) of .05?

Interpret two coefficients associated with the new control variables you decided to add to your model

Note: If you added a categorical variable with six categories, this would be associated with five regression coefficients. You don’t have to interpret the coefficients for all five coefficients. You could just interpret two of these coefficients.

YOUR ANSWER HERE:

/3

12. Below we show the regression output associated with the model cred_mod2 (control variables are high school math score and ever-droppoed out of high school). cred_mod2 is based on the data frame df_els_cc_delay which includes students who started at a community college and had at least a one-year delay between finishing high school and starting college. We also show the regression output from the model cred_mod2_allcc, includes the same variables as cred_mod2 but is based on the data frame which includes all students who started at community college, regardless of whether they delayed entry between high school and college. After running the below models, answer the following questions:

Questions to answer:

Interpret the coefficients associated with developmental math from the model cred_mod2_allcc.

In a couple of sentences, why do you think these coefficients are different than the coefficients from the model cred_mod2?

cred_mod2 <- lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, data = df_els_cc_delay) summary(cred_mod2) #> #> Call: #> lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, #> data = df_els_cc_delay) #> #> Residuals: #> Min 1Q Median 3Q Max #> -61.43 -32.76 -16.25 18.80 194.39 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 0.2327 10.6078 0.022 0.9825 #> dev_math_cat31 course 6.8860 4.3948 1.567 0.1176 #> dev_math_cat32+ courses 21.2571 4.2554 4.995 0.000000747 *** #> bytxmstd 0.8147 0.2074 3.928 0.000094201 *** #> f2everdoEvidence of a dropout episode -11.0109 4.8070 -2.291 0.0223 * #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 46 on 681 degrees of freedom #> Multiple R-squared: 0.05485, Adjusted R-squared: 0.0493 #> F-statistic: 9.88 on 4 and 681 DF, p-value: 0.00000008951 cred_mod2_allcc <- lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, data = df_els_cc) summary(cred_mod2_allcc) #> #> Call: #> lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, #> data = df_els_cc) #> #> Residuals: #> Min 1Q Median 3Q Max #> -102.13 -48.44 -12.07 44.89 225.06 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) -5.6978 8.0425 -0.708 0.479 #> dev_math_cat31 course -0.6772 3.1472 -0.215 0.830 #> dev_math_cat32+ courses 12.4269 3.0045 4.136 0.0000365950466 *** #> bytxmstd 1.6166 0.1526 10.597 < 0.0000000000000002 *** #> f2everdoEvidence of a dropout episode -32.4661 4.8269 -6.726 0.0000000000219 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 58.83 on 2291 degrees of freedom #> Multiple R-squared: 0.06953, Adjusted R-squared: 0.06791 #> F-statistic: 42.8 on 4 and 2291 DF, p-value: < 0.00000000000000022

YOUR ANSWER HERE:

Part III: Effect of developmental math on transfer to 4-yr

The below code shows descriptive statistics for the dichotomous variable f3tzever4yr, which identifies whether the student ever attended a 4-year institution, for our sample of students who started at a community college and had at least a one-year delay between finishing high school and starting college

df_els_cc_delay %>% count(f3tzever4yr) #> # A tibble: 2 x 2 #> f3tzever4yr n #> <fct> <int> #> 1 No 531 #> 2 Yes 155

The below code runs a model of the relationship between developmental math ($X$) and a dichotomous measure of whether the student ever transferred to a four-year institution ($Y$), controlling for high school math score and whether the student ever dropped out of high school.

transfer_mod1 <- lm(formula = as.integer(f3tzever4yr) ~ dev_math_cat3 + bytxmstd + f2everdo, data = df_els_cc_delay) summary(transfer_mod1) #> #> Call: #> lm(formula = as.integer(f3tzever4yr) ~ dev_math_cat3 + bytxmstd + #> f2everdo, data = df_els_cc_delay) #> #> Residuals: #> Min 1Q Median 3Q Max #> -0.3926 -0.2481 -0.2028 -0.1009 0.9678 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 3.950634 0.095750 41.260 < 0.0000000000000002 *** #> dev_math_cat31 course -0.012711 0.039669 -0.320 0.74875 #> dev_math_cat32+ courses 0.039814 0.038411 1.037 0.30032 #> bytxmstd 0.005881 0.001872 3.142 0.00175 ** #> f2everdoEvidence of a dropout episode -0.084423 0.043390 -1.946 0.05210 . #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 0.4152 on 681 degrees of freedom #> Multiple R-squared: 0.02134, Adjusted R-squared: 0.01559 #> F-statistic: 3.712 on 4 and 681 DF, p-value: 0.005343

/8

1. Based on the above model:

Do the following:

Write out the population linear regression model (make sure to define variables)

Write out the OLS prediction line without estimate values

Write out the OLS prediction line with estimate values

Interpret the coefficients associated with developmental math in words

For each coefficient associated with developmental math do we reject the null hypothesis $H_0: \beta_k = 0$ using an alpha-level (rejection region) of .05?

YOUR ANSWER HERE:

Part IV: BONUS question (OPTIONAL)

/10

What are some things we should do (e.g., different dependent variable, additional controls varaibles in our model, additional sample restrictions) to improve the analysis so we would feel comfortable making policy recommendations based on this analysis?

Write at least a paragraph or two to explain your reasoning.

Part V: Post a comment/question

/2

Go to the class #problemsets channel and create a new post.

You can either:

Share something you learned or a question from this problem set. Make sure to mention the instructors ((???), (???) Martín).

Respond to a post made by another student.

Knit to html and submit problem set

Knit to html by clicking the “Knit” button near the top of your RStudio window (icon with blue yarn ball) or drop down and select “Knit to HTML”

Go to the class website and under the “Readings & Assignments” >> “Week 10” tab, click on the “Problem set 4 submission link”

Submit both your html and .Rmd files

Use this naming convention “lastname_firstname_ps#” for your .Rmd (e.g. martin_patricia_ps4.Rmd)

References

Cabrera, N. L., Milem, J. F., Jaquette, O., & Marx, R. (2014). Missing the (student achievement) forest for all the (political) trees: Empiricism and the mexican american studies controversy in tucson. American Educational Research Journal, 51(6), 1084–1118.

EDUC152, Problem Set #4

Part I: Questions about Cabrera et al. (2014)

1. In a couple sentences, what was the Mexican American Studies program and who was allowed to participate?

3. What does the conditional independendence assumption mean, in your own words?

4. What does omitted variable bias mean, in your own words? and what are the two conditions that must be satisfied for an omitted variable \(Z\) to cause omitted variable bias?

5. What is the connection between the conditional independence assumption and omitted variable bias?

6. If students had been randomly assigned to participate in MAS (as in an experiment), why would we be unconcerned about omitted variable bias?

8. In observational data (unit x not randomly assigned to values of \(X\)), when researchers use regression to estimate the causal effect of \(X\) on \(Y\), explain the primary strategy researchers use to eliminate (or at least reduce) omitted variable bias?

Part II: Effect of developmental math on college credits

2. Run the below model of the relationship between developmental math (\(X\)) and college credits earned (\(Y\)) with no control variables and then do the following.

3. Define \(R^2\) in words. Write out the mathematical formula for \(R^2\) and interpret the value of \(R^2\) from the above model in words

4. Explain what standard error of the regression (SER) means in words, write out the formula for SER (in terms of SSR), and interpret the value of SER from the above model in words

7. Run the below model of the relationship between developmental math (\(X\)) and college credits earned (\(Y\)) with the inclusion of control variables, and then do the following:

8. With respect to the above model, interpret the value of \(R^2\) in words (fine to use “R-squared” rather than “adjusted R-squared”) and interpret the value of SER in words

10. Run the regression model with the inclusion of the two new control variables you identified above. Create an object called `cred_mod3` with the results of this regression model and then summarize the object `cred_mod3`.

11. Answer the following questions based on the regression model associated with `cred_mod3`:

Part III: Effect of developmental math on transfer to 4-yr

1. Based on the above model:

Part IV: BONUS question (OPTIONAL)

Part V: Post a comment/question

Knit to html and submit problem set

References

EDUC152, Problem Set #4

Part I: Questions about Cabrera et al. (2014)

1. In a couple sentences, what was the Mexican American Studies program and who was allowed to participate?

3. What does the conditional independendence assumption mean, in your own words?

4. What does omitted variable bias mean, in your own words? and what are the two conditions that must be satisfied for an omitted variable \(Z\) to cause omitted variable bias?

5. What is the connection between the conditional independence assumption and omitted variable bias?

6. If students had been randomly assigned to participate in MAS (as in an experiment), why would we be unconcerned about omitted variable bias?

8. In observational data (unit x not randomly assigned to values of \(X\)), when researchers use regression to estimate the causal effect of \(X\) on \(Y\), explain the primary strategy researchers use to eliminate (or at least reduce) omitted variable bias?

Part II: Effect of developmental math on college credits

2. Run the below model of the relationship between developmental math (\(X\)) and college credits earned (\(Y\)) with no control variables and then do the following.

3. Define \(R^2\) in words. Write out the mathematical formula for \(R^2\) and interpret the value of \(R^2\) from the above model in words

4. Explain what standard error of the regression (SER) means in words, write out the formula for SER (in terms of SSR), and interpret the value of SER from the above model in words

7. Run the below model of the relationship between developmental math (\(X\)) and college credits earned (\(Y\)) with the inclusion of control variables, and then do the following:

8. With respect to the above model, interpret the value of \(R^2\) in words (fine to use “R-squared” rather than “adjusted R-squared”) and interpret the value of SER in words

10. Run the regression model with the inclusion of the two new control variables you identified above. Create an object called cred_mod3 with the results of this regression model and then summarize the object cred_mod3.

11. Answer the following questions based on the regression model associated with cred_mod3:

Part III: Effect of developmental math on transfer to 4-yr

1. Based on the above model:

Part IV: BONUS question (OPTIONAL)

Part V: Post a comment/question

Knit to html and submit problem set

References

10. Run the regression model with the inclusion of the two new control variables you identified above. Create an object called `cred_mod3` with the results of this regression model and then summarize the object `cred_mod3`.

11. Answer the following questions based on the regression model associated with `cred_mod3`: