Load libraries and data
Note: code chunk omitted from html document using include = FALSE
Grade: /80
Overview
Problem set 4 is the take-home final for EDUC152. As such, it is longer and more challenging than previous problem sets. Because this is a take-home final, we cannot answer questions except for providing clarification. If you need clarification on a question, please send us a DM and/or come to our office hours.
The first part of the problem set asks you to answer questions about the article by Cabrera, Milem, Jaquette, & Marx (2014), cited in full at the end of this document.
The remainder of the problem set asks questions about the following research question: "What is the effect of taking developmental math courses in college (\(X\)) on student success (\(Y\)) for students who start postsecondary education at a community college and who have a delayed entry (at least a one-year delay between completing high school and starting college)?"
Some questions will ask you to write out notation and/or equations.
You can write out notation/equations in one of two ways: (1) using “inline equations,” which begin and end with a dollar sign $; OR (2) in plain text, without dollar signs. We encourage you to try inline equations, but it is fine if you do not.
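As a small illustration of option (1), the lines below show what you might type in the .Rmd file for a generic regression with outcome \(Y\) and regressor \(X\). The symbols, including the error term \(u_i\), are generic placeholders assumed for this sketch and are not tied to any particular question.

```latex
$Y_i = \beta_0 + \beta_1 X_i + u_i$
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
```

When the document is knit, each line between dollar signs is rendered as a typeset equation.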
Tips on writing notation/equations using “inline equations”:
\beta: \(\beta\)
\mu: \(\mu\)
\beta_1: \(\beta_1\)
\mu_Y: \(\mu_Y\)
\hat{} like this:
\hat{\beta}: \(\hat{\beta}\)
\hat{\beta}_1 (note that the subscript is not within the “hat”): \(\hat{\beta}_1\)
\bar{} like this:
\bar{Y}: \(\bar{Y}\)
Tips on writing notation/equations in plain text
/2
/2
/2
/2
/1
/3
/2
/4
The code chunk below shows some descriptive statistics about the variables in the model.
df_els_cc_delay
df_els_cc is the same as df_els_cc_delay but also includes students who started at community college with no delay after finishing high school
f3tzpostern
dev_math_cat3: three-category indicator of whether the student took any developmental math courses in postsecondary education (based on f3tzremmttot)
# dependent variable: number of postsecondary credits
df_els_cc_delay %>% summarize(
mean_pse_cred = mean(f3tzpostern, na.rm = TRUE),
sd_pse_cred = sd(f3tzpostern, na.rm = TRUE)
)
#> # A tibble: 1 x 2
#> mean_pse_cred sd_pse_cred
#> <dbl> <dbl>
#> 1 45.1 47.2
# independent variable: taking developmental math courses
# three category
df_els_cc_delay %>% count(dev_math_cat3)
#> # A tibble: 3 x 2
#> dev_math_cat3 n
#> <fct> <int>
#> 1 0 courses 318
#> 2 1 course 173
#> 3 2+ courses 195
# four category
#df_els_cc_delay %>% count(dev_math_cat4)
# dichotomous
#df_els_cc_delay %>% count(dev_math_01)
/2
/8
Do the following:
cred_mod1 <- lm(formula = f3tzpostern ~ dev_math_cat3, data = df_els_cc_delay)
summary(cred_mod1)
#>
#> Call:
#> lm(formula = f3tzpostern ~ dev_math_cat3, data = df_els_cc_delay)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -57.02 -34.06 -16.97 21.02 184.94
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 38.969 2.614 14.906 < 0.0000000000000002 ***
#> dev_math_cat31 course 4.089 4.404 0.928 0.353
#> dev_math_cat32+ courses 18.052 4.240 4.257 0.0000236 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 46.62 on 683 degrees of freedom
#> Multiple R-squared: 0.02649, Adjusted R-squared: 0.02364
#> F-statistic: 9.291 on 2 and 683 DF, p-value: 0.0001044
/3
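As a rough sketch of how to read this output (not a graded answer), the coefficients on a categorical regressor are differences from the reference group, here “0 courses”, and each t value is the estimate divided by its standard error. The numbers below are copied from the summary(cred_mod1) output above; the object names are introduced only for illustration.

```r
# Values copied from the summary(cred_mod1) output; object names are illustrative only
b0 <- 38.969   # intercept: predicted credits for the reference group ("0 courses")
b1 <- 4.089    # difference for "1 course" relative to "0 courses"
b2 <- 18.052   # difference for "2+ courses" relative to "0 courses"

# Predicted credits for each category of dev_math_cat3
pred_credits <- c("0 courses" = b0, "1 course" = b0 + b1, "2+ courses" = b0 + b2)
print(pred_credits)

# t value = estimate / standard error (standard errors copied from the output)
t_1course <- b1 / 4.404
t_2plus   <- b2 / 4.240

# Two-sided critical value at alpha = .05 with 683 residual degrees of freedom
t_crit <- qt(0.975, df = 683)

# TRUE means |t| exceeds the critical value
print(abs(c(t_1course, t_2plus)) > t_crit)
```

Because the printed coefficients are rounded, the recomputed t values can differ from the printed ones in the last digit.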
/3
/3
f3tzpostern (given below) in words. Using your judgment, does our regression model make our predictions about the value of f3tzpostern substantially better than just using the sample mean of f3tzpostern?
# standard deviation of Y
sd(df_els_cc_delay$f3tzpostern, na.rm = TRUE)
#> [1] 47.17989
# Standard error of the regression
summary(cred_mod1)$sigma
#> [1] 46.619
/4
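One way to put the two numbers above side by side is sketched below. It assumes f3tzpostern has no missing values in this sample, so that n equals the 686 observations implied by the dev_math_cat3 counts (318 + 173 + 195) and by the 683 residual degrees of freedom. All object names are illustrative.

```r
# Values copied from the output above; object names are illustrative only
sd_y     <- 47.17989  # sd(f3tzpostern): typical distance from the sample mean
ser      <- 46.619    # summary(cred_mod1)$sigma: typical distance from the prediction line
n        <- 686       # assumed sample size (318 + 173 + 195)
df_resid <- 683       # residual degrees of freedom reported by summary(cred_mod1)

# Ratio close to 1 means the model's typical prediction error is nearly as
# large as the error from simply predicting the sample mean
ratio <- ser / sd_y
print(ratio)

# R-squared recovered from the same two quantities:
# 1 - (residual sum of squares) / (total sum of squares)
r_squared <- 1 - (ser^2 * df_resid) / (sd_y^2 * (n - 1))
print(r_squared)
```

The recovered value agrees, up to rounding, with the Multiple R-squared of 0.02649 reported in the model summary.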
bytxmstd, a continuous measure of high school standardized math score, and f2everdo, a dichotomous categorical measure of whether the student ever had a “dropout episode” in high school. Run the code below, which provides descriptive statistics for these variables. Separately for each of these variables, answer the following: Why might this variable satisfy the first condition of omitted variable bias? Why might this variable satisfy the second condition of omitted variable bias?
Descriptive statistics
# continuous control, high school math standardized test score
df_els_cc_delay %>% summarize(
mean_hs_math = mean(bytxmstd, na.rm = TRUE),
min_hs_math = min(bytxmstd, na.rm = TRUE),
max_hs_math = max(bytxmstd, na.rm = TRUE),
sd_hs_math = sd(bytxmstd, na.rm = TRUE)
)
#> # A tibble: 1 x 4
#> mean_hs_math min_hs_math max_hs_math sd_hs_math
#> <dbl> <dbl+lbl> <dbl+lbl> <dbl>
#> 1 47.7 22.3 75.1 8.65
# dichotomous measure of whether student ever had a dropout episode
df_els_cc_delay %>% count(f2everdo)
#> # A tibble: 2 x 2
#> f2everdo n
#> <fct> <int>
#> 1 No available evidence of dropout episode 577
#> 2 Evidence of a dropout episode 109
# dichotomous measure of whether student ever had a dropout episode, showing integer values rather than factor levels
df_els_cc_delay %>% count(as.integer(f2everdo))
#> # A tibble: 2 x 2
#> `as.integer(f2everdo)` n
#> <int> <int>
#> 1 1 577
#> 2 2 109
/10
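To make the idea of omitted variable bias concrete, here is a small simulation on made-up data. It does not use the ELS variables, and every name and number in it is hypothetical. The omitted variable z both helps determine y and is correlated with x, so leaving z out shifts the estimated coefficient on x away from its true value of 2.

```r
set.seed(152)                  # arbitrary seed so the sketch is reproducible
n <- 5000
z <- rnorm(n)                  # hypothetical omitted variable
x <- 0.8 * z + rnorm(n)        # x is correlated with z
y <- 2 * x + 3 * z + rnorm(n)  # z is also a determinant of y; true effect of x is 2

# z omitted: the coefficient on x absorbs part of z's effect
b_short <- coef(lm(y ~ x))[["x"]]

# z included: the coefficient on x is close to the true value of 2
b_long <- coef(lm(y ~ x + z))[["x"]]

print(c(omitted = b_short, controlled = b_long))
```

If either link is removed (z unrelated to x, or z unrelated to y), the two estimates should be close to each other, which is why both conditions matter.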
Do the following:
cred_mod2 <- lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, data = df_els_cc_delay)
summary(cred_mod2)
#>
#> Call:
#> lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo,
#> data = df_els_cc_delay)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -61.43 -32.76 -16.25 18.80 194.39
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.2327 10.6078 0.022 0.9825
#> dev_math_cat31 course 6.8860 4.3948 1.567 0.1176
#> dev_math_cat32+ courses 21.2571 4.2554 4.995 0.000000747 ***
#> bytxmstd 0.8147 0.2074 3.928 0.000094201 ***
#> f2everdoEvidence of a dropout episode -11.0109 4.8070 -2.291 0.0223 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 46 on 681 degrees of freedom
#> Multiple R-squared: 0.05485, Adjusted R-squared: 0.0493
#> F-statistic: 9.88 on 4 and 681 DF, p-value: 0.00000008951
#summary(lm(formula = f3tzpostern ~ dev_math_cat4 + bytxmstd + f2everdo, data = df_els_cc_delay))
#summary(lm(formula = f3tzpostern ~ dev_math_cat4 + bytxmstd + f2everdo, data = df_els_cc))
/2
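A brief sketch of what “holding the controls constant” means in the cred_mod2 output: the helper below (a hypothetical name, not part of the problem set) plugs values into the estimated prediction line, using coefficients copied from summary(cred_mod2). The math score of 47.7 is the sample mean reported in the descriptive statistics.

```r
# Coefficients copied from summary(cred_mod2); predict_credits is an illustrative helper
predict_credits <- function(one_course, two_plus, hs_math, dropout) {
  0.2327 + 6.8860 * one_course + 21.2571 * two_plus +
    0.8147 * hs_math - 11.0109 * dropout
}

# Two students with the same math score (47.7) and no dropout episode,
# differing only in developmental math category
p_0courses <- predict_credits(one_course = 0, two_plus = 0, hs_math = 47.7, dropout = 0)
p_2plus    <- predict_credits(one_course = 0, two_plus = 1, hs_math = 47.7, dropout = 0)

print(c(p_0courses, p_2plus))

# The gap equals the "2+ courses" coefficient, whatever values the controls take
print(p_2plus - p_0courses)
```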
/4
Note: to help you think of control variables you might add to your model, the below code shows variable labels associated with each variable (output omitted)
/2
cred_mod3 with the results of this regression model and then summarize the object cred_mod3.
Note: if you are having trouble adding variables to your model or the output looks odd, send a DM on Slack to Patricia Martín for help.
/6
cred_mod3:
/3
cred_mod2 (control variables are high school math score and ever dropped out of high school). cred_mod2 is based on the data frame df_els_cc_delay, which includes students who started at a community college and had at least a one-year delay between finishing high school and starting college. We also show the regression output from the model cred_mod2_allcc, which includes the same variables as cred_mod2 but is based on the data frame df_els_cc, which includes all students who started at community college, regardless of whether they delayed entry between high school and college. After running the models below, answer the following questions:
Questions to answer:
cred_mod2_allcc.
cred_mod2?
cred_mod2 <- lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, data = df_els_cc_delay)
summary(cred_mod2)
#>
#> Call:
#> lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo,
#> data = df_els_cc_delay)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -61.43 -32.76 -16.25 18.80 194.39
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.2327 10.6078 0.022 0.9825
#> dev_math_cat31 course 6.8860 4.3948 1.567 0.1176
#> dev_math_cat32+ courses 21.2571 4.2554 4.995 0.000000747 ***
#> bytxmstd 0.8147 0.2074 3.928 0.000094201 ***
#> f2everdoEvidence of a dropout episode -11.0109 4.8070 -2.291 0.0223 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 46 on 681 degrees of freedom
#> Multiple R-squared: 0.05485, Adjusted R-squared: 0.0493
#> F-statistic: 9.88 on 4 and 681 DF, p-value: 0.00000008951
cred_mod2_allcc <- lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo, data = df_els_cc)
summary(cred_mod2_allcc)
#>
#> Call:
#> lm(formula = f3tzpostern ~ dev_math_cat3 + bytxmstd + f2everdo,
#> data = df_els_cc)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -102.13 -48.44 -12.07 44.89 225.06
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -5.6978 8.0425 -0.708 0.479
#> dev_math_cat31 course -0.6772 3.1472 -0.215 0.830
#> dev_math_cat32+ courses 12.4269 3.0045 4.136 0.0000365950466 ***
#> bytxmstd 1.6166 0.1526 10.597 < 0.0000000000000002 ***
#> f2everdoEvidence of a dropout episode -32.4661 4.8269 -6.726 0.0000000000219 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 58.83 on 2291 degrees of freedom
#> Multiple R-squared: 0.06953, Adjusted R-squared: 0.06791
#> F-statistic: 42.8 on 4 and 2291 DF, p-value: < 0.00000000000000022
The code below shows descriptive statistics for the dichotomous variable f3tzever4yr, which identifies whether the student ever attended a 4-year institution, for our sample of students who started at a community college and had at least a one-year delay between finishing high school and starting college.
df_els_cc_delay %>% count(f3tzever4yr)
#> # A tibble: 2 x 2
#> f3tzever4yr n
#> <fct> <int>
#> 1 No 531
#> 2 Yes 155
The code below runs a model of the relationship between developmental math (\(X\)) and a dichotomous measure of whether the student ever transferred to a four-year institution (\(Y\)), controlling for high school math score and whether the student ever dropped out of high school.
transfer_mod1 <- lm(formula = as.integer(f3tzever4yr) ~ dev_math_cat3 + bytxmstd + f2everdo, data = df_els_cc_delay)
summary(transfer_mod1)
#>
#> Call:
#> lm(formula = as.integer(f3tzever4yr) ~ dev_math_cat3 + bytxmstd +
#> f2everdo, data = df_els_cc_delay)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.3926 -0.2481 -0.2028 -0.1009 0.9678
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.950634 0.095750 41.260 < 0.0000000000000002 ***
#> dev_math_cat31 course -0.012711 0.039669 -0.320 0.74875
#> dev_math_cat32+ courses 0.039814 0.038411 1.037 0.30032
#> bytxmstd 0.005881 0.001872 3.142 0.00175 **
#> f2everdoEvidence of a dropout episode -0.084423 0.043390 -1.946 0.05210 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4152 on 681 degrees of freedom
#> Multiple R-squared: 0.02134, Adjusted R-squared: 0.01559
#> F-statistic: 3.712 on 4 and 681 DF, p-value: 0.005343
/8
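One caution when reading the intercept in the transfer_mod1 output: as.integer() applied to a factor returns the position of each level, not a 0/1 indicator. An intercept near 3.95 would be consistent with f3tzever4yr carrying unused levels ahead of “No” and “Yes”, although that is an inference from the output rather than something stated in the data documentation. The sketch below uses a made-up factor (the extra level names are hypothetical) to show the behavior and one way to build a 0/1 outcome instead.

```r
# Made-up factor with unused leading levels; the level names are hypothetical
ever4yr <- factor(c("No", "Yes", "No", "No"),
                  levels = c("Missing", "Skipped", "Not applicable", "No", "Yes"))

# as.integer() on a factor gives level positions, not 0/1
codes <- as.integer(ever4yr)
print(codes)

# A 0/1 indicator built from a logical comparison
ever4yr_01 <- as.integer(ever4yr == "Yes")
print(ever4yr_01)

# The two codings differ only by a constant shift
print(unique(codes - ever4yr_01))
```

Shifting the outcome by a constant changes only the intercept of an OLS fit, so the slope estimates would be the same under either coding.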
Do the following:
Write out the population linear regression model (make sure to define variables)
Write out the OLS prediction line without estimate values
Write out the OLS prediction line with estimate values
Interpret the coefficients associated with developmental math in words
For each coefficient associated with developmental math do we reject the null hypothesis \(H_0: \beta_k = 0\) using an alpha-level (rejection region) of .05?
YOUR ANSWER HERE:
/10
What are some things we should do (e.g., different dependent variable, additional control variables in our model, additional sample restrictions) to improve the analysis so that we would feel comfortable making policy recommendations based on it?
/2
Knit to html by clicking the “Knit” button near the top of your RStudio window (icon with blue yarn ball) or drop down and select “Knit to HTML”
Cabrera, N. L., Milem, J. F., Jaquette, O., & Marx, R. (2014). Missing the (student achievement) forest for all the (political) trees: Empiricism and the Mexican American Studies controversy in Tucson. American Educational Research Journal, 51(6), 1084–1118.