Grade: /50
In this problem set, you will work with data from the Tennessee Student Teacher Achievement Ratio (STAR) project. Tennessee STAR was a massive experiment that sought to identify the effect of class size on student learning. Elementary school children were randomly assigned to one of three kinds of classrooms: small class size; regular class size; and regular class size with a teacher aide. In the lecture on causal inference and comparing two groups, we deleted student observations assigned to the “regular class size with a teacher aide” condition so that you could compare “small class size” (treatment group) to “regular class size” (control group). In this problem set, we will delete observations assigned to the “regular class size” condition and you will compare “small class size” (treatment group) to “regular class size plus teacher aide” (control group). In addition to variables about random assignment to classroom, the Tennessee STAR data contains categorical and continuous variables about the characteristics of students and their teachers. We will use the continuous variable “years of teacher experience” to run a regression that examines the relationship between years of teacher experience (\(X\)) and Kindergarten reading score (\(Y\)).
The problem set is divided into three parts:
If you have any questions about the problem set, please post them on the #problemsets Slack channel.
Some questions will ask you to write out notation and/or equations.
You can write out notation/equations one of two ways: (1) using “inline equations,” which begin with a dollar sign $ and end with a dollar sign $; OR (2) you can write out the notation/equation in plain text, without dollar signs. We encourage you to try inline equations, but it is fine if you do not.
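As a small illustration of option (1), a sentence typed like the one below (the sentence itself is only an example) should render the estimator as \(\bar{Y}_{treatment} - \bar{Y}_{control}\) once the document is knit:

```latex
The estimator is $\bar{Y}_{treatment} - \bar{Y}_{control}$.
```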
Tips on writing notation/equations using “inline equations”:
\beta: \(\beta\)
\mu: \(\mu\)
\beta_1: \(\beta_1\)
\mu_Y: \(\mu_Y\)
For a “hat,” use \hat{} like this:
\hat{\beta}: \(\hat{\beta}\)
\hat{\beta}_1 (note that the subscript is not within the “hat”): \(\hat{\beta}_1\)
For a “bar,” use \bar{} like this:
\bar{Y}: \(\bar{Y}\)
Tips on writing notation/equations in plain text
Please run the code in the following chunk, which does the following:
Note: code chunk omitted from html document using include = FALSE
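Because the setup chunk is hidden, the following is only a minimal sketch of the kind of steps it might perform, written in base R on a small made-up data frame. The name `df_toy` and all of its values are hypothetical. The coding of `star` (2 = small, 3 = regular+aide) and of `treatment` (1 = small, 0 = regular+aide) follows the tabulations shown in this problem set; the code 1 for “regular class size” is an assumption.

```r
# Hypothetical toy data; NOT the real STAR file.
# Assumed coding of star: 1 = regular, 2 = small, 3 = regular+aide.
df_toy <- data.frame(
  id   = 1:6,
  star = c(1, 2, 3, 2, 1, 3),
  read = c(430, 447, 439, 450, 425, 437)
)

# Drop students assigned to the "regular class size" condition.
df_toy <- df_toy[df_toy$star != 1, ]

# treatment = 1 for small classes, 0 for regular class + teacher aide.
df_toy$treatment <- ifelse(df_toy$star == 2, 1, 0)

# Cross-tabulate to confirm that treatment was created correctly.
table(star = df_toy$star, treatment = df_toy$treatment)
```

The real chunk presumably also loads packages and reads the data file; those details are not shown in this document, so they are left out here.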
Run basic frequency tabulations on the variables star and treatment
# frequency tabulation of the original classroom assignment variable named star
df_stark %>% count(star)
#> star n
#> 1 2 1739
#> 2 3 2044
df_stark %>% count(star) %>% as_factor()
#> star n
#> 1 small 1739
#> 2 regular+aide 2044
# frequency tabulation of the variable named treatment, which we created from star
df_stark %>% count(treatment)
#> treatment n
#> 1 0 2044
#> 2 1 1739
# two-way frequency tabulation of star and treatment
# basically, we run this to make sure that we created the variable treatment correctly
df_stark %>% group_by(treatment) %>% count(star)
#> # A tibble: 2 x 3
#> # Groups: treatment [2]
#> treatment star n
#> <dbl> <dbl+lbl> <int>
#> 1 0 3 [regular+aide] 2044
#> 2 1 2 [small] 1739
#df_stark %>% group_by(star) %>% count(treatment)
# compare mean reading score by treatment status
df_stark %>% group_by(treatment) %>% summarize(
n = n(),
n_nonmiss_read = sum(!is.na(read)),
read_mean = mean(read, na.rm = TRUE)
)
#> # A tibble: 2 x 4
#> treatment n n_nonmiss_read read_mean
#> <dbl> <int> <int> <dbl>
#> 1 0 2044 2044 435.
#> 2 1 1739 1739 441.
We introduce the following notation:
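In the data frame df_stark, each observation \(i\) represents a kindergarten student
In the data frame df_stark, the variable read (kindergarten reading score)
In the data frame df_stark, the variable treatment (1 = small class; 0 = regular class + teacher aide)
/1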
/2
/1
/2
/2
/2
/1
/2
Assume we know treated \(Y_i(1)\) and untreated \(Y_i(0)\) potential outcomes for all \(i\). You can fill in your answer by replacing the ? mark.
| \(i\) | \(Y_i(1)\) Treated | \(Y_i(0)\) Untreated | \(\tau_i\) Unit effect |
|---|---|---|---|
| 1 | 65 | 60 | ? |
| 2 | 30 | 35 | ? |
| 3 | 25 | 30 | ? |
| 4 | 80 | 70 | ? |
| 5 | 45 | 45 | ? |
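As a reminder of the standard definitions, the unit-level effect is \(\tau_i = Y_i(1) - Y_i(0)\), and the average treatment effect is the mean of the \(\tau_i\) across units. The sketch below applies these definitions to three made-up units; the data frame `po` and its values are hypothetical and deliberately differ from the table above.

```r
# Hypothetical potential outcomes for three units (not the table above).
po <- data.frame(
  i  = 1:3,
  y1 = c(50, 40, 70),  # Y_i(1): outcome if treated
  y0 = c(45, 42, 60)   # Y_i(0): outcome if untreated
)

# Unit-level effect: tau_i = Y_i(1) - Y_i(0)
po$tau <- po$y1 - po$y0

# Average treatment effect: mean of the unit-level effects
ate <- mean(po$tau)

po
ate
```

This calculation is only possible when both potential outcomes are known for every unit, which is never the case with real data.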
/2
/2
/3
Below are the first 10 observations of df_stark. Calculate the value of \(\hat{ATE}\) using the “difference in means” estimator for these 10 observations.
df_stark %>% select(id, treatment, read) %>% head(10)
#> id treatment read
#> 1 943 1 447
#> 2 986 1 450
#> 3 1263 0 439
#> 4 2020 1 447
#> 5 2241 0 395
#> 6 3219 1 478
#> 7 3455 1 455
#> 8 3884 0 437
#> 9 4273 1 474
#> 10 4377 1 424
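The “difference in means” estimator is the mean outcome among treated observations minus the mean outcome among control observations, \(\bar{Y}_{treatment} - \bar{Y}_{control}\). A minimal base R sketch on made-up data follows; the data frame `obs` and its values are hypothetical, not the 10 observations above.

```r
# Hypothetical observed data: each unit reveals only one potential outcome.
obs <- data.frame(
  treatment = c(1, 0, 1, 0, 1),
  y         = c(52, 44, 48, 40, 50)
)

# Mean outcome in each group
mean_treated <- mean(obs$y[obs$treatment == 1])
mean_control <- mean(obs$y[obs$treatment == 0])

# Difference-in-means estimate of the ATE
ate_hat <- mean_treated - mean_control
ate_hat
```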
/1
In the data frame df_stark, the variable lunch identifies whether the student qualifies for free lunch (variable coded as: 1=non-free; 2=free). This variable was used as an indicator of household income because low-income students were eligible for free lunch at school. Below, we give a frequency distribution of lunch. We also show mean reading score by lunch, which shows that students in the “non-free” lunch group have higher average reading scores than students in the “free” lunch group. Now, consider our treatment variable (1=small class; 0 = regular class + teacher aide). Imagine that, instead of being randomly assigned, students/parents self-selected into values of the treatment. Why might we be concerned that our estimator \(\bar{Y}_{treatment} - \bar{Y}_{control}\) does not capture the true average treatment effect?
# frequency count of lunch
df_stark %>% count(lunch)
#> lunch n
#> 1 1 1932
#> 2 2 1838
#> 3 NA 13
df_stark %>% count(lunch) %>% as_factor()
#> lunch n
#> 1 non-free 1932
#> 2 free 1838
#> 3 <NA> 13
# mean reading score by lunch
df_stark %>% group_by(lunch) %>% summarize(
mean_read = mean(read, na.rm = TRUE)
)
#> # A tibble: 3 x 2
#> lunch mean_read
#> <dbl+lbl> <dbl>
#> 1 1 [non-free] 446.
#> 2 2 [free] 429.
#> 3 NA 434.
/1
lunch) affects our ability to estimate the average treatment effect?
/5
0.05.
Research Question:
/3
Below is a plot, created with the function plot_t_distribution(), of the sampling distribution assuming that \(H_0: \mu_{treatment} = \mu_{control}\) is true. Explain in your own words what is happening in this plot, and explain what the different statistics, dotted lines, and shaded areas mean.
#t.test(formula = read ~ treatment, data = df_stark)
plot_t_distribution(data_df = df_stark, data_var = 'read', group_var = 'treatment', group_cat = c(1, 0), shade_pval = TRUE)
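The function plot_t_distribution() is a course-specific helper whose internals are not shown in this document. As a rough companion, the sketch below uses base R's t.test() (the function in the commented-out line above) on simulated data and reproduces its t-statistic by hand. The data frame `sim` and its values are hypothetical. t.test() defaults to the Welch (unequal variances) test, and it is an assumption that the helper behaves the same way.

```r
set.seed(123)  # arbitrary seed so the simulated data are reproducible

# Simulated scores for two hypothetical groups (not the STAR data).
sim <- data.frame(
  treatment = rep(c(0, 1), each = 50),
  read      = c(rnorm(50, mean = 435, sd = 30), rnorm(50, mean = 441, sd = 30))
)

# Welch two-sample t-test of H0: the two group means are equal.
tt <- t.test(read ~ treatment, data = sim)

# The same t-statistic by hand: difference in means over its standard error.
# t.test() orders groups by level, so the difference is group 0 minus group 1.
y0 <- sim$read[sim$treatment == 0]
y1 <- sim$read[sim$treatment == 1]
se <- sqrt(var(y0) / length(y0) + var(y1) / length(y1))
t_by_hand <- (mean(y0) - mean(y1)) / se

tt$statistic  # t-statistic from t.test()
t_by_hand     # should match the value above
tt$p.value    # two-sided p-value, compared against the chosen alpha level
```

Note that the sign of the t-statistic depends on which group is subtracted from which; the call above uses group_cat = c(1, 0), which may order the groups differently from this sketch.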
Consider the research question: what is the relationship between teacher years of teaching experience (\(X\)) and kindergarten reading score (\(Y\))?
In the data frame df_stark, teacher years of experience is measured by the variable experience.
Below is a scatterplot of the relationship between teacher years of experience (\(X\)) and kindergarten reading score (\(Y\)). We have also added a linear ordinary least squares (OLS) prediction line.
df_stark %>% ggplot(aes(x=experience, y=read)) + geom_point() + stat_smooth(method = 'lm')
/3
Using the lm() function, create an object named mod1 that contains results from the bivariate regression of the relationship between years of teaching experience (\(X\)) and Kindergarten reading score (\(Y\)). Apply the summary() function to the object mod1 to print a summary of these regression results.
/6
*Note: you can always use this general approach for interpreting \(\hat{\beta}_1\) in words:
“On average, a one-unit increase in \(X\) is associated with a \(\hat{\beta}_1\) increase (or decrease if \(\hat{\beta}_1\) is negative) in the value of \(Y\)”
When interpreting \(\hat{\beta}_1\) in words, replace the generic “one-unit increase in \(X\)” with text that is specific to your analysis (e.g., “a one-year increase in years of teacher experience \(X\)”), and do the same thing for “the value of \(Y\).”
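To show the mechanics on something other than the assignment data, here is a minimal sketch of a bivariate regression with lm() and summary() on simulated data. The variable names `hours_studied` and `test_score`, the data frame `sim_df`, and the object `mod_sim` are all hypothetical and unrelated to df_stark.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated data with hypothetical variable names (not from df_stark).
n <- 200
hours_studied <- runif(n, min = 0, max = 10)
test_score    <- 60 + 2 * hours_studied + rnorm(n, sd = 5)
sim_df <- data.frame(hours_studied, test_score)

# Bivariate OLS regression of Y (test_score) on X (hours_studied).
mod_sim <- lm(test_score ~ hours_studied, data = sim_df)
summary(mod_sim)

# Point estimates: intercept (beta_0 hat) and slope (beta_1 hat).
b <- coef(mod_sim)
b

# Generic interpretation template filled in with the estimated slope.
cat(sprintf(
  "On average, a one-hour increase in hours studied is associated with a %.2f-point change in test score.\n",
  b["hours_studied"]
))
```

Because the data are simulated with a true slope of 2, the estimated slope should land close to that value; with real data the true slope is unknown.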
YOUR ANSWER HERE:
/2
/7
lm() and summary()); write out the population linear regression model; write out the OLS prediction line (without estimate values); write out the OLS prediction line (with estimate values); interpret the point estimate value of \(\hat{\beta}_1\) in words.
*Note: you can always use the general approach given above for interpreting \(\hat{\beta}_1\) in words.
/2
Knit to HTML by clicking the “Knit” button near the top of your RStudio window (icon with blue yarn ball), or open the drop-down menu and select “Knit to HTML.”