Grade: /10

Overview

The purpose of this short exercise is to give you some practice with the basics of multivariate regression. In this exercise you will:

  • Run a regression model in R (code provided)
  • Write out the population linear regression model
  • Write out the OLS prediction line
    • with estimates
    • without estimates
  • Interpret the values of the regression coefficients
  • Calculate predicted values

Load libraries and dataset

# remove scientific notation
options(scipen=999)

##########
########## Libraries
##########

  library(tidyverse)
  library(labelled)
  library(haven)

##########
########## ELS:2002 data
##########

# RUN SCRIPT THAT CREATES STUDENT-LEVEL DATA FRAME CONTAINING ALL VARIABLES AND CREATES DATA FRAME WITH A SUBSET OF VARIABLES

  #NOTE: this script will take 30 seconds to a minute to run because it loads a dataset with about 16,000 observations and 4,000 variables from a website

  source(file = url('https://github.com/anyone-can-cook/educ152/raw/main/scripts/els/read_els_by_pets.R'))
    #source(file = file.path('.','..','..','scripts','els','read_els_by_pets.R'))
      #list.files(path = file.path('.','..','..','scripts','els'))

# Create a dataframe df_els_stu_fac that has categorical variables as factor class variables rather than labelled class variables
  df_els_stu_fac <- as_factor(df_els_stu, only_labelled = TRUE) %>%
    # create a version of parent income that is in $1000s
    mutate(parent_income000 = parent_income/1000)
  # convert continuous variables we know we want numeric back to numeric
  for (v in c('bytxmstd','bytxrstd','f1txmstd','f3stloanamt','f3stloanpay','f3ern2011','f3tzrectrans','f3tzreqtrans','f3tzschtotal')) {
    df_els_stu_fac[[v]] <- df_els_stu[[v]]  
  }

Run descriptive statistics and regression

Variables

  • The dependent variable is high school reading test score, bytxrstd
    • variable label: Reading test standardized score
  • The continuous independent variable is parent_income000
    • variable label: continuous measure of base year parental household income, calculated from categorical variable byincome
    • Note. This is parent income in $thousands (e.g., value of 62.5 refers to $62,500)
      • so a “one-unit” increase in this variable would be a $1,000 increase in parent income
  • The categorical independent variable is school control, bysctrl (public school, Catholic private school, or non-Catholic private school)
    • variable label: School control

Your job in this section is just to run the provided code

  • Descriptive statistics about the variables in the model
df_els_stu_fac %>% select(bytxrstd,parent_income000,bysctrl) %>% glimpse()
#> Rows: 8,910
#> Columns: 3
#> $ bytxrstd         <dbl+lbl> 56.70, 64.46, 48.69, 33.53, 40.80, 41.05, 56.33,…
#> $ parent_income000 <dbl> 87.5, 62.5, 0.5, 17.5, 62.5, 3.0, 30.0, 30.0, 150.0,…
#> $ bysctrl          <fct> Public, Public, Public, Public, Public, Public, Publ…

# dependent variable
df_els_stu_fac %>% summarize(
  mean_read_score = mean(bytxrstd, na.rm = TRUE),
  sd_read_score = sd(bytxrstd, na.rm = TRUE)
)
#> # A tibble: 1 x 2
#>   mean_read_score sd_read_score
#>             <dbl>         <dbl>
#> 1            53.3          9.30

# continuous independent variable
df_els_stu_fac %>% summarize(
  mean_parent_income = mean(parent_income000, na.rm = TRUE),
  sd_parent_income = sd(parent_income000, na.rm = TRUE)
)
#> # A tibble: 1 x 2
#>   mean_parent_income sd_parent_income
#>                <dbl>            <dbl>
#> 1               71.3             58.2

# categorical independent variable
df_els_stu_fac %>% count(bysctrl)
#> # A tibble: 3 x 2
#>   bysctrl           n
#>   <fct>         <int>
#> 1 Public         6486
#> 2 Catholic       1465
#> 3 Other private   959

  # categorical independent variable, showing value of underlying integer values
 df_els_stu_fac %>% count(as.integer(bysctrl))
#> # A tibble: 3 x 2
#>   `as.integer(bysctrl)`     n
#>                   <int> <int>
#> 1                     1  6486
#> 2                     2  1465
#> 3                     3   959
  • Run regression model
mod1 <- lm(formula = bytxrstd ~ parent_income000 + bysctrl, data = df_els_stu_fac %>% filter(f2enroll0506=='yes'))

summary(mod1)
#> 
#> Call:
#> lm(formula = bytxrstd ~ parent_income000 + bysctrl, data = df_els_stu_fac %>% 
#>     filter(f2enroll0506 == "yes"))
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -30.2574  -5.6841   0.1432   6.0089  25.5002 
#> 
#> Coefficients:
#>                       Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)          51.167236   0.167530 305.421 < 0.0000000000000002 ***
#> parent_income000      0.033481   0.001762  19.004 < 0.0000000000000002 ***
#> bysctrlCatholic       1.980870   0.270856   7.313    0.000000000000288 ***
#> bysctrlOther private  2.300603   0.329044   6.992    0.000000000002954 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 8.69 on 7315 degrees of freedom
#> Multiple R-squared:  0.07423,    Adjusted R-squared:  0.07385 
#> F-statistic: 195.5 on 3 and 7315 DF,  p-value: < 0.00000000000000022

Questions for you to answer

/3

1. Write out the population linear regression model (make sure to define which variable (e.g., “parental income”) is associated with which \(X_{ki}\) in the model; and define unit of analysis if relevant).

  • YOUR ANSWER HERE:

  • Population linear regression model (multivariate regression)

    • \(Y_i = \beta_0 + \beta_1 X_{1i}+ \beta_2 X_{2i}+ \beta_3 X_{3i}+ u_i\), where:
      • \(Y_i\): is high school reading test score for student \(i\)
      • \(X_{1i}\): (continuous) is parental income in $1,000s for student \(i\)
      • \(X_{2i}\): (dichotomous) equals 1 if student \(i\) attends a Catholic private school, 0 otherwise
      • \(X_{3i}\): (dichotomous) equals 1 if student \(i\) attends a non-Catholic private school, 0 otherwise
      • \(u_i\) is all variables that affect \(Y_i\) but were not included in our model
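
To see how the single categorical variable bysctrl becomes the two dichotomous variables \(X_{2i}\) and \(X_{3i}\), the sketch below builds a design matrix with base R's model.matrix() on a three-row toy data frame. The toy rows are made up for illustration (they are not ELS data); only the variable names and factor levels follow the output shown earlier.

```r
# Toy data frame (hypothetical rows, not taken from the ELS data)
toy <- data.frame(
  parent_income000 = c(87.5, 62.5, 150),
  bysctrl = factor(
    c("Public", "Catholic", "Other private"),
    levels = c("Public", "Catholic", "Other private")  # first level = reference group
  )
)

# Design matrix R builds for the right-hand side of the regression formula
X <- model.matrix(~ parent_income000 + bysctrl, data = toy)
print(X)
```

Because Public is the first factor level, it gets no column of its own and acts as the reference group; the columns bysctrlCatholic and bysctrlOther private play the roles of \(X_{2i}\) and \(X_{3i}\).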

/2

2. Write out the OLS prediction line without estimate values and write out the OLS prediction line with estimate values.

YOUR ANSWER HERE:

  • OLS prediction line without estimates
    • \(\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1}X_{1i} + \hat{\beta_2} X_{2i}+ \hat{\beta_3} X_{3i}\)
  • OLS prediction line with estimates
    • \(\hat{Y_i} =\) 51.17 + 0.03 \(\times X_{1i}\) + 1.98 \(\times X_{2i}\) + 2.3 \(\times X_{3i}\)
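
The rounded values in the prediction line come from the Estimate column of summary(mod1). A minimal sketch, with the four estimates typed in by hand from that output (the vector name b_hat is just an illustrative choice):

```r
# Estimates hand-copied from the summary(mod1) output above
b_hat <- c(
  "(Intercept)"          = 51.167236,
  "parent_income000"     = 0.033481,
  "bysctrlCatholic"      = 1.980870,
  "bysctrlOther private" = 2.300603
)

# Round to two decimals, as in the prediction line with estimates
b_rounded <- round(b_hat, 2)
print(b_rounded)
```

With the fitted model in memory, coef(mod1) should return the same named vector without retyping the numbers.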

/3

3. Interpret the value of regression coefficients \(\hat{\beta_1}\), \(\hat{\beta_2}\), and \(\hat{\beta_3}\) in words.

YOUR ANSWER HERE:

  • \(X_{1i}\) (parental income in $ thousands); \(\hat{\beta_1}=\) 0.03
    • a one thousand dollar increase in parental income is, on average, associated with a 0.03 point increase in high school reading test score, holding the value of all other \(X\) variables constant.
  • \(X_{2i}\) (attends a Catholic private school); \(\hat{\beta_2}=\) 1.98
    • attending a Catholic private school as opposed to a public school is, on average, associated with a 1.98 point increase in high school reading test score, holding the value of all other \(X\) variables constant.
  • \(X_{3i}\) (attends a non-Catholic private school); \(\hat{\beta_3}=\) 2.3
    • attending a non-Catholic private school as opposed to a public school is, on average, associated with a 2.3 point increase in high school reading test score, holding the value of all other \(X\) variables constant.
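
One way to check these interpretations is to compare predicted values for two hypothetical students who differ in only one \(X\). The sketch below hand-copies the estimates from summary(mod1); predict_read() is an illustrative helper written for this example, not a function from the assignment, and the income value of 50 is arbitrary.

```r
# Estimates hand-copied from summary(mod1): intercept, income, Catholic, other private
b_hat <- c(51.167236, 0.033481, 1.980870, 2.300603)

# Illustrative helper: predicted reading score from the OLS prediction line
predict_read <- function(income000, catholic = 0, other_private = 0) {
  b_hat[1] + b_hat[2] * income000 + b_hat[3] * catholic + b_hat[4] * other_private
}

# Same school type, income differs by one unit ($1,000): difference equals beta_1 hat
diff_income <- predict_read(51) - predict_read(50)

# Same income, Catholic vs. public: difference equals beta_2 hat
diff_catholic <- predict_read(50, catholic = 1) - predict_read(50)

# Same income, other private vs. public: difference equals beta_3 hat
diff_other <- predict_read(50, other_private = 1) - predict_read(50)

print(c(diff_income, diff_catholic, diff_other))
```

Each difference reproduces one coefficient, which is what "holding the value of all other \(X\) variables constant" means in practice.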

/1

4. Interpret the value of the regression coefficients \(\hat{\beta_0}\) in words.

YOUR ANSWER HERE:

  • \(\hat{\beta_0}=\) 51.17
  • The estimated average high school reading test score for a student with parental income of $0 who attended a public high school is 51.17

/1

5. Calculate the predicted high school reading test score for a student who attended a non-Catholic private school and who has parent_income000 = 150 (i.e., $150,000); show your work

YOUR ANSWER HERE:

  • \(\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1}\times 150 + \hat{\beta_2} \times 0 + \hat{\beta_3} \times 1\)
  • \(\hat{Y_i} =\) 51.17 + 0.03 \(\times 150\) + 2.3 \(\times 1=\) 57.97
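
The hand calculation above uses coefficients rounded to two decimals. As a check, the sketch below repeats it in R and also uses the unrounded estimates hand-copied from summary(mod1). The two answers differ by about half a point because the rounding of 0.033481 to 0.03 is multiplied by 150.

```r
# Estimates hand-copied from summary(mod1): intercept, income, Catholic, other private
b_hat <- c(51.167236, 0.033481, 1.980870, 2.300603)

# Student profile: intercept term, income of 150 ($150,000), not Catholic, other private
x <- c(1, 150, 0, 1)

# Prediction with coefficients rounded to two decimals (matches the hand calculation)
y_hat_rounded <- sum(round(b_hat, 2) * x)

# Prediction with the unrounded coefficients
y_hat_full <- sum(b_hat * x)

print(round(c(rounded = y_hat_rounded, full = y_hat_full), 2))
```

With the fitted model in memory, predict(mod1, newdata = data.frame(parent_income000 = 150, bysctrl = "Other private")) should give the unrounded value directly.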

Knit to html and submit exercise

Knit to HTML by clicking the “Knit” button near the top of your RStudio window (icon with blue yarn ball), or open its drop-down menu and select “Knit to HTML”

  • Go to the class website and under the “Readings & Assignments” >> “Week 9” tab, click on the “Short exercise submission link”
  • Submit both your html and .Rmd files
  • Use this naming convention “lastname_firstname_se” for your .Rmd (e.g. martin_patricia_se.Rmd)