Grade: /35

Overview

In this problem set, you will be working with the IPEDS dataset from lecture. You will be given the code to create several data frames and plotting functions that we will use throughout the problem set. Part I will give you practice working with distributions and sampling distributions, and Part II will give you practice with hypothesis testing. Both parts include a mix of coding questions and conceptual questions that require a written response. Part III will ask you to post on the class #problemsets channel about something you learned, or to reply to another student’s post.

If you have any questions about the problem set, please also post them on the #problemsets Slack channel.

Create data and load functions

Please run the code in the following chunk, which does the following:

  • Loads libraries
  • Loads and creates IPEDS data frame (population)
  • Creates data frame of generated variables (population)
  • Creates sample versions of the IPEDS and generated data frames

Note: code chunk omitted from html document using include = FALSE

Part I: Distributions and sampling distribution

/5

1. Use the function plot_distribution(), which we created in the code chunk above, to plot the distribution of the variable norm_dist from the data frame df_generated_pop.

plot_distribution(data_vec = df_generated_pop$norm_dist, plot_title = "Distribution of variable norm_dist")

What is the standard deviation? Interpret this value in words.

  • YOUR ANSWER HERE: The standard deviation is 4.99, which can be read as follows: on average, observations are 4.99 units away from the mean of 49.99.

Does the distribution above have a normal, left-skewed, or right-skewed shape? Why?

  • YOUR ANSWER HERE: The distribution is a normal distribution. We can infer this because the distribution is symmetrical and gives us a “bell shape” and the mean and median are almost the same (e.g. mean = 49.99 and median = 50.01).

What is the “empirical rule”? Drawing from the empirical rule, what percentage of observations in the above distribution have values between 45 and 55? between 40 and 60? between 35 and 65?

  • Note: Make sure you answer all parts of the question.

  • YOUR ANSWER HERE: The empirical rule states that if a variable is approximately normally distributed, then about 68% of observations fall within one standard deviation of the mean, about 95% fall within two standard deviations of the mean, and about 99.7% fall within three standard deviations of the mean. This is important because it tells us how likely we are to observe a value that is a certain number of standard deviations away from the mean (for an approximately normally distributed variable). Drawing from this rule, with a mean of about 50 and a standard deviation of about 5, we can infer that about 68% of observations in the above distribution have values between 45 and 55, about 95% have values between 40 and 60, and about 99.7% have values between 35 and 65.
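
The percentages in the empirical rule can be checked directly against the normal distribution. The following is a minimal sketch using base R's pnorm(); the values mu = 50 and sigma = 5 are assumptions, rounded from the mean (49.99) and standard deviation (4.99) reported above.

```r
# Share of a normal distribution that lies within k standard deviations of the mean
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
round(within_k_sd, 4)

# Same check on the scale of norm_dist
# (assumed values: mean 50 and sd 5, rounded from 49.99 and 4.99)
mu <- 50
sigma <- 5
lower <- mu - k * sigma   # 45, 40, 35
upper <- mu + k * sigma   # 55, 60, 65
share_between <- pnorm(upper, mean = mu, sd = sigma) -
  pnorm(lower, mean = mu, sd = sigma)
data.frame(lower, upper, share_between = round(share_between, 4))
```

Because the rule depends only on the number of standard deviations from the mean, both calculations return the same three shares.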

/2

2. Use the function we created plot_distribution() to plot the distribution of the variable tuitfee_grad_nres from the data frame df_ipeds_pop.

  • Note: the data frame df_ipeds_pop contains data on the entire population of research/master’s universities, whereas the data frame df_ipeds_sample contains data on a random sample of universities from that population.
plot_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, plot_title = "")

Does this variable appear to have a normal, left-skewed, or right-skewed distribution? Why?

  • YOUR ANSWER HERE: The variable tuitfee_grad_nres appears to have a right-skewed distribution. It is right-skewed because the right tail is longer due to the presence of positive outliers and as such these outliers increase the value of the mean and therefore our mean is higher than our median (e.g., mean > median). From the distribution we can see that the median tuition + fees for out-of-state or nonresident graduate students is about 17552. However, there are some observations that far exceed that and are closer to 60K.

/2

3. Use the function we created plot_distribution() to plot the distribution of the variable tuitfee_grad_nres from the data frame df_ipeds_sample

  • Note: the data frame df_ipeds_pop contains data on the entire population of research/master’s universities, whereas the data frame df_ipeds_sample contains data on a random sample of universities from that population.
plot_distribution(data_vec = df_ipeds_sample$tuitfee_grad_nres, plot_title = "")

Does this variable appear to have a normal, left-skewed, or right-skewed distribution? Why?

  • YOUR ANSWER HERE: The variable tuitfee_grad_nres from df_ipeds_sample appears to have a right-skewed distribution. It is right-skewed because the right tail is longer due to the presence of positive outliers and as such these outliers increase the value of the mean and therefore our mean is higher than our median (e.g., mean > median). From the distribution we can see that the median tuition + fees for out-of-state or nonresident graduate students is about 16846. However, there are some observations that far exceed that and are closer to 50-60K.

/2

4. What is a sampling distribution? What is a sampling distribution of a sample mean?

  • YOUR ANSWER HERE: The sampling distribution is the distribution of a sample statistic (e.g., mean, median, min, max) that we obtain from a number of random samples of size n that we get from the population. The sampling distribution of a sample mean is the frequency distribution where each observation is the sample mean of a single random sample from a population.

/6

5. Run the following code, which does the following:

  • Takes 1000 random samples of sample size n=200 from the data frame df_ipeds_pop.
  • For each random sample, calculates the sample mean of variable tuitfee_grad_nres.
  • Plots the sampling distribution of the sample mean of variable tuitfee_grad_nres.
set.seed(124)
get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 200) %>%
  plot_distribution(plot_title = "Sampling Distribution of the Sample mean of out-of-state graduate tuition and fees")


#same as above
#plot_distribution(get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 200),plot_title = "sampling distribution of sample mean of tuitfee_grad_nres")

Answer the following questions with respect to the above plot (one sentence or less for each answer):

  • What does each observation in the above plot represent?
    • YOUR ANSWER HERE: a sample mean from one random sample
  • Would you describe the shape of the above distribution as (approximately) normal, left-skewed, or right-skewed?
    • YOUR ANSWER HERE: normal
  • Define what the concept “standard error” means (referring to the sampling distribution of the sample mean).
    • YOUR ANSWER HERE: The standard error refers to a sampling distribution and is the average distance between a sample mean from one random sample and the mean of all sample means.
  • Why are the concepts “standard error” and “standard deviation of the sampling distribution” equivalent?
    • YOUR ANSWER HERE: The standard error is the standard deviation where each observation is a sample mean as opposed to a single data point.
  • Interpret the value of standard error in the above plot in words
    • YOUR ANSWER HERE: On average a sample mean from one random sample is about 581 away from the mean of all sample means.
  • Write the formula for sample standard error and state what each component of the formula refers to (e.g., n refers to sample size)
    • YOUR ANSWER HERE: The sample standard deviation is the average distance between a random observation and the sample mean. To get the sample standard deviation, we sum the squared differences between each observation \(Y_i\) and the sample mean \(\overline{Y}\), divide by the sample size minus 1 (\(n-1\)), and take the square root. The sample standard error of the sample mean is the average distance between one random sample mean and the mean of all sample means. To get the sample standard error, we divide the sample standard deviation by the square root of the sample size \(n\).
      • Sample standard deviation = \(\hat{\sigma}_Y = \sqrt{\frac{\sum_{i=1}^n (Y_i - \overline{Y})^2}{n-1}}\)
      • Sample standard error = \(\hat{\sigma}_{\bar{Y}} = \hat{\sigma}_{Y}/\sqrt{n}\)
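
As an illustration of the two formulas above, here is a minimal sketch that computes the sample standard deviation and the sample standard error by hand and compares the former with R's built-in sd(). The vector y is a hypothetical simulated sample, not one of the problem set's data frames.

```r
set.seed(2)

# Hypothetical sample of n = 200 observations (simulated for illustration only)
y <- rnorm(200, mean = 50, sd = 5)
n <- length(y)

# Sample standard deviation: square root of the sum of squared deviations over (n - 1)
sd_manual <- sqrt(sum((y - mean(y))^2) / (n - 1))

# Sample standard error of the mean: sample standard deviation over sqrt(n)
se_manual <- sd_manual / sqrt(n)

# The manual standard deviation agrees with base R's sd()
all.equal(sd_manual, sd(y))
c(sd = sd_manual, se = se_manual)
```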

/2

6. Run the following code, which does the following:

  • Takes 1000 random samples of sample size n=20 from the data frame df_ipeds_pop
  • For each random sample, calculates the sample mean of variable tuitfee_grad_nres
  • Plots the sampling distribution of the sample mean of variable tuitfee_grad_nres
set.seed(124)
get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 20) %>%
  plot_distribution(plot_title = "Sampling distribution of sample mean of tuitfee_grad_nres")

#,plot_title = 'Sampling distribution')

Answer the following questions with respect to the above plot (one sentence or less for each answer):

  • Interpret the value of standard error in words
    • YOUR ANSWER HERE: On average, the sample mean from one random sample is 2112 away from the mean of all sample means.
  • Why is the standard error from this sampling distribution (each sample has sample size n=20) larger than the sampling distribution from the previous example (each sample has sample size n=200)?
    • YOUR ANSWER HERE: The standard error from the sampling distribution with a sample size of 20 is larger because the sample size for each random sample is smaller. The standard error is the standard deviation divided by the square root of the sample size, so the larger our sample size, the smaller our standard error. A smaller standard error means our estimates are more precise.
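
The relationship between sample size and standard error can be sketched with a small simulation. This example does not use df_ipeds_pop or get_sampling_distribution(); the population vector and the helper se_for_n() are hypothetical stand-ins, and the gamma-distributed population is only an assumed example of a right-skewed variable.

```r
set.seed(3)

# Hypothetical right-skewed population (a stand-in, not the IPEDS data)
population <- rgamma(10000, shape = 2, scale = 5000)

# Hypothetical helper: standard deviation of the sample means from
# num_samples random samples of size n (an empirical standard error)
se_for_n <- function(n, num_samples = 1000) {
  sd(replicate(num_samples, mean(sample(population, size = n))))
}

se_20 <- se_for_n(20)
se_200 <- se_for_n(200)

# The ratio of the two standard errors should be close to sqrt(200 / 20)
c(se_n20 = se_20, se_n200 = se_200, ratio = se_20 / se_200)
sqrt(200 / 20)
```

Multiplying the sample size by 10 divides the standard error by roughly the square root of 10, not by 10.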

/2

7. Run the following code, which does the following:

  • Plots the population distribution of the variable tuitfee_grad_nres
  • Plots the distribution of the variable tuitfee_grad_nres from one sample
  • Plots the sampling distribution of the sample mean for the variable tuitfee_grad_nres
set.seed(124)
plot_distribution(df_ipeds_pop$tuitfee_grad_nres, plot_title = 'Population distribution') +
  plot_distribution(df_ipeds_sample$tuitfee_grad_nres, plot_title = 'Single sample distribution') +
  plot_distribution(get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 200),plot_title = "sampling distribution of sample mean of tuitfee_grad_nres") +
  plot_layout(ncol = 1)

State the central limit theorem in your own words and explain why it is important for hypothesis testing

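  • YOUR ANSWER HERE: The central limit theorem states that, when the sample size is sufficiently large, the sampling distribution of the sample mean is approximately normal, even if the variable itself is not normally distributed in the population. This is important when conducting hypothesis tests about a population parameter (e.g., about a population mean, about a population regression coefficient), which are based on the sampling distribution of the relevant sample statistic. If the sampling distribution is normal, then we know the percentage of sample statistics that fall within a certain number of standard deviations (standard errors) of the mean.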
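
The central limit theorem can be illustrated by comparing the skewness of a population with the skewness of its sampling distribution. This is a minimal sketch under stated assumptions: the population is a hypothetical simulated right-skewed variable, and skewness() is a small helper defined here rather than a function from the problem set.

```r
set.seed(4)

# Hypothetical right-skewed population (a stand-in, not the IPEDS data)
population <- rgamma(10000, shape = 2, scale = 5000)

# Hypothetical helper: moment-based skewness (0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Sampling distribution of the sample mean: 1000 samples of size n = 200
sample_means <- replicate(1000, mean(sample(population, size = 200)))

skew_pop <- skewness(population)
skew_means <- skewness(sample_means)

# The population is clearly skewed; the sample means are close to symmetric
round(c(population = skew_pop, sample_means = skew_means), 2)
```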

Part II: Hypothesis testing

In this section we will be testing a hypothesis about the variable off-campus room and board (roomboard_off).

Here is how IPEDS defines concepts related to room and board and other expenses, from the IPEDS “Student Charges for Full Academic Year” 2019-20 academic year data dictionary [LINK]:

  • “Room charges”
    • The charges for an academic year for rooming accommodations for a typical student sharing a room with one other student.
  • “Board charges”
    • The charge for an academic year for meals, for a specified number of meals per week.
  • “Other expenses”
    • The amount of money (estimated by the financial aid office) needed by a student to cover expenses such as laundry, transportation, entertainment, and furnishings. (For the purpose of this survey room and board and tuition and fees are not included.)
  • Note that most of these variables seem to be defined for an academic year rather than a 12-month calendar year.

Here, we have included some code to help you get to know the data. Just run this code and take a look at the output.

Print observations for UC campuses

df_ipeds_pop %>%
  # keep UC campuses
  filter(unitid %in% c(110398,110635,110644,110653,110662,110671,110680,110699,110705,110714,445188,110699,110398)) %>%
  select(instnm,city,locale,roomboard_off,oth_expense_off) %>% as_factor()
#> # A tibble: 9 x 5
#>   instnm                    city       locale      roomboard_off oth_expense_off
#>   <chr>                     <chr>      <fct>               <dbl>           <dbl>
#> 1 University of California… Berkeley   City: Mids…         14771            5359
#> 2 University of California… Davis      Suburb: Sm…         10588            4856
#> 3 University of California… Irvine     City: Large         12861            5184
#> 4 University of California… Los Angel… City: Large         14303            5126
#> 5 University of California… Riverside  City: Large         10986            4792
#> 6 University of California… La Jolla   City: Large         13681            4760
#> 7 University of California… Santa Bar… Suburb: Mi…         12818            6045
#> 8 University of California… Santa Cruz City: Small         13216            5442
#> 9 University of California… Merced     Rural: Fri…          8595            4909

The variable locale categorizes universities by city/suburb/town/rural and by city size

#df_ipeds_pop %>% count(locale)
df_ipeds_pop %>% count(locale) %>% as_factor()
#> # A tibble: 12 x 2
#>    locale              n
#>    <fct>           <int>
#>  1 City: Large       254
#>  2 City: Midsize     142
#>  3 City: Small       147
#>  4 Suburb: Large     199
#>  5 Suburb: Midsize    25
#>  6 Suburb: Small      27
#>  7 Town: Fringe       25
#>  8 Town: Distant      84
#>  9 Town: Remote       66
#> 10 Rural: Fringe      18
#> 11 Rural: Distant      8
#> 12 Rural: Remote       4

Average cost of off-campus room & board

mean(df_ipeds_pop$roomboard_off, na.rm = TRUE)
#> [1] 10639.56

#alternative approach for calculating mean room and board
df_ipeds_pop %>% summarize(mean_roomboard_off = mean(roomboard_off, na.rm = TRUE))
#> # A tibble: 1 x 1
#>   mean_roomboard_off
#>                <dbl>
#> 1             10640.

Average cost of off-campus room & board, separately for each value of locale

df_ipeds_pop %>% group_by(locale) %>% #creates a separate group for each locale
  summarize(
    sample_size = n(), #gets the count of colleges in each locale
    mean_roomboard_off = mean(roomboard_off, na.rm = TRUE) #calculate the mean room and board costs for each locale
    ) %>% as_factor() #return as factor
#> # A tibble: 12 x 3
#>    locale          sample_size mean_roomboard_off
#>    <fct>                 <int>              <dbl>
#>  1 City: Large             254             11821.
#>  2 City: Midsize           142             10166.
#>  3 City: Small             147             10205.
#>  4 Suburb: Large           199             11123.
#>  5 Suburb: Midsize          25             11034.
#>  6 Suburb: Small            27             10597.
#>  7 Town: Fringe             25              9532.
#>  8 Town: Distant            84              8975.
#>  9 Town: Remote             66              9516.
#> 10 Rural: Fringe            18              9405.
#> 11 Rural: Distant            8             10308.
#> 12 Rural: Remote             4              8845

/5

1. What are the five steps in hypothesis testing? For each step, provide a one-sentence description.

  • YOUR ANSWER HERE:
      1. Hypothesis - Formally state the null and alternative hypotheses.
      2. Assumptions - State the assumptions required by the statistical test; if they are met, we can make an inference about the population parameter by applying the test to the sample data.
      3. Test statistic - Calculate a statistic from the sample data that measures how far the sample estimate is from the value stated in the null hypothesis.
      4. p-value - Calculate the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one we calculated.
      5. Alpha level - Decide on the alpha level (e.g., .05) before running the analysis, then compare it with the p-value to reach a conclusion about the hypothesis test.

/2

2. Hypothesis testing steps

In the below questions, you will conduct hypothesis testing steps to answer the research question, “Is the population mean of off-campus room & board equal to $10,000?” You will be using the variable roomboard_off from the data frame df_ipeds_sample, which is a single random sample from the population data frame df_ipeds_pop. You will use a two-sided alternative hypothesis with an alpha level (rejection region) of .05.

  • State the null and alternative (two-sided) hypothesis
    • YOUR ANSWER HERE:
  • Null hypothesis, \(H_0\)
    • \(H_0: \mu_Y = \mu_{Y0} = \$10,000\)
    • \(H_0:\) population mean price of off-campus room & board is $10,000
  • Alternative hypothesis, \(H_a\)
    • \(H_a: \mu_Y \ne \$10,000\)
    • \(H_a:\) population mean price of off-campus room & board is not equal to $10,000

/1

3. Use the t.test() function to calculate the test statistic

t.test(x = df_ipeds_sample$roomboard_off, mu = 10000)
#> 
#>  One Sample t-test
#> 
#> data:  df_ipeds_sample$roomboard_off
#> t = 1.8462, df = 199, p-value = 0.06635
#> alternative hypothesis: true mean is not equal to 10000
#> 95 percent confidence interval:
#>   9973.373 10808.527
#> sample estimates:
#> mean of x 
#>  10390.95

/4

4. Use the function plot_t_distribution() we created above to plot the sampling distribution under the assumption that \(H_0\) is true.

plot_t_distribution(df_ipeds_sample$roomboard_off, mu = 10000)

  • Interpret the t-value in words and interpret the p-value in words.
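    • YOUR ANSWER HERE: Our t-value of 1.85 means that the sample mean (10,390.95) is 1.85 standard errors above the hypothesized population mean of 10,000; because it is less than our critical value (1.97), we won’t reject \(H_0\). Our p-value of 0.066 means that, if \(H_0\) were true, there would be about a 6.6% chance of observing a t-value at least this far from zero; because it is greater than our alpha level of .05, we won’t reject \(H_0\).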
  • State the conclusion about your hypothesis test.
    • YOUR ANSWER HERE: We do not have sufficient evidence to reject the null hypothesis, \(H_0\), that the population mean price of off-campus room & board is $10,000.
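
To connect the decision back to the t.test() output shown earlier, the p-value, critical value, and confidence interval can be recomputed from the reported sample mean, t-value, and degrees of freedom. This is a sketch that uses only the numbers printed in that output; the variable names are illustrative assumptions, and the standard error is backed out from the t-value rather than computed from the data.

```r
# Values copied from the t.test() output for roomboard_off
x_bar <- 10390.95   # sample mean
mu_0 <- 10000       # mean under the null hypothesis
t_stat <- 1.8462    # reported t-value
df <- 199           # degrees of freedom (n - 1)
alpha <- 0.05

# Standard error implied by t = (x_bar - mu_0) / se
se <- (x_bar - mu_0) / t_stat

# Two-sided p-value: probability of a t-value at least this far from zero under H0
p_value <- 2 * pt(-abs(t_stat), df = df)

# Critical value for a two-sided test at the chosen alpha level
t_crit <- qt(1 - alpha / 2, df = df)

# 95 percent confidence interval around the sample mean
ci <- x_bar + c(-1, 1) * t_crit * se

round(c(se = se, p_value = p_value, t_crit = t_crit), 4)
round(ci, 1)

# Decision rule: reject H0 only if the p-value is below alpha
p_value < alpha
```

Both decision rules agree: the absolute t-value is below the critical value, and the p-value is above alpha, so the null hypothesis is not rejected.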

Part III: Post a comment/question

/2

  • Go to the class #problemsets channel and create a new post.
  • You can either:
    • Share something you learned or a question from this problem set. Make sure to mention the instructors (@ozanj, @Patricia Martín).
    • Respond to a post made by another student.

Knit to html and submit problem set

Knit to html by clicking the “Knit” button near the top of your RStudio window (icon with blue yarn ball) or drop down and select “Knit to HTML”

  • Go to the class website and under the “Readings & Assignments” >> “Week 3” tab, click on the “Problem set 1 submission link”
  • Submit both your html and .Rmd files
  • Use this naming convention “lastname_firstname_ps#” for your .Rmd (e.g. martin_patricia_ps1.Rmd)