Grade: /35

Overview

In this problem set, you will work with the IPEDS dataset from lecture. You will be given code that creates several data frames and plotting functions used throughout the problem set. Part I gives you practice working with distributions and sampling distributions, and Part II gives you practice with hypothesis testing. Both parts include a mix of coding questions and conceptual questions that require a written response. Part III asks you to post on the class #problemsets channel about something you learned, or to reply to another student’s post.

If you have any questions about the problem set, please post them on the #problemsets Slack channel.

Create data and load functions

Please run the code in the following chunk, which does the following:

  • Loads libraries
  • Loads and creates IPEDS data frame (population)
  • Creates data frame of generated variables (population)
  • Creates sample versions of the IPEDS and generated data frames

Note: code chunk omitted from html document using include = FALSE

Part I: Distributions and sampling distributions

/5

1. Use the plot_distribution() function we created in the code chunk above to plot the distribution of the variable norm_dist from the data frame df_generated_pop.

What is the standard deviation? Interpret this value in words.

  • YOUR ANSWER HERE:

Does the distribution above have a normal, left-skewed, or right-skewed shape? Why?

  • YOUR ANSWER HERE:

What is the “empirical rule”? Drawing on the empirical rule, what percentage of observations in the above distribution have values between 45 and 55? Between 40 and 60? Between 35 and 65?

  • Note: Make sure you answer all parts of the question.

  • YOUR ANSWER HERE:

/2

2. Use the plot_distribution() function we created to plot the distribution of the variable tuitfee_grad_nres from the data frame df_ipeds_pop.

  • Note: the data frame df_ipeds_pop contains data on the entire population of research/master’s universities, whereas the data frame df_ipeds_sample contains data on a random sample of universities from that population.

Does this variable appear to have a normal, left-skewed, or right-skewed distribution? Why?

  • YOUR ANSWER HERE:

/2

3. Use the plot_distribution() function we created to plot the distribution of the variable tuitfee_grad_nres from the data frame df_ipeds_sample.

  • Note: the data frame df_ipeds_pop contains data on the entire population of research/master’s universities, whereas the data frame df_ipeds_sample contains data on a random sample of universities from that population.

Does this variable appear to have a normal, left-skewed, or right-skewed distribution? Why?

  • YOUR ANSWER HERE:

/2

4. What is a sampling distribution? What is a sampling distribution of a sample mean?

  • YOUR ANSWER HERE:

/6

5. Run the following code, which does the following:

  • Takes 1000 random samples of sample size n=200 from the data frame df_ipeds_pop.
  • For each random sample, calculates the sample mean of variable tuitfee_grad_nres.
  • Plots the sampling distribution of the sample mean of variable tuitfee_grad_nres.
set.seed(124)
get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 200) %>%
  plot_distribution(plot_title = "Sampling Distribution of the Sample mean of out-of-state graduate tuition and fees")


#same as above
#plot_distribution(get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 200),plot_title = "sampling distribution of sample mean of tuitfee_grad_nres")
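
For intuition about what the steps listed above involve, here is a minimal sketch of how a helper like get_sampling_distribution() might work. This is an assumption, not the course's actual function (which is defined in the omitted setup chunk and may differ, for example in how it handles missing values or whether it samples with replacement). The name sketch_sampling_distribution and the simulated vector fake_pop are hypothetical and used only for illustration.

```r
# Hypothetical re-implementation for illustration only; the course's real helper
# lives in the omitted setup chunk and may behave differently.
sketch_sampling_distribution <- function(data_vec, num_samples, sample_size) {
  # Assumption: drop missing values before sampling
  data_vec <- data_vec[!is.na(data_vec)]
  # Assumption: sample without replacement; keep the mean of each random sample
  replicate(num_samples, mean(sample(data_vec, size = sample_size)))
}

set.seed(124)
# Simulated, right-skewed stand-in for a population variable (not IPEDS data)
fake_pop <- rexp(5000, rate = 1 / 20000)

sample_means <- sketch_sampling_distribution(
  data_vec = fake_pop,
  num_samples = 1000,
  sample_size = 200
)

length(sample_means)  # one sample mean per random sample
sd(sample_means)      # standard deviation of the sample means
```

The returned vector holds one value per random sample, which is the kind of input a plotting function such as plot_distribution() would then draw.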

Answer the following questions with respect to the above plot (one sentence or less for each answer):

  • What does each observation in the above plot represent?
    • YOUR ANSWER HERE:
  • Would you describe the shape of the above distribution as (approximately) normal, left-skewed, or right-skewed?
    • YOUR ANSWER HERE:
  • Define the concept “standard error” (referring to the sampling distribution of the sample mean).
    • YOUR ANSWER HERE:
  • Why are the concepts “standard error” and “standard deviation of the sampling distribution” equivalent?
    • YOUR ANSWER HERE:
  • Interpret the value of the standard error in the above plot in words.
    • YOUR ANSWER HERE:
  • Write the formula for the sample standard error and state what each component of the formula refers to (e.g., n refers to sample size).
    • YOUR ANSWER HERE:

/2

6. Run the following code, which does the following:

  • Takes 1000 random samples of sample size n=20 from the data frame df_ipeds_pop
  • For each random sample, calculates the sample mean of variable tuitfee_grad_nres
  • Plots the sampling distribution of the sample mean of variable tuitfee_grad_nres
set.seed(124)
get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 20) %>%
  plot_distribution(plot_title = "Sampling distribution of sample mean of tuitfee_grad_nres")

#,plot_title = 'Sampling distribution')

Answer the following questions with respect to the above plot (one sentence or less for each answer):

  • Interpret the value of standard error in words
    • YOUR ANSWER HERE:
  • Why is the standard error of this sampling distribution (each sample has sample size n=20) larger than the standard error of the sampling distribution from the previous question (each sample has sample size n=200)?
    • YOUR ANSWER HERE:

/2

7. Run the following code, which does the following:

  • Plots the population distribution of the variable tuitfee_grad_nres
  • Plots the distribution of the variable tuitfee_grad_nres from one sample
  • Plots the sampling distribution of the sample mean for the variable tuitfee_grad_nres
set.seed(124)
plot_distribution(df_ipeds_pop$tuitfee_grad_nres, plot_title = 'Population distribution') +
  plot_distribution(df_ipeds_sample$tuitfee_grad_nres, plot_title = 'Single sample distribution') +
  plot_distribution(get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 200),plot_title = "sampling distribution of sample mean of tuitfee_grad_nres") +
  plot_layout(ncol = 1)

State the central limit theorem in your own words and explain why it is important for hypothesis testing.

  • YOUR ANSWER HERE:

Part II: Hypothesis testing

In this section we will be testing a hypothesis about the variable off-campus room and board (roomboard_off).

Here is how IPEDS defines concepts related to room and board and other expenses, from the IPEDS “Student Charges for Full Academic Year” 2019-20 academic year data dictionary [LINK]:

  • “Room charges”
    • The charges for an academic year for rooming accommodations for a typical student sharing a room with one other student.
  • “Board charges”
    • The charge for an academic year for meals, for a specified number of meals per week.
  • “Other expenses”
    • The amount of money (estimated by the financial aid office) needed by a student to cover expenses such as laundry, transportation, entertainment, and furnishings. (For the purpose of this survey room and board and tuition and fees are not included.)
  • Note that most of these variables seem to be defined for an academic year rather than a 12-month calendar year.

We have included some code to help you get to know the data. Just run this code and take a look at the output.

Print observations for UC campuses

df_ipeds_pop %>%
  # keep UC campuses
  filter(unitid %in% c(110398,110635,110644,110653,110662,110671,110680,110699,110705,110714,445188)) %>%
  select(instnm,city,locale,roomboard_off,oth_expense_off) %>% as_factor()
#> # A tibble: 9 x 5
#>   instnm                    city       locale      roomboard_off oth_expense_off
#>   <chr>                     <chr>      <fct>               <dbl>           <dbl>
#> 1 University of California… Berkeley   City: Mids…         14771            5359
#> 2 University of California… Davis      Suburb: Sm…         10588            4856
#> 3 University of California… Irvine     City: Large         12861            5184
#> 4 University of California… Los Angel… City: Large         14303            5126
#> 5 University of California… Riverside  City: Large         10986            4792
#> 6 University of California… La Jolla   City: Large         13681            4760
#> 7 University of California… Santa Bar… Suburb: Mi…         12818            6045
#> 8 University of California… Santa Cruz City: Small         13216            5442
#> 9 University of California… Merced     Rural: Fri…          8595            4909

The variable locale categorizes universities by city/suburb/town/rural and by city size.

#df_ipeds_pop %>% count(locale)
df_ipeds_pop %>% count(locale) %>% as_factor()
#> # A tibble: 12 x 2
#>    locale              n
#>    <fct>           <int>
#>  1 City: Large       254
#>  2 City: Midsize     142
#>  3 City: Small       147
#>  4 Suburb: Large     199
#>  5 Suburb: Midsize    25
#>  6 Suburb: Small      27
#>  7 Town: Fringe       25
#>  8 Town: Distant      84
#>  9 Town: Remote       66
#> 10 Rural: Fringe      18
#> 11 Rural: Distant      8
#> 12 Rural: Remote       4

Average cost of off-campus room & board

mean(df_ipeds_pop$roomboard_off, na.rm = TRUE)
#> [1] 10639.56

#alternative approach for calculating mean room and board
df_ipeds_pop %>% summarize(mean_roomboard_off = mean(roomboard_off, na.rm = TRUE))
#> # A tibble: 1 x 1
#>   mean_roomboard_off
#>                <dbl>
#> 1             10640.

Average cost of off-campus room & board, separately for each value of locale

df_ipeds_pop %>% group_by(locale) %>% #creates a separate group for each locale
  summarize(
    sample_size = n(), #gets the count of colleges in each locale
    mean_roomboard_off = mean(roomboard_off, na.rm = TRUE) #calculate the mean room and board costs for each locale
    ) %>% as_factor() #return as factor
#> # A tibble: 12 x 3
#>    locale          sample_size mean_roomboard_off
#>    <fct>                 <int>              <dbl>
#>  1 City: Large             254             11821.
#>  2 City: Midsize           142             10166.
#>  3 City: Small             147             10205.
#>  4 Suburb: Large           199             11123.
#>  5 Suburb: Midsize          25             11034.
#>  6 Suburb: Small            27             10597.
#>  7 Town: Fringe             25              9532.
#>  8 Town: Distant            84              8975.
#>  9 Town: Remote             66              9516.
#> 10 Rural: Fringe            18              9405.
#> 11 Rural: Distant            8             10308.
#> 12 Rural: Remote             4              8845

/5

1. What are the five steps in hypothesis testing? For each step, provide a one-sentence description.

  • YOUR ANSWER HERE:

/2

2. Hypothesis testing steps

In the questions below, you will carry out the hypothesis testing steps to answer the research question, “Is the population mean of off-campus room & board equal to $10,000?” You will use the variable roomboard_off from the data frame df_ipeds_sample, which is a single random sample from the population data frame df_ipeds_pop. You will use a two-sided alternative hypothesis with an alpha level (significance level) of .05.

  • State the null and alternative (two-sided) hypotheses.
    • YOUR ANSWER HERE:

/1

3. Use the t.test() function to calculate the test statistic.
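
As a reminder of the general shape of the call, here is a minimal sketch of a one-sample, two-sided t-test using base R's t.test() on simulated data. The vector fake_costs and the hypothesized value of 100 are made up for illustration and are not part of the problem set; you will need to substitute the variable and hypothesized mean described above.

```r
set.seed(1)
# Simulated data (hypothetical): 50 draws from a normal distribution
fake_costs <- rnorm(50, mean = 105, sd = 15)

# One-sample t-test of H0: mu = 100 against a two-sided alternative
result <- t.test(fake_costs, mu = 100, alternative = "two.sided", conf.level = 0.95)

result$statistic  # t-value
result$parameter  # degrees of freedom (n - 1)
result$p.value    # two-sided p-value
```

Printing result directly shows the same quantities along with the sample mean and a confidence interval.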

/4

4. Use the plot_t_distribution() function we created above to plot the sampling distribution under the assumption that \(H_0\) is true.

  • Interpret the t-value in words and interpret the p-value in words.
    • YOUR ANSWER HERE:
  • State the conclusion about your hypothesis test.
    • YOUR ANSWER HERE:

Part III: Post a comment/question

/2

  • Go to the class #problemsets channel and create a new post.
  • You can either:
    • Share something you learned or a question from this problem set. Make sure to mention the instructors (@ozanj, @Patricia Martín).
    • Respond to a post made by another student.

Knit to html and submit problem set

Knit to html by clicking the “Knit” button near the top of your RStudio window (icon with blue yarn ball), or use the drop-down and select “Knit to HTML”.

  • Go to the class website and under the “Readings & Assignments” >> “Week 3” tab, click on the “Problem set 1 submission link”
  • Submit both your html and .Rmd files
  • Use this naming convention “lastname_firstname_ps#” for your .Rmd (e.g. martin_patricia_ps1.Rmd)