In this problem set, you will work with the IPEDS dataset from lecture. You will be given the code to create several data frames and plotting functions used throughout the problem set. Part I gives you practice with distributions and sampling distributions, and Part II gives you practice with hypothesis testing. Both parts include a mix of coding questions and conceptual questions that require a written response. Part III asks you to post on the class #problemsets channel about something you learned, or to reply to another student’s post.
If you have any questions about the problem set, please also post them on the #problemsets Slack channel.
Please run the code in the following chunk, which does the following:
Note: code chunk omitted from html document using include = FALSE
/5
Use plot_distribution() to plot the distribution of the variable norm_dist from the data frame df_generated_pop.
What is the standard deviation? Interpret this value in words.
Does the distribution above have a normal, left-skewed, or right-skewed shape? Why?
What is the “empirical rule”? Drawing from the empirical rule, what percentage of observations in the above distribution have values between 45 and 55? between 40 and 60? between 35 and 65?
Note: Make sure you answer all parts of the question.
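As a quick check on the empirical rule, the share of a normal distribution that falls within k standard deviations of the mean can be computed from the normal CDF. The following is a minimal sketch on a hypothetical normal variable (mean 100 and standard deviation 15, chosen only for illustration), not the data in df_generated_pop.

```r
# Hypothetical normal variable, for illustration only (not norm_dist)
mu <- 100
sigma <- 15

# Share of observations within k standard deviations of the mean
share_within <- function(k) {
  pnorm(mu + k * sigma, mean = mu, sd = sigma) -
    pnorm(mu - k * sigma, mean = mu, sd = sigma)
}

shares <- sapply(1:3, share_within)
names(shares) <- c("within 1 sd", "within 2 sd", "within 3 sd")
round(shares, 3)
```

The same function can be applied to any mean and standard deviation, since the shares depend only on k.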
YOUR ANSWER HERE:
/2
Use plot_distribution() to plot the distribution of the variable tuitfee_grad_nres from the data frame df_ipeds_pop.
df_ipeds_pop contains data on the entire population of research/master’s universities, whereas the data frame df_ipeds_sample contains data on a random sample of universities from that population.
Does this variable appear to have a normal, left-skewed, or right-skewed distribution? Why?
/2
Use plot_distribution() to plot the distribution of the variable tuitfee_grad_nres from the data frame df_ipeds_sample.
df_ipeds_pop contains data on the entire population of research/master’s universities, whereas the data frame df_ipeds_sample contains data on a random sample of universities from that population.
Does this variable appear to have a normal, left-skewed, or right-skewed distribution? Why?
/2
/6
df_ipeds_pop. tuitfee_grad_nres. tuitfee_grad_nres.
set.seed(124)
get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 200) %>%
plot_distribution(plot_title = "Sampling Distribution of the Sample mean of out-of-state graduate tuition and fees")
#same as above
#plot_distribution(get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 200),plot_title = "sampling distribution of sample mean of tuitfee_grad_nres")
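The code for get_sampling_distribution() is not shown in this document, so the following is only a sketch of what such a function might do: draw many random samples and record the mean of each. The name sketch_sampling_distribution and the simulated population vector fake_pop are hypothetical, and the course function may differ (for example, in how it handles missing values).

```r
# Hypothetical stand-in; the course's get_sampling_distribution() may differ
sketch_sampling_distribution <- function(data_vec, num_samples, sample_size) {
  data_vec <- data_vec[!is.na(data_vec)]  # drop missing values before sampling
  # For each of num_samples draws, take a random sample and keep its mean
  replicate(num_samples, mean(sample(data_vec, size = sample_size)))
}

set.seed(124)
# Simulated right-skewed "population" (assumption: not the IPEDS data)
fake_pop <- rexp(1500, rate = 1 / 20000)

sample_means <- sketch_sampling_distribution(
  fake_pop, num_samples = 1000, sample_size = 200
)
length(sample_means)  # one sample mean per draw
mean(sample_means)    # should be close to mean(fake_pop)
```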
Answer the following questions with respect to the above plot (one sentence or less for each answer):
n refers to sample size)
/2
df_ipeds_pop tuitfee_grad_nres tuitfee_grad_nres
set.seed(124)
get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 20) %>%
plot_distribution(plot_title = "Sampling distribution of sample mean of tuitfee_grad_nres")
#,plot_title = 'Sampling distribution')
Answer the following questions with respect to the above plot (one sentence or less for each answer):
/2
tuitfee_grad_nres tuitfee_grad_nres from one sample tuitfee_grad_nres
set.seed(124)
plot_distribution(df_ipeds_pop$tuitfee_grad_nres, plot_title = 'Population distribution') +
plot_distribution(df_ipeds_sample$tuitfee_grad_nres, plot_title = 'Single sample distribution') +
plot_distribution(get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 200),plot_title = "sampling distribution of sample mean of tuitfee_grad_nres") +
plot_layout(ncol = 1)
State the central limit theorem in your own words and explain why it is important for hypothesis testing.
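One way to see the idea at work is to compare the spread of sample means at two sample sizes against the theoretical standard error, \(\sigma/\sqrt{n}\). The following is a minimal sketch on a simulated skewed population (an assumption for illustration; it does not use the IPEDS data or the course helper functions).

```r
set.seed(124)
# Simulated right-skewed population (assumption for illustration)
pop <- rexp(10000, rate = 1)
sigma <- sd(pop)

# Standard deviation of 2000 sample means for a given sample size n
sd_of_means <- function(n) {
  sd(replicate(2000, mean(sample(pop, size = n))))
}

results <- data.frame(
  n = c(20, 200),
  simulated_se = c(sd_of_means(20), sd_of_means(200)),
  theoretical_se = sigma / sqrt(c(20, 200))
)
results
```

Even though the population is skewed, the simulated spread of the sample means should track \(\sigma/\sqrt{n}\) and shrink as n grows.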
In this section we will be testing a hypothesis about the variable off-campus room and board (roomboard_off).
Here is how IPEDS defines concepts related to room and board and other expenses, from the IPEDS “Student Charges for Full Academic Year” 2019-20 academic year data dictionary [LINK]:
Here, we have included some code to help you get to know the data. Just run this code and take a look at the output.
Print observations for UC campuses
df_ipeds_pop %>%
  # keep UC campuses (duplicate unitid values removed; the result is unchanged)
  filter(unitid %in% c(110398,110635,110644,110653,110662,110671,110680,110699,110705,110714,445188)) %>%
  select(instnm,city,locale,roomboard_off,oth_expense_off) %>% as_factor()
#> # A tibble: 9 x 5
#>   instnm                    city       locale      roomboard_off oth_expense_off
#>   <chr>                     <chr>      <fct>               <dbl>           <dbl>
#> 1 University of California… Berkeley   City: Mids…         14771            5359
#> 2 University of California… Davis      Suburb: Sm…         10588            4856
#> 3 University of California… Irvine     City: Large         12861            5184
#> 4 University of California… Los Angel… City: Large         14303            5126
#> 5 University of California… Riverside  City: Large         10986            4792
#> 6 University of California… La Jolla   City: Large         13681            4760
#> 7 University of California… Santa Bar… Suburb: Mi…         12818            6045
#> 8 University of California… Santa Cruz City: Small         13216            5442
#> 9 University of California… Merced     Rural: Fri…          8595            4909
The variable locale categorizes universities by city/suburb/town/rural and by city size.
#df_ipeds_pop %>% count(locale)
df_ipeds_pop %>% count(locale) %>% as_factor()
#> # A tibble: 12 x 2
#>    locale              n
#>    <fct>           <int>
#>  1 City: Large       254
#>  2 City: Midsize     142
#>  3 City: Small       147
#>  4 Suburb: Large     199
#>  5 Suburb: Midsize    25
#>  6 Suburb: Small      27
#>  7 Town: Fringe       25
#>  8 Town: Distant      84
#>  9 Town: Remote       66
#> 10 Rural: Fringe      18
#> 11 Rural: Distant      8
#> 12 Rural: Remote       4
Average cost of off-campus room & board
mean(df_ipeds_pop$roomboard_off, na.rm = TRUE)
#> [1] 10639.56
#alternative approach for calculating mean room and board
df_ipeds_pop %>% summarize(mean_roomboard_off = mean(roomboard_off, na.rm = TRUE))
#> # A tibble: 1 x 1
#> mean_roomboard_off
#> <dbl>
#> 1 10640.
Average cost of off-campus room & board, separately for each value of locale
df_ipeds_pop %>%
  group_by(locale) %>% # creates a separate group for each locale
  summarize(
    sample_size = n(), # gets the count of colleges in each locale
    mean_roomboard_off = mean(roomboard_off, na.rm = TRUE) # calculates the mean room and board cost for each locale
  ) %>%
  as_factor() # return as factor
#> # A tibble: 12 x 3
#>    locale          sample_size mean_roomboard_off
#>    <fct>                 <int>              <dbl>
#>  1 City: Large             254             11821.
#>  2 City: Midsize           142             10166.
#>  3 City: Small             147             10205.
#>  4 Suburb: Large           199             11123.
#>  5 Suburb: Midsize          25             11034.
#>  6 Suburb: Small            27             10597.
#>  7 Town: Fringe             25              9532.
#>  8 Town: Distant            84              8975.
#>  9 Town: Remote             66              9516.
#> 10 Rural: Fringe            18              9405.
#> 11 Rural: Distant            8             10308.
#> 12 Rural: Remote             4              8845
/5
/2
In the questions below, you will carry out the steps of a hypothesis test to answer the research question, “Is the population mean of off-campus room & board equal to $10,000?” You will use the variable roomboard_off from the data frame df_ipeds_sample, which is a single random sample from the population data frame df_ipeds_pop. You will use a two-sided alternative hypothesis with an alpha level (rejection region) of .05.
/1
Use the t.test() function to calculate the test statistic
/4
Use the function plot_t_distribution() we created above to plot the sampling distribution under the assumption that \(H_0\) is true.
/2
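The quantities behind a one-sample, two-sided t-test can also be computed by hand and compared with the output of t.test(). This sketch uses a simulated vector x in place of df_ipeds_sample$roomboard_off (the values are made up), and it does not use plot_t_distribution(), whose code is not shown in this document.

```r
set.seed(124)
# Simulated sample (assumption; stands in for a real variable)
x <- rnorm(200, mean = 10500, sd = 3000)
mu0 <- 10000   # value of the population mean under H0
alpha <- 0.05

n <- length(x)
se <- sd(x) / sqrt(n)                      # estimated standard error of the mean
t_stat <- (mean(x) - mu0) / se             # test statistic
deg_free <- n - 1                          # degrees of freedom
t_crit <- qt(1 - alpha / 2, deg_free)      # two-sided critical value
p_value <- 2 * pt(-abs(t_stat), deg_free)  # two-sided p-value

# Same test via t.test(); results should agree with the hand calculation
tt <- t.test(x, mu = mu0, alternative = "two.sided", conf.level = 1 - alpha)

c(t_by_hand = t_stat, t_from_t.test = unname(tt$statistic))
c(p_by_hand = p_value, p_from_t.test = tt$p.value)
abs(t_stat) > t_crit  # TRUE means reject H0 at the chosen alpha
```

Comparing abs(t_stat) with t_crit and comparing p_value with alpha are two views of the same decision rule, so they always lead to the same conclusion.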
Knit to HTML by clicking the “Knit” button near the top of your RStudio window (icon with blue yarn ball), or use the drop-down and select “Knit to HTML”.