EDUC152, Problem Set #1

Grade: /35

Overview

In this problem set, you will work with the IPEDS dataset from lecture. You are given the code to create several data frames and plotting functions that are used throughout the problem set. Part I gives you practice working with distributions and sampling distributions, and Part II gives you practice with hypothesis testing. Both parts include a mix of coding questions and conceptual questions that require a written response. Part III asks you to post on the class #problemsets channel about something you learned, or to reply to another student’s post.

If you have any questions about the problem set, please post them on the #problemsets Slack channel as well.

Create data and load functions

Please run the code in the following chunk, which does the following:

Loads libraries

Loads and creates IPEDS data frame (population)

Creates data frame of generated variables (population)

Creates sample versions of the IPEDS and generated data frames

Note: code chunk omitted from html document using include = FALSE

Part I: Distributions and sampling distribution

/5

1. Use the function plot_distribution(), which we created in the code chunk above, to plot the distribution of the variable norm_dist from the data frame df_generated_pop.

What is the standard deviation? Interpret this value in words.

YOUR ANSWER HERE:

Does the distribution above have a normal, left-skewed, or right-skewed shape? Why?

YOUR ANSWER HERE:

What is the “empirical rule”? Drawing from the empirical rule, what percentage of observations in the above distribution have values between 45 and 55? Between 40 and 60? Between 35 and 65?

Note: Make sure you answer all parts of the question.

YOUR ANSWER HERE:

/2

2. Use the function plot_distribution(), which we created above, to plot the distribution of the variable tuitfee_grad_nres from the data frame df_ipeds_pop.

Note: the data frame df_ipeds_pop contains data on the entire population of research/master’s universities, whereas the data frame df_ipeds_sample contains data on a random sample of universities from that population.

Does this variable appear to have a normal, left-skewed, or right-skewed distribution? Why?

YOUR ANSWER HERE:

/2

3. Use the function plot_distribution(), which we created above, to plot the distribution of the variable tuitfee_grad_nres from the data frame df_ipeds_sample.

Note: the data frame df_ipeds_pop contains data on the entire population of research/master’s universities, whereas the data frame df_ipeds_sample contains data on a random sample of universities from that population.

Does this variable appear to have a normal, left-skewed, or right-skewed distribution? Why?

YOUR ANSWER HERE:

/2

4. What is a sampling distribution? What is a sampling distribution of a sample mean?

YOUR ANSWER HERE:

/6

5. Run the following code, which does the following:

Takes 1000 random samples of sample size n=200 from the data frame df_ipeds_pop.

For each random sample, calculates the sample mean of variable tuitfee_grad_nres.

Plots the sampling distribution of the sample mean of variable tuitfee_grad_nres.

set.seed(124)
get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 200) %>%
  plot_distribution(plot_title = "Sampling Distribution of the Sample mean of out-of-state graduate tuition and fees")

#same as above
#plot_distribution(get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 200),plot_title = "sampling distribution of sample mean of tuitfee_grad_nres")
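The three steps listed above (draw many random samples, compute each sample's mean, collect the means) can be sketched in a few lines of base R. This is only an illustration: sketch_sampling_distribution() is a hypothetical stand-in, not the course's get_sampling_distribution() from the omitted setup chunk, and fake_pop is simulated data rather than IPEDS data.

```r
# Hypothetical stand-in for get_sampling_distribution(); the course's actual
# function is defined in the omitted setup chunk and may differ.
sketch_sampling_distribution <- function(data_vec, num_samples, sample_size) {
  data_vec <- data_vec[!is.na(data_vec)]  # drop missing values before sampling
  # Draw num_samples random samples and keep the mean of each one
  replicate(num_samples, mean(sample(data_vec, size = sample_size)))
}

set.seed(124)
# Simulated right-skewed "population" (an assumption, not IPEDS data)
fake_pop <- rgamma(5000, shape = 2, scale = 5000)

sample_means <- sketch_sampling_distribution(fake_pop, num_samples = 1000, sample_size = 200)

length(sample_means)  # one value per random sample
```

The resulting vector holds one sample mean per random sample, so it could be passed to hist() or to a plotting helper to view the sampling distribution.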

Answer the following questions with respect to the above plot (one sentence or less for each answer):

What does each observation in the above plot represent?

YOUR ANSWER HERE:

Would you describe the shape of the above distribution as (approximately) normal, left-skewed, or right-skewed?

YOUR ANSWER HERE:

Define what the concept “standard error” means (referring to the sampling distribution of the sample mean).

YOUR ANSWER HERE:

Why are the concepts “standard error” and “standard deviation of the sampling distribution” equivalent?

YOUR ANSWER HERE:

Interpret the value of standard error in the above plot in words

YOUR ANSWER HERE:

Write the formula for sample standard error and state what each component of the formula refers to (e.g., n refers to sample size)

YOUR ANSWER HERE:

/2

6. Run the following code, which does the following:

Takes 1000 random samples of sample size n=20 from the data frame df_ipeds_pop

For each random sample, calculates the sample mean of variable tuitfee_grad_nres

Plots the sampling distribution of the sample mean of variable tuitfee_grad_nres

set.seed(124)
get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 20) %>%
  plot_distribution(plot_title = "Sampling distribution of sample mean of tuitfee_grad_nres")

#,plot_title = 'Sampling distribution')
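To compare the spread of two sampling distributions numerically rather than by eye, one option is to compute the standard deviation of the simulated sample means for each sample size. The sketch below assumes a simulated population (fake_pop) and a hypothetical helper, sd_of_sample_means(); neither comes from the course code.

```r
set.seed(124)
# Simulated population (an assumption for illustration, not IPEDS data)
fake_pop <- rgamma(5000, shape = 2, scale = 5000)

# Hypothetical helper: standard deviation of 1000 sample means
# for samples of a given size drawn from fake_pop
sd_of_sample_means <- function(sample_size) {
  sd(replicate(1000, mean(sample(fake_pop, size = sample_size))))
}

se_n20  <- sd_of_sample_means(20)
se_n200 <- sd_of_sample_means(200)

# Print the two values side by side
c(n20 = se_n20, n200 = se_n200)
```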

Answer the following questions with respect to the above plot (one sentence or less for each answer):

Interpret the value of standard error in words

YOUR ANSWER HERE:

Why is the standard error from this sampling distribution (each sample has sample size n=20) larger than the standard error from the sampling distribution in the previous example (each sample has sample size n=200)?

YOUR ANSWER HERE:

/2

7. Run the following code, which does the following:

Plots the population distribution of the variable tuitfee_grad_nres

Plots the distribution of the variable tuitfee_grad_nres from one sample

Plots the sampling distribution of the sample mean for the variable tuitfee_grad_nres

set.seed(124)
plot_distribution(df_ipeds_pop$tuitfee_grad_nres, plot_title = 'Population distribution') +
  plot_distribution(df_ipeds_sample$tuitfee_grad_nres, plot_title = 'Single sample distribution') +
  plot_distribution(get_sampling_distribution(data_vec = df_ipeds_pop$tuitfee_grad_nres, num_samples = 1000, sample_size = 200),plot_title = "sampling distribution of sample mean of tuitfee_grad_nres") +
  plot_layout(ncol = 1)

State the central limit theorem in your own words and explain why it is important for hypothesis testing

YOUR ANSWER HERE:

Part II: Hypothesis testing

In this section we will be testing a hypothesis about the variable off-campus room and board (roomboard_off).

Here is how IPEDS defines concepts related to room and board and other expenses, from the IPEDS “Student Charges for Full Academic Year” 2019-20 academic year data dictionary [LINK]:

“Room charges”

The charges for an academic year for rooming accommodations for a typical student sharing a room with one other student.

“Board charges”

The charge for an academic year for meals, for a specified number of meals per week.

“Other expenses”

The amount of money (estimated by the financial aid office) needed by a student to cover expenses such as laundry, transportation, entertainment, and furnishings. (For the purpose of this survey room and board and tuition and fees are not included.)

Note that most of these variables seem to be defined for an academic year rather than a 12-month calendar year.

We have included some code to help you get to know the data. Run this code and take a look at the output.

Print observations for UC campuses

df_ipeds_pop %>%
  # keep UC campuses
  filter(unitid %in% c(110398,110635,110644,110653,110662,110671,110680,110699,110705,110714,445188,110699,110398)) %>%
  select(instnm,city,locale,roomboard_off,oth_expense_off) %>%
  as_factor()
#> # A tibble: 9 x 5
#>   instnm                    city       locale      roomboard_off oth_expense_off
#>   <chr>                     <chr>      <fct>               <dbl>           <dbl>
#> 1 University of California… Berkeley   City: Mids…         14771            5359
#> 2 University of California… Davis      Suburb: Sm…         10588            4856
#> 3 University of California… Irvine     City: Large         12861            5184
#> 4 University of California… Los Angel… City: Large         14303            5126
#> 5 University of California… Riverside  City: Large         10986            4792
#> 6 University of California… La Jolla   City: Large         13681            4760
#> 7 University of California… Santa Bar… Suburb: Mi…         12818            6045
#> 8 University of California… Santa Cruz City: Small         13216            5442
#> 9 University of California… Merced     Rural: Fri…          8595            4909

The variable locale categorizes universities by city/suburb/town/rural and by city size

#df_ipeds_pop %>% count(locale)
df_ipeds_pop %>% count(locale) %>% as_factor()
#> # A tibble: 12 x 2
#>    locale              n
#>    <fct>           <int>
#>  1 City: Large       254
#>  2 City: Midsize     142
#>  3 City: Small       147
#>  4 Suburb: Large     199
#>  5 Suburb: Midsize    25
#>  6 Suburb: Small      27
#>  7 Town: Fringe       25
#>  8 Town: Distant      84
#>  9 Town: Remote       66
#> 10 Rural: Fringe      18
#> 11 Rural: Distant      8
#> 12 Rural: Remote       4

Average cost of off-campus room & board

mean(df_ipeds_pop$roomboard_off, na.rm = TRUE)
#> [1] 10639.56

#alternative approach for calculating mean room and board
df_ipeds_pop %>% summarize(mean_roomboard_off = mean(roomboard_off, na.rm = TRUE))
#> # A tibble: 1 x 1
#>   mean_roomboard_off
#>                <dbl>
#> 1             10640.

Average cost of off-campus room & board, separately for each value of locale

df_ipeds_pop %>%
  group_by(locale) %>% #creates a separate group for each locale
  summarize(
    sample_size = n(), #gets the count of colleges in each locale
    mean_roomboard_off = mean(roomboard_off, na.rm = TRUE) #calculate the mean room and board costs for each locale
  ) %>%
  as_factor() #return as factor
#> # A tibble: 12 x 3
#>    locale          sample_size mean_roomboard_off
#>    <fct>                 <int>              <dbl>
#>  1 City: Large             254             11821.
#>  2 City: Midsize           142             10166.
#>  3 City: Small             147             10205.
#>  4 Suburb: Large           199             11123.
#>  5 Suburb: Midsize          25             11034.
#>  6 Suburb: Small            27             10597.
#>  7 Town: Fringe             25              9532.
#>  8 Town: Distant            84              8975.
#>  9 Town: Remote             66              9516.
#> 10 Rural: Fringe            18              9405.
#> 11 Rural: Distant            8             10308.
#> 12 Rural: Remote             4              8845

/5

1. What are the five steps in hypothesis testing? For each step, provide a one-sentence description.

YOUR ANSWER HERE:

/2

2. Hypothesis testing steps

In the below questions, you will conduct hypothesis testing steps to answer the research question, “Is the population mean of off-campus room & board equal to $10,000?” You will be using the variable roomboard_off from the data frame df_ipeds_sample, which is a single random sample from the population data frame df_ipeds_pop. You will use a two-sided alternative hypothesis with an alpha level (rejection region) of .05.

State the null and alternative (two-sided) hypotheses

YOUR ANSWER HERE:

/1

3. Use the t.test() function to calculate the test statistic
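As a reminder of how base R's t.test() is called for a one-sample test, here is a minimal sketch on simulated data. The vector fake_sample and its values are assumptions made up for illustration; your answer should use roomboard_off from df_ipeds_sample instead.

```r
set.seed(124)
# Simulated sample of 200 costs (hypothetical values, not IPEDS data)
fake_sample <- rnorm(200, mean = 10500, sd = 3000)

# One-sample, two-sided t-test of H0: mu = 10000
result <- t.test(fake_sample, mu = 10000, alternative = "two.sided")

result$statistic  # t-value
result$parameter  # degrees of freedom (n - 1)
result$p.value    # two-sided p-value
```

Printing result directly shows the same quantities in a single summary, along with a confidence interval for the mean.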

/4

4. Use the function plot_t_distribution(), which we created above, to plot the sampling distribution under the assumption that $H_0$ is true.

Interpret the t-value in words and interpret the p-value in words.

YOUR ANSWER HERE:

State the conclusion about your hypothesis test.

YOUR ANSWER HERE:

Part III: Post a comment/question

/2

Go to the class #problemsets channel and create a new post.

You can either:

Share something you learned or a question from this problem set. Make sure to mention the instructors (@ozanj, @Patricia Martín).

Respond to a post made by another student.

Knit to html and submit problem set

Knit to html by clicking the “Knit” button near the top of your RStudio window (icon with blue yarn ball) or drop down and select “Knit to HTML”

Go to the class website and under the “Readings & Assignments” >> “Week 3” tab, click on the “Problem set 1 submission link”

Submit both your html and .Rmd files

Use this naming convention “lastname_firstname_ps#” for your .Rmd (e.g. martin_patricia_ps1.Rmd)
