Instructions
- We are going to work through some concepts discussed in the previous week’s lecture (e.g. sampling distribution, central limit theorem, working with user-written functions).
Task
Please run the code in the following chunk, which does the following:
- Loads libraries
- Loads and creates IPEDS data frame (population)
- Creates data frame of generated variables (population)
- Creates sample versions of the IPEDS and generated data frames
Note: code chunk omitted from html document using include = FALSE
Question 1
If we wanted to plot the distribution of the variable coa_grad_nres from the data frame df_ipeds_pop, we could use the user-written function called plot_distribution like so:
plot_distribution(data_vec = df_ipeds_pop$coa_grad_nres, plot_title = "Distribution of cost of attendance for out-of-state graduate students")

- Does the distribution above have a normal, left-skewed, or right-skewed shape? Why?
ANSWER
The variable coa_grad_nres appears to have a right-skewed distribution. The right tail is longer because of a small number of unusually high values (positive outliers). These outliers pull the mean upward, so the mean is higher than the median (mean > median).
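The mean-versus-median reasoning above can be checked numerically. The sketch below uses a simulated right-skewed variable as a stand-in (it is not the IPEDS data, and the parameters are arbitrary assumptions) and compares the two statistics.

```r
# Simulated stand-in for a right-skewed cost variable (NOT the IPEDS data)
set.seed(1)
x <- rlnorm(n = 5000, meanlog = 10, sdlog = 0.5)

mean_x <- mean(x)
median_x <- median(x)

# In a right-skewed distribution the long right tail pulls the mean above the median
c(mean = mean_x, median = median_x, mean_minus_median = mean_x - median_x)
```

With the course data, the same comparison would presumably use mean() and median() on df_ipeds_pop$coa_grad_nres, adding na.rm = TRUE if the variable contains missing values.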
- What is the standard deviation? Interpret this value in words.
ANSWER
The standard deviation is 10748.16. In words: on average, observations are about 10748 away from the mean of 34820.
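As a rough illustration of what the standard deviation measures, the sketch below computes it two ways on simulated data (a stand-in, not the IPEDS data): with the built-in sd() and by hand from the squared deviations. It also computes the mean absolute deviation, since the "on average" wording is a convenient approximation; strictly, the standard deviation is the square root of the average squared distance from the mean.

```r
# Simulated stand-in for a cost variable (NOT the IPEDS data)
set.seed(1)
x <- rlnorm(5000, meanlog = 10, sdlog = 0.5)

# Built-in sample standard deviation (n - 1 in the denominator)
sd_builtin <- sd(x)

# Same quantity by hand: square root of the average squared deviation from the mean
sd_by_hand <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))

# Literal "average distance from the mean", shown for comparison
mean_abs_dev <- mean(abs(x - mean(x)))

all.equal(sd_builtin, sd_by_hand)
c(sd = sd_builtin, mean_abs_dev = mean_abs_dev)
```

The mean absolute deviation is never larger than the standard deviation, because squaring gives extra weight to observations far from the mean.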
Question 2
Run the following code, which does the following:
- Takes 1000 random samples of sample size n=20 from the data frame df_ipeds_pop.
- For each random sample, calculates the sample mean of variable coa_grad_nres.
- Plots the sampling distribution of the sample mean of variable coa_grad_nres.
set.seed(345)
get_sampling_distribution(data_vec = df_ipeds_pop$coa_grad_nres, num_samples = 1000, sample_size = 20) %>%
  plot_distribution(plot_title = "Sampling Distribution of the Sample mean")
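The body of get_sampling_distribution is not shown in this document. A minimal sketch of what such a function might do, following the steps listed above, is given below. The name sampling_distribution_sketch and the simulated population are hypothetical; the course's actual function may handle missing values or sampling differently.

```r
# Hypothetical stand-in for get_sampling_distribution(); the course's actual
# implementation is not shown in this document and may differ.
sampling_distribution_sketch <- function(data_vec, num_samples, sample_size) {
  data_vec <- data_vec[!is.na(data_vec)]  # drop missing values before sampling
  # Draw num_samples random samples (without replacement) and keep each sample mean
  replicate(num_samples, mean(sample(data_vec, size = sample_size)))
}

set.seed(345)
pop <- rlnorm(5000, meanlog = 10, sdlog = 0.5)  # simulated population, NOT IPEDS

sample_means <- sampling_distribution_sketch(pop, num_samples = 1000, sample_size = 20)

length(sample_means)   # one sample mean per random sample
summary(sample_means)  # the sample means cluster around the population mean
mean(pop)
```

The result is a numeric vector with one element per random sample, which is the kind of input plot_distribution takes through its data_vec argument.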

- What does each observation in the above plot represent?
ANSWER
Each observation is the sample mean of coa_grad_nres from one random sample of size n=20.
- Would you describe the shape of the above distribution as (approximately) normal, left-skewed, or right-skewed?
ANSWER
Approximately normal.
Your turn
Run the following in the code chunk below:
- Take 1000 random samples of sample size n=200 from the data frame df_ipeds_pop.
- For each random sample, calculate the sample mean of variable coa_grad_nres.
- Plot the sampling distribution of the sample mean of variable coa_grad_nres.
SOLUTION
set.seed(345)
get_sampling_distribution(data_vec = df_ipeds_pop$coa_grad_nres, num_samples = 1000, sample_size = 200) %>%
  plot_distribution(plot_title = "Sampling Distribution of the Sample mean")
- Why is the standard error from this sampling distribution (each sample has sample size n=200) different from the standard error from the sampling distribution in the previous example (each sample has sample size n=20)?
ANSWER
The standard error is the standard deviation of the sampling distribution. It is larger when n=20 than when n=200 because each sample mean is based on fewer observations, so the sample means vary more from one random sample to the next. The larger the sample size, the smaller the standard error, and the smaller the standard error, the more precise our estimates of the population mean.
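The relationship between sample size and standard error follows the standard formula SE = sigma / sqrt(n), where sigma is the population standard deviation. The sketch below compares simulated standard errors for n=20 and n=200 against that formula. It uses a simulated population as a stand-in (not the IPEDS data), and simulated_se is a hypothetical helper introduced only for this illustration.

```r
set.seed(345)
pop <- rlnorm(5000, meanlog = 10, sdlog = 0.5)  # simulated population, NOT IPEDS

# Hypothetical helper: standard deviation of 1000 sample means for sample size n
simulated_se <- function(n) sd(replicate(1000, mean(sample(pop, size = n))))

se_20 <- simulated_se(20)
se_200 <- simulated_se(200)

# Theoretical values from sigma / sqrt(n)
theory_20 <- sd(pop) / sqrt(20)
theory_200 <- sd(pop) / sqrt(200)

round(c(se_20 = se_20, theory_20 = theory_20, se_200 = se_200, theory_200 = theory_200))

# Multiplying n by 10 shrinks the standard error by roughly sqrt(10)
se_ratio <- se_20 / se_200
se_ratio
```

Because the standard error shrinks with the square root of n, a tenfold increase in sample size reduces it by a factor of only about 3.16, not 10. The simulated ratio will differ slightly from that value because of simulation noise and because sampling here is done without replacement from a finite population.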
Question 3
Run the following code, which does the following:
- Plots the population distribution of the variable coa_grad_nres
- Plots the distribution of the variable coa_grad_nres from one sample
- Plots the sampling distribution of the sample mean for the variable coa_grad_nres
set.seed(345)
plot_distribution(df_ipeds_pop$coa_grad_nres, plot_title = 'Population distribution') +
plot_distribution(df_ipeds_sample$coa_grad_nres, plot_title = 'Single sample distribution') +
plot_distribution(get_sampling_distribution(data_vec = df_ipeds_pop$coa_grad_nres, num_samples = 1000, sample_size = 200), plot_title = "sampling distribution of sample mean of coa_grad_nres") +
plot_layout(ncol = 1)

- State the central limit theorem in your own words and explain why it is important for hypothesis testing.
ANSWER
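The central limit theorem says that when the sample size is sufficiently large, the sampling distribution of the sample mean is approximately normal, even if the population distribution is not (as with the right-skewed coa_grad_nres). This matters for hypothesis tests about a population parameter (e.g., a population mean or a population regression coefficient), because those tests are based on the sampling distribution of the relevant sample statistic. If the sampling distribution is normal, then we know what percent of observations fall within a certain number of standard deviations of the mean, which lets us judge how unlikely a given sample statistic would be under the null hypothesis.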
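To make the "percent of observations within a certain number of standard deviations" idea concrete, the sketch below computes those shares for a normal distribution with pnorm() and then checks one of them against a simulated sampling distribution. The population is simulated (not the IPEDS data) and share_within is a hypothetical helper name.

```r
# Hypothetical helper: share of a normal distribution within z standard
# deviations of its mean
share_within <- function(z) pnorm(z) - pnorm(-z)

# Shares within 1, 1.96, and 2 standard deviations
normal_shares <- sapply(c(1, 1.96, 2), share_within)
round(normal_shares, 3)

# Check against a simulated sampling distribution drawn from a right-skewed
# population (simulated, NOT IPEDS)
set.seed(345)
pop <- rlnorm(5000, meanlog = 10, sdlog = 0.5)
sample_means <- replicate(1000, mean(sample(pop, size = 200)))

# Standardize the sample means, then count the share within 1.96 standard errors
z <- (sample_means - mean(sample_means)) / sd(sample_means)
share_sim <- mean(abs(z) <= 1.96)
share_sim  # close to 0.95 when the sampling distribution is roughly normal
```

Even though the simulated population is right-skewed, about 95 percent of the sample means should fall within 1.96 standard errors of their center, which is the property that hypothesis tests and confidence intervals rely on.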