In-class exercise: Distributions and Sampling distributions

Instructions

We are going to work through some concepts discussed in the previous week’s lecture (e.g. sampling distribution, central limit theorem, working with user-written functions).

Task

Please run the code in the following chunk, which does the following:

Loads libraries
Loads and creates IPEDS data frame (population)
Creates data frame of generated variables (population)
Creates sample versions of the IPEDS and generated data frames

Note: code chunk omitted from html document using include = FALSE

Question 1

If we wanted to plot the distribution of the variable coa_grad_nres from the data frame df_ipeds_pop, we could use the user-written function called plot_distribution like so:

plot_distribution(data_vec = df_ipeds_pop$coa_grad_nres, plot_title = "Distribution of cost of attendance for out-of-state graduate students")

Does the distribution above have a normal, left-skewed, or right-skewed shape? Why?

ANSWER

The variable coa_grad_nres appears to have a right-skewed distribution. The right tail is longer because of a few very large values (positive outliers). These outliers pull the mean up, so the mean is higher than the median (i.e., mean > median).

What is the standard deviation? Interpret this value in words.

ANSWER

The standard deviation is 10748.16. In words: on average, observations are about $10,748 away from the mean of $34,820.
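
The two answers above can be checked numerically. The sketch below uses a simulated right-skewed vector (an assumption for illustration; it is not the IPEDS data, and the names x, mean_x, median_x, sd_x, and sd_manual are made up here). It shows that the mean exceeds the median when the right tail is long, and that sd() is the square root of the average squared deviation from the mean (with an n - 1 denominator), which is what the informal "average distance from the mean" reading summarizes.

```r
# Simulated right-skewed data (hypothetical; NOT the IPEDS data)
set.seed(123)
x <- rlnorm(n = 5000, meanlog = 10, sdlog = 0.5)

mean_x <- mean(x)
median_x <- median(x)
sd_x <- sd(x)

# In a right-skewed distribution the long right tail pulls the mean above the median
mean_x > median_x

# sd() uses the n - 1 denominator: sqrt(sum((x - mean)^2) / (n - 1))
sd_manual <- sqrt(sum((x - mean_x)^2) / (length(x) - 1))
all.equal(sd_x, sd_manual)
```

Running the same three functions (mean, median, sd) on df_ipeds_pop$coa_grad_nres, with na.rm = TRUE if the variable has missing values, should reproduce the numbers quoted in the answer.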

Question 2

Run the following code, which does the following:

Takes 1000 random samples of sample size n=20 from the data frame df_ipeds_pop.
For each random sample, calculates the sample mean of variable coa_grad_nres.
Plots the sampling distribution of the sample mean of variable coa_grad_nres.

set.seed(345)
get_sampling_distribution(data_vec = df_ipeds_pop$coa_grad_nres, num_samples = 1000, sample_size = 20) %>%
  plot_distribution(plot_title = "Sampling Distribution of the Sample mean")

What does each observation in the above plot represent?

ANSWER

Each observation is the sample mean of coa_grad_nres from one random sample of size n=20.

Would you describe the shape of the above distribution as (approximately) normal, left-skewed, or right-skewed?

ANSWER

Approximately normal.
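
The three steps listed in Question 2 (draw many random samples, compute each sample's mean, collect the means) can be sketched in base R. The function below, sketch_sampling_distribution, is a hypothetical stand-in: the course's get_sampling_distribution is defined in the omitted code chunk and may work differently. The population here is simulated, not the IPEDS data.

```r
# Hypothetical stand-in for the course's get_sampling_distribution();
# the real function's internals are not shown in this document.
sketch_sampling_distribution <- function(data_vec, num_samples, sample_size) {
  # Drop missing values so mean() does not return NA
  data_vec <- data_vec[!is.na(data_vec)]
  # Draw num_samples random samples and keep the mean of each one
  replicate(num_samples, mean(sample(data_vec, size = sample_size)))
}

set.seed(345)
pop <- rexp(10000, rate = 1 / 35000)  # simulated right-skewed "population"
sample_means <- sketch_sampling_distribution(pop, num_samples = 1000, sample_size = 20)

length(sample_means)  # one sample mean per random sample
```

Each element of sample_means is one observation in the sampling distribution, which is why the plot in Question 2 shows 1000 sample means rather than individual values of coa_grad_nres.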

Your turn

Run the following in the code chunk below:

Take 1000 random samples of sample size n=200 from the data frame df_ipeds_pop.
For each random sample, calculate the sample mean of variable coa_grad_nres.
Plot the sampling distribution of the sample mean of variable coa_grad_nres.

SOLUTION

set.seed(345)
get_sampling_distribution(data_vec = df_ipeds_pop$coa_grad_nres, num_samples = 1000, sample_size = 200) %>%
  plot_distribution(plot_title = "Sampling Distribution of the Sample mean")

Why is the standard error from this sampling distribution (each sample has sample size n=200) different from the standard error from the sampling distribution in the previous example (each sample has sample size n=20)?

ANSWER

The standard error is larger for the sampling distribution built from samples of size n=20 than for the one built from samples of size n=200. The standard error of the sample mean is the population standard deviation divided by the square root of the sample size, so the larger the sample size, the smaller the standard error. A smaller standard error means our estimates of the population mean are more precise.
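
The relationship between sample size and standard error can be checked by simulation. The sketch below compares the standard deviation of 1000 sample means (the empirical standard error) with the theoretical value, sd(pop) / sqrt(n), for n=20 and n=200. The population is simulated and the helper empirical_se is a name introduced here for illustration; neither comes from the course code.

```r
set.seed(345)
pop <- rexp(10000, rate = 1 / 35000)  # simulated population, NOT the IPEDS data

# Standard deviation of 1000 sample means for a given sample size n
empirical_se <- function(n) sd(replicate(1000, mean(sample(pop, size = n))))

se_20 <- empirical_se(20)
se_200 <- empirical_se(200)

# Theoretical standard error: population SD divided by sqrt(n)
theory_20 <- sd(pop) / sqrt(20)
theory_200 <- sd(pop) / sqrt(200)

c(se_20 = se_20, theory_20 = theory_20, se_200 = se_200, theory_200 = theory_200)
```

Because the standard error shrinks with the square root of n, a tenfold increase in sample size (20 to 200) reduces it by a factor of about sqrt(10), roughly 3.2, not by a factor of 10.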

Question 3

Run the following code, which does the following:

Plots the population distribution of the variable coa_grad_nres
Plots the distribution of the variable coa_grad_nres from one sample
Plots the sampling distribution of the sample mean for the variable coa_grad_nres

set.seed(345)
plot_distribution(df_ipeds_pop$coa_grad_nres, plot_title = 'Population distribution') +
  plot_distribution(df_ipeds_sample$coa_grad_nres, plot_title = 'Single sample distribution') +
  plot_distribution(get_sampling_distribution(data_vec = df_ipeds_pop$coa_grad_nres, num_samples = 1000, sample_size = 200), plot_title = "sampling distribution of sample mean of coa_grad_nres") +
  plot_layout(ncol = 1)

State the central limit theorem in your own words and explain why it is important for hypothesis testing.

ANSWER

The central limit theorem states that, when the sample size is sufficiently large, the sampling distribution of the sample mean is approximately normal, even if the population distribution is not (as with the right-skewed coa_grad_nres). This is important because hypothesis tests about a population parameter (e.g., a population mean or a population regression coefficient) are based on the sampling distribution of the relevant sample statistic. If the sampling distribution is normal, then we know the percent of observations that fall within a certain number of standard deviations of the mean.
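
The "percent of observations within a certain number of standard deviations" can be computed directly from the normal distribution with pnorm, and compared against a simulated sampling distribution. This is a sketch under stated assumptions: the population is simulated (not the IPEDS data), and within_k_sd and share_within_2 are names introduced here for illustration.

```r
# Share of a normal distribution within k standard deviations of its mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)
round(within_k_sd(c(1, 2, 3)), 3)  # 0.683 0.954 0.997

# Compare with a simulated sampling distribution drawn from a skewed population
set.seed(345)
pop <- rexp(10000, rate = 1 / 35000)  # simulated right-skewed population
sample_means <- replicate(1000, mean(sample(pop, size = 200)))

# Standardize the sample means and count how many fall within 2 SDs
z <- (sample_means - mean(sample_means)) / sd(sample_means)
share_within_2 <- mean(abs(z) <= 2)
share_within_2
```

If the central limit theorem applies, share_within_2 should be close to the normal value of about 0.954 even though the population itself is strongly right-skewed. That known share is what lets a hypothesis test judge how unusual an observed sample mean is.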