Worksheet 11 - Introduction to statistical inference
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
Describe real-world examples of questions that can be answered with statistical inference methods.
Name common population parameters (e.g., mean, proportion, median, variance, standard deviation) that are often estimated using sample data, and use computation to estimate these.
Define the following statistical sampling terms (population, sample, population parameter, point estimate, sampling distribution).
Explain the difference between a population parameter and sample point estimate.
Use computation to draw random samples from a finite population.
Use computation to create a sampling distribution from a finite population.
Describe how sample size influences the sampling distribution.
Question 1.0 Multiple Choice:
{points: 1}
In which of the following questions would inferential methods (e.g., estimation or hypothesis testing) be appropriate?
A. Does treating a corn crop with Roundup cause greater yields compared to corn crops that are not treated with pesticides in Saskatchewan?
B. Are yields of corn crops which are treated with Roundup different than corn crops which are not treated with pesticides in Saskatchewan?
C. What will be the yield of a corn crop in Saskatchewan if we treat it with Roundup next year?
D. Are yields of corn crops which are treated with Roundup different than corn crops which are not treated with pesticides in the data set collected from the Rural Municipality of Cymri No. 36 in Saskatchewan?
Assign your answer to an object called `answer1.0`. Your answer should be a single character surrounded by quotes.
Question 1.1 Matching:
{points: 1}
Read the mixed-up table below and assign each variable in the code cell below a number that matches the term to its correct definition. Do not put quotation marks around the number or include words in the answer; we expect the assigned values to be numbers.
Terms | Definitions |
---|---|
point estimate | 1. the entire set of entities/objects of interest |
population | 2. selecting a subset of observations from a population where each observation is equally likely to be selected at any point during the selection process |
random sampling | 3. a numerical summary value about the population |
representative sampling | 4. a distribution of point estimates, where each point estimate was calculated from a different random sample from the same population |
population parameter | 5. a collection of observations from a population |
sample | 6. a single number calculated from a random sample that estimates an unknown population parameter of interest |
observation | 7. selecting a subset of observations from a population where the sample’s characteristics are a good representation of the population’s characteristics |
sampling distribution | 8. a quantity or a quality (or set of these) we collect from a given entity/object |
Virtual sampling simulation
In real life, we rarely, if ever, have measurements for our entire population. Here, however, we will pretend that we were somehow able to ask every single Canadian senior what their age is. We will do this so that we can experiment to learn about sampling and how it relates to estimation.
Here's a human readable scenario that we will create a population for:
Here we make a simulated dataset of ages for our population (all Canadian seniors) bounded by realistic values (65 and 118):
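The population-generating code is not reproduced here. As a minimal sketch of one way such a population could be simulated, the base R block below assumes a right-skewed age distribution built from an exponential draw. The distribution, its rate, and the population size of 100,000 are illustrative assumptions, not the worksheet's actual values.

```r
# Hypothetical sketch: simulate a right-skewed population of senior ages.
# The exponential shape, the rate, and the population size are assumptions.
set.seed(4321)

raw_ages <- 65 + rexp(100000, rate = 0.1)

# Keep only ages within the realistic bounds stated above (65 to 118).
can_seniors <- data.frame(age = raw_ages[raw_ages <= 118])

range(can_seniors$age)  # smallest and largest simulated age
nrow(can_seniors)       # number of individuals kept in the population
```

Adding 65 to a non-negative draw enforces the lower bound, and the filter enforces the upper bound.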
Question 1.2
{points: 1}
A distribution defines all the possible values (or intervals) of the data and how often they occur. Visualize the distribution of the population (`can_seniors`) that was just created by plotting a histogram, passing `binwidth = 1` as an argument to `geom_histogram`. Name the plot `pop_dist` and give the x axis a descriptive label.
Question 1.3
{points: 1}
Distributions are complicated to communicate, so we often want to represent them by a single value or a small number of values. Common values used for this include the mean, median, standard deviation, etc.
Use `summarise` to calculate the following population parameters from the `can_seniors` population:
mean (use the `mean` function)
median (use the `median` function)
standard deviation (use the `sd` function)
Name this data frame `pop_parameters`, which has the column names `pop_mean`, `pop_med` and `pop_sd`.
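To illustrate what a one-row data frame of population parameters looks like, here is a minimal base R sketch on a stand-in population. The names `toy_pop` and `toy_parameters` and the simulated ages are hypothetical, and the worksheet itself asks for `summarise` rather than this base R construction.

```r
# Stand-in population (assumed shape; not the worksheet's actual data).
set.seed(1)
toy_pop <- data.frame(age = 65 + rexp(10000, rate = 0.1))

# One row, with one column per population parameter.
toy_parameters <- data.frame(
  pop_mean = mean(toy_pop$age),
  pop_med  = median(toy_pop$age),
  pop_sd   = sd(toy_pop$age)
)
toy_parameters
```

Note that `sd` divides by n - 1 rather than n. For a population this large, the difference from the population standard deviation is negligible.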
Question 1.4
{points: 1}
In real life, we usually are able to only collect a single sample from the population. We use that sample to try to infer what the population looks like.
Take a single random sample of 40 observations from the Canadian seniors population (`can_seniors`). Name it `sample_1`. Use 4321 as your seed.
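The mechanics of a seeded random draw can be sketched in base R as follows. The population, the seed values, and the names `toy_pop` and `toy_sample` are hypothetical stand-ins chosen for illustration.

```r
# Stand-in population (assumed shape; not the worksheet's actual data).
set.seed(1)
toy_pop <- data.frame(age = 65 + rexp(10000, rate = 0.1))

# Setting the seed makes the "random" draw reproducible.
set.seed(2020)

# 40 distinct row indices, drawn without replacement.
rows <- sample(nrow(toy_pop), size = 40)
toy_sample <- toy_pop[rows, , drop = FALSE]

nrow(toy_sample)       # number of observations in the sample
mean(toy_sample$age)   # a point estimate of the population mean
```

Re-running the block with the same seed returns the same 40 rows; changing the seed gives a different sample.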
Question 1.5
{points: 1}
Visualize the distribution of the random sample you just took (`sample_1`) by plotting a histogram, passing `binwidth = 1` as an argument to `geom_histogram`. Name the plot `sample_1_dist` and give the plot a title (using `ggtitle`) and the x axis a descriptive label.
Question 1.6
{points: 1}
Use `summarise` to calculate the following point estimates from the random sample you just took (`sample_1`):
mean
median
standard deviation
Name this data frame `sample_1_estimates`, which has the column names `sample_1_mean`, `sample_1_med` and `sample_1_sd`.
Let's now compare our random sample to the population from where it was drawn:
And now let's compare the point estimates (mean, median and standard deviation) with the true population parameters we were trying to estimate:
Question 1.7 Multiple Choice
{points: 1}
After comparing the population and sample distributions above, and the true population parameters and the sample point estimates, which statement below is not correct:
A. The sample point estimates are close to the values for the true population parameters we are trying to estimate
B. The sample distribution is of a similar shape to the population distribution
C. The sample point estimates are identical to the values for the true population parameters we are trying to estimate
Assign your answer to an object called `answer1.7`. Your answer should be a single character surrounded by quotes.
Question 1.8.0
{points: 0}
What if we took another sample? What would we expect? Let's try! Take another random sample from the population (use a different random seed this time so that you get a different sample), visualize its distribution, and calculate the point estimates for the sample mean, median and standard deviation:
Question 1.8.1
{points: 1}
After comparing the distribution and point estimates of this second random sample with those of the first random sample and the population, which statement below is not correct:
A. The sample distributions from different random samples are of a similar shape to the population distribution, but they vary a bit depending which values are captured in the sample
B. The sample point estimates from different random samples are close to the values for the true population parameters we are trying to estimate, but they vary a bit depending which values are captured in the sample
C. Every random sample from the same population should have an identical distribution and yield identical point estimates.
Assign your answer to an object called `answer1.8.1`. Your answer should be a single character surrounded by quotes.
Exploring the sampling distribution of an estimate
Just how much should we expect the point estimates of our random samples to vary? To build an intuition for this, let's experiment a little more with our population of Canadian seniors. To do this we will take 1500 random samples, and then calculate the point estimate we are interested in (let's choose the mean for this example) for each sample. Finally, we will visualize the distribution of the sample point estimates. This distribution will tell us how much we would expect the point estimates of our random samples to vary for this population for samples of size 40 (the size of our samples).
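The procedure described above (draw many samples, compute the mean of each, then look at the distribution of those means) can be sketched in base R. The stand-in population `toy_pop` and the seed values are hypothetical, and `replicate` is used here only as a simple way to repeat the draw.

```r
# Stand-in population (assumed shape; not the worksheet's actual data).
set.seed(1)
toy_pop <- 65 + rexp(10000, rate = 0.1)

# Draw 1500 samples of size 40 and keep the mean of each one.
set.seed(2020)
sample_means <- replicate(1500, mean(sample(toy_pop, size = 40)))

mean(toy_pop)        # population mean
mean(sample_means)   # centre of the sampling distribution
sd(sample_means)     # spread of the sample means
```

The last value describes how much the sample means vary from one sample of size 40 to the next for this stand-in population.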
Question 1.9
{points: 1}
Draw 1500 random samples from our population of Canadian seniors (`can_seniors`). Each sample should have 40 observations. Name the data frame `samples` and use the seed `4321`.
Question 2.0
{points: 1}
Group by the sample replicate number, and then for each sample, calculate the mean as the point estimate. Name the data frame `sample_estimates`. The data frame should have the column names `replicate` and `sample_mean`.
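To make the grouped calculation concrete, the base R sketch below builds a small long-format data frame with a `replicate` column and computes one mean per replicate. The names `toy_pop`, `toy_samples` and `toy_estimates`, the choice of 5 replicates, and the use of `aggregate` in place of a grouped `summarise` are all assumptions made for illustration.

```r
# Stand-in population (assumed shape; not the worksheet's actual data).
set.seed(1)
toy_pop <- 65 + rexp(10000, rate = 0.1)

# Long data frame: 5 samples of size 40, labelled by replicate number.
set.seed(2020)
n_reps <- 5
toy_samples <- data.frame(
  replicate = rep(seq_len(n_reps), each = 40),
  age = unlist(lapply(seq_len(n_reps), function(i) sample(toy_pop, size = 40)))
)

# One mean per replicate, then rename the columns.
toy_estimates <- aggregate(age ~ replicate, data = toy_samples, FUN = mean)
names(toy_estimates) <- c("replicate", "sample_mean")
toy_estimates
```

The result has one row per sample, which is the shape needed to plot a sampling distribution.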
Question 2.1
{points: 1}
Visualize the distribution of the sample estimates (`sample_estimates`) you just calculated by plotting a histogram, passing `binwidth = 1` as an argument to `geom_histogram`. Name the plot `sampling_distribution` and give the plot a title (using `ggtitle`) and the x axis a descriptive label.
Question 2.2
{points: 1}
Let's refresh our memories: what is the mean age of the population (we calculated this above)? Assign your answer to an object called `answer2.2`. Your answer should be a single number.
Question 2.3 Multiple Choice
{points: 1}
Considering the true value for the population mean, and the sampling distribution you created and visualized in question 2.1, which statement below is not correct:
A. The sampling distribution is centered at the true population mean
B. All the sample means are the same value as the true population mean
C. Most sample means are at or very near the same value as the true population mean
D. Few sample means are far away from the same value as the true population mean
Assign your answer to an object called `answer2.3`. Your answer should be a single character surrounded by quotes.
Question 2.4 True/False
{points: 1}
Taking a random sample and calculating a point estimate is a good way to get a "best guess" of the population parameter you are interested in. True or False?
Assign your answer to an object called `answer2.4`. Your answer should be either "True" or "False", surrounded by quotes.
The influence of sample size on the sampling distribution
What happens to our point estimate when we change the sample size? Let's answer this question by experimenting! We will create 3 different sampling distributions of sample means, each using a different sample size. As we did above, we will draw samples from our Canadian seniors population. We will visualize these sampling distributions and see if we can see a pattern when we vary the sample size.
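Alongside the histograms, the same experiment can be summarized numerically. The base R sketch below measures the spread of 1500 sample means at each of three sample sizes. The stand-in population `toy_pop`, the helper `spread_of_means`, and the seed values are hypothetical.

```r
# Stand-in population (assumed shape; not the worksheet's actual data).
set.seed(1)
toy_pop <- 65 + rexp(10000, rate = 0.1)

# Standard deviation of `reps` sample means for a given sample size n.
spread_of_means <- function(n, reps = 1500) {
  sd(replicate(reps, mean(sample(toy_pop, size = n))))
}

set.seed(2020)
spreads <- sapply(c(20, 40, 100), spread_of_means)
names(spreads) <- c("n = 20", "n = 40", "n = 100")
spreads
```

Comparing the three values gives a numerical counterpart to the pattern visible in the side-by-side plots.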
Question 2.5
{points: 1}
Using the same strategy as you did above, draw 1500 random samples from the Canadian seniors population (`can_seniors`), each of size 20. For each sample, calculate the mean age. Then visualize the distribution of the sample estimates (means) you just calculated by plotting a histogram, passing `binwidth = 1` as an argument to `geom_histogram`. Name the plot `sampling_distribution_20` and give the x axis a descriptive label. Give the plot the title "n = 20". Also specify the x-axis limits to be 65 and 95 using `xlim(c(65, 95))`.
Set the seed as 4321 when you collect your samples.
Question 2.6
{points: 1}
Using the same strategy as you did above, draw 1500 random samples from the Canadian seniors population (`can_seniors`), each of size 100. For each sample, calculate the mean age. Then visualize the distribution of the sample estimates (means) you just calculated by plotting a histogram, passing `binwidth = 1` as an argument to `geom_histogram`. Name the plot `sampling_distribution_100` and give the x axis a descriptive label. Give the plot the title "n = 100". Also specify the x-axis limits to be 65 and 95 using `xlim(c(65, 95))`.
Set the seed as 4321 when you collect your samples.
Question 2.7
{points: 0}
Fill in the blanks in the code below to use `grid.arrange` to plot the three sampling distributions side-by-side. Order them from smallest sample size on the left, to largest sample size on the right. Name the final panel figure `sampling_distribution_panel`.
Question 2.8 Multiple Choice
{points: 1}
Considering the panel figure you created above in question 2.7, which statement below is not correct:
A. As the sample size increases, the sampling distribution of the point estimate becomes narrower.
B. As the sample size increases, more sample point estimates are closer to the true population mean.
C. As the sample size decreases, the sample point estimates become more variable.
D. As the sample size increases, the sample point estimates become more variable.
Assign your answer to an object called `answer2.8`. Your answer should be a single character surrounded by quotes.
Question 2.9 True/False
{points: 1}
Given what you observed above, and considering the real life scenario where you will only have one sample, answer the True/False question below:
The smaller your random sample, the better your sample point estimate reflects the true population parameter you are trying to estimate. True or False?
Assign your answer to an object called `answer2.9`. Your answer should be either "True" or "False", surrounded by quotes.