GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2020-spring/materials/tutorial_11/tutorial_11.ipynb

Kernel: R

Tutorial 11 - Introduction to statistical inference

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

Describe real-world examples of questions that can be answered with statistical inference methods.
Name common population parameters (e.g., mean, proportion, median, variance, standard deviation) that are often estimated using sample data, and use computation to estimate these.
Define the following statistical sampling terms (population, sample, population parameter, point estimate, sampling distribution).
Explain the difference between a population parameter and sample point estimate.
Use computation to draw random samples from a finite population.
Use computation to create a sampling distribution from a finite population.
Describe how sample size influences the sampling distribution.
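
To make the difference between a population parameter and a point estimate concrete, here is a minimal sketch. The population `toy_pop`, its column `value`, the sample size, and the seed are all hypothetical and are not part of this tutorial's data.

```r
library(tidyverse)

set.seed(1)  # hypothetical seed, used only to make the sketch reproducible

# Hypothetical finite population of 1000 values between 0 and 100
toy_pop <- tibble(value = runif(1000, min = 0, max = 100))

# Population parameter: computed from every member of the population
pop_mean <- toy_pop %>%
  summarise(pop_mean = mean(value)) %>%
  pull(pop_mean)

# Point estimate: the same statistic computed from one random sample of 20
one_sample <- sample_n(toy_pop, size = 20)
sample_mean <- one_sample %>%
  summarise(sample_mean = mean(value)) %>%
  pull(sample_mean)

# The estimate is usually close to, but not equal to, the parameter
c(parameter = pop_mean, estimate = sample_mean)
```

Re-running the sampling step with a different seed gives a different estimate, while the parameter stays fixed; that variation is what a sampling distribution describes.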

In [ ]:

### Run this cell before continuing.
library(tidyverse)
library(repr)
library(digest)
library(infer)
library(gridExtra)
source('tests_tutorial_11.R')
source('cleanup_tutorial_11.R')

Virtual sampling simulation

In this tutorial you will study samples and sample means generated from different distributions. In real life, we rarely, if ever, have measurements for our entire population. Here, however, we will make simulated datasets so we can understand the behaviour of sample means.

Suppose we had the data science final grades for a large population of students.

In [ ]:

# run this cell to simulate a finite population
set.seed(20201) # DO NOT CHANGE
students_pop <- tibble(grade = (rnorm(mean = 70, sd = 8, n = 10000)))
head(students_pop)

Question 1.0
{points: 1}

Visualize the distribution of the population (students_pop) that was just created by plotting a histogram, passing binwidth = 1 to geom_histogram. Name the plot pop_dist and give the x axis a descriptive label.

In [ ]:

options(repr.plot.width = 4, repr.plot.height = 3)
#pop_dist <- ggplot(..., ...) + 
#    geom_...(...) +
#    ... +
#    ggtitle("Population distribution")

# your code here
fail() # No Answer - remove if you provide an answer
pop_dist

In [ ]:

test_1.0()

Question 1.1
{points: 3}

Describe the distribution above in words. Comment on its shape, its center, and how spread out it is.

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

Question 1.2
{points: 1}

Use summarise to calculate the following population parameters from the students_pop population:

mean (use the mean function)
median (use the median function)
standard deviation (use the sd function)

Name this data frame pop_parameters. It should have the column names pop_mean, pop_med and pop_sd.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
pop_parameters

In [ ]:

test_1.2()

Exploring the sampling distribution of the sample mean for different populations

We will create the sampling distribution of the sample mean by taking 1500 random samples of size 5 from this population and then visualizing the distribution of the sample means.
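
The general recipe (draw many samples, compute one estimate per sample, plot the estimates) can be sketched on a different, made-up population. Everything named `toy_*`, the column `wait`, the seed, and the sample size and number of replicates below are hypothetical choices for illustration, not the values the questions ask for.

```r
library(tidyverse)
library(infer)

set.seed(2)  # hypothetical seed, used only to make the sketch reproducible

# Hypothetical right-skewed population: 5000 waiting times in minutes
toy_pop <- tibble(wait = rexp(5000, rate = 1 / 10))

# Step 1: draw many samples of the same size (here 200 samples of size 10)
toy_samples <- rep_sample_n(toy_pop, size = 10, reps = 200)

# Step 2: compute one point estimate (the mean) for each sample
toy_estimates <- toy_samples %>%
  group_by(replicate) %>%
  summarise(sample_mean = mean(wait))

# Step 3: plot the estimates; the histogram approximates the sampling distribution
toy_sampling_dist <- ggplot(toy_estimates, aes(x = sample_mean)) +
  geom_histogram(binwidth = 1) +
  xlab("Sample mean of waiting time (minutes)") +
  ggtitle("Sampling distribution (n = 10)")
toy_sampling_dist
```

rep_sample_n adds a replicate column identifying which sample each row belongs to, which is why grouping by replicate yields one mean per sample.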

Question 1.3
{points: 1}

Draw 1500 random samples from our population of students (students_pop). Each sample should have 5 observations. Name the data frame samples and use the seed 4321.

In [ ]:

#samples <- rep_sample_n(..., size = ..., reps = ...)
set.seed(4321) # DO NOT CHANGE!
# your code here
fail() # No Answer - remove if you provide an answer
head(samples)
tail(samples)
dim(samples)

In [ ]:

test_1.3()

Question 1.4
{points: 1}

Group by the sample replicate number, and then for each sample, calculate the mean. Name the data frame sample_estimates. The data frame should have the column names replicate and sample_mean.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(sample_estimates)
tail(sample_estimates)

In [ ]:

test_1.4()

Question 1.5
{points: 1}

Visualize the distribution of the sample estimates (sample_estimates) you just calculated by plotting a histogram, passing binwidth = 1 to geom_histogram. Name the plot sampling_distribution_5 and give both the plot (using ggtitle) and the x axis a descriptive label.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
sampling_distribution_5

In [ ]:

test_1.5()

Question 1.6
{points: 3}

Describe the distribution above in words. Comment on its shape, its center, and how spread out it is. Compare this sampling distribution to the population distribution of students' grades above.

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

Question 1.7
{points: 1}

Let's create a simulated dataset of the number of cups of coffee drunk per week for our population of students. Describe the resulting distribution in words. Comment on its shape, its center, and how spread out it is.

In [ ]:

# run this cell to simulate a finite population
coffee_data <- tibble(cups = c(
  rep(0, 166), rep(1, 45),
  rep(2, 43),  rep(3, 29),
  rep(4, 17),  rep(5, 17),
  rep(6, 5),   rep(7, 17),
  rep(8, 8),   rep(9, 3),
  rep(10, 13), rep(11, 1),
  rep(12, 0),  rep(13, 0),
  rep(14, 4),  rep(15, 1),
  rep(16, 1),  rep(21, 5)))

pop_dist <- ggplot(coffee_data, aes(cups)) + 
    geom_histogram(binwidth = 1) +
    xlab("Cups of coffee per week") +
    ggtitle("Population distribution") 
pop_dist

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

Question 1.8
{points: 1}

Repeat the steps in questions 1.3 - 1.5 with sample size 5, for this coffee population. You should end up with a plot of the sampling distribution called sampling_distribution_5.

In [ ]:

set.seed(4321) # DO NOT CHANGE!

# your code here
fail() # No Answer - remove if you provide an answer
sampling_distribution_5

In [ ]:

test_1.8()

Question 1.9
{points: 3}

Describe the distribution above in words. Comment on its shape, its center, and how spread out it is. Compare this sampling distribution to the population distribution above.

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

Question 2.0
{points: 1}

Repeat the steps in questions 1.3 - 1.5 with sample size 30, for this coffee population. You should end up with a plot of the sampling distribution called sampling_distribution_30.

In [ ]:

set.seed(4321) # DO NOT CHANGE!

# your code here
fail() # No Answer - remove if you provide an answer
sampling_distribution_30

In [ ]:

test_2.0()

Question 2.1
{points: 3}

Describe the distribution above in words. Comment on its shape, its center, and how spread out it is. Compare this sampling distribution, built from samples of size 30, to the sampling distribution built from samples of size 5.

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.
