GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-fall/slides/10_introduction_to_inference.ipynb

Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 11 - Introduction to inference & sampling through simulation

What is statistical inference?

Statistical inference is the process of using a small sample to draw conclusions about the wider population the sample came from.

Examples of types of inference: estimation and testing

Things we can do with inference

1. Make a statement such as this:

Based on the results of the latest poll, we estimate that 47.2% of Americans think that firearms should have strong regulations or restrictions when thinking about gun ownership rights and gun laws.

source: http://polling.reuters.com/#!response/PV20/type/smallest/dates/20180505-20181002/collapsed/true

This is estimation!

2. Answer a marketing question such as this:

What proportion of undergraduate students have an iPhone?

source: https://media.wired.com/photos/5b22c5c4b878a15e9ce80d92/master/w_582,c_limit/iphonex-TA.jpg

This can be answered with estimation!

3. or a health question such as this:

Are first babies born later than non-first born babies?

source: https://images.mentalfloss.com/sites/default/files/styles/mf_image_16x9/public/baby_0.jpg

This can be answered with a hypothesis test!

4. or an A/B testing question such as this:

Which of the 2 website designs will lead to more customer engagement (measured by click-through-rate, for example)?

source: https://images.ctfassets.net/zw48pl1isxmc/4QYN7VubAAgEAGs0EuWguw/165749ef2fa01c1c004b6a167fd27835/ab-testing.png

This can be answered with a hypothesis test!

Estimation

What is estimation? And how do we do it?

Marketing example revisited

Question: What proportion of undergraduate students have an iPhone?

How could we answer this question? Discuss with your neighbour.

What if we randomly selected a subset of students and asked each of them whether they have an iPhone? We could then calculate the proportion in that sample and use it as an estimate of the true population proportion (the parameter). Could this work?

Let's experiment and see how well sample estimates reflect the true population parameter we are interested in measuring!

Virtual sampling simulation

Let's create a virtual box of timbits (our population)
Let's each use R to:
- collect a random sample of 40 timbits,
- calculate the proportion of chocolate timbits in that sample,
- add our proportion to this shared Google sheet to build a distribution of this sample statistic.

source: https://insidetimmies.com/2014/05/20/tim-hortons-has-sold-400000-km-of-timbits-since-its-introduction-in-1976/

As always, load the libraries we'll be using:

In [4]:

# load libraries for wrangling and plotting
library(dplyr)
library(ggplot2)
library(infer) #install.packages("infer")

Out[4]:

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

1. Create a virtual box of timbits (population)

Let's "create" a population of 10000 timbits where the proportion of chocolate timbits is 0.63 and the proportion of old fashioned timbits is 0.37.

IMPORTANT - set your seed to 1234 so that we all create the same population!

In [5]:

# create virtual box
set.seed(1234)
# rbinom() returns 0/1 values; factor() labels 0 as "old fashioned" and 1 as "chocolate"
virtual_box <- tibble(timbit_id = seq(1, 10000, by = 1),
                      color = factor(rbinom(10000, 1, 0.63),
                                     labels = c("old fashioned", "chocolate")))
head(virtual_box)

Out[5]:

Here we use rbinom to "create" the population of 10000 timbits where the proportion of chocolate is 0.63 and the proportion of old fashioned is 0.37.

We also use seq to create a column called timbit_id that holds the value from 1 to 10000.

We use tibble to keep these two columns together as a data frame (a tibble is a special type of data frame that you will learn more about in the Data Wrangling course).
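To see what rbinom is doing here, the following is a minimal sketch on a toy scale, assuming only 10 draws instead of 10000. The names draws and flavours are made up for this illustration and are not part of the lecture code.

```r
# Toy illustration (assumption: 10 draws rather than the 10000 used in the lecture)
set.seed(1234)

# With size = 1, each draw is either 0 or 1, and is 1 with probability 0.63
draws <- rbinom(10, 1, 0.63)

# Turn the 0/1 values into labelled categories: 0 is old fashioned, 1 is chocolate
flavours <- factor(draws, levels = c(0, 1),
                   labels = c("old fashioned", "chocolate"))

table(flavours)  # how many of each kind we drew
mean(draws)      # proportion of 1s (chocolate) among the ten draws
```

With so few draws the proportion of chocolate can sit well away from 0.63; with 10000 draws it should land much closer.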

Sanity check that the virtual box contains ~ 63% chocolate timbits:

In [11]:

virtual_box %>% 
    group_by(color) %>% 
    summarize(n = n(),
             proportion = n() / 10000)

Out[11]:

2. Drawing a single sample of size 40

Let's simulate taking one random sample from our virtual timbits box. We will use the rep_sample_n function from the infer package:

In [21]:

# draw a single sample from the virtual box
set.seed(NULL)
samples_1 <- rep_sample_n(virtual_box, size = 40)
head(samples_1)

Out[21]:

We can tell by the timbit_id column that R indeed did what we asked - randomly selected 40 timbits from our virtual box.
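For intuition, drawing one sample without replacement can be sketched in base R with sample(). This is only a rough stand-in for what rep_sample_n does for us, not how the infer package implements it, and the names toy_box, sampled_rows and one_sample are hypothetical.

```r
# Toy population standing in for virtual_box (assumption: built with base R only)
set.seed(1234)
toy_box <- data.frame(
  timbit_id = 1:10000,
  color = factor(rbinom(10000, 1, 0.63), levels = c(0, 1),
                 labels = c("old fashioned", "chocolate"))
)

# Pick 40 distinct row numbers at random, then keep only those rows
sampled_rows <- sample(nrow(toy_box), size = 40)
one_sample <- toy_box[sampled_rows, ]

head(one_sample)

# anyDuplicated() returns 0 when no timbit_id appears more than once
anyDuplicated(one_sample$timbit_id)
```

Because sample() draws without replacement by default, no timbit can appear twice in the same sample.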

What is the proportion of chocolate in our single sample?

In [10]:

choc_sample <- summarize(samples_1,
                         n = sum(color == "chocolate"),
                         prop = sum(color == "chocolate") / 40)
choc_sample

Out[10]:

summarize collapses the rows of a data frame into summary values, here the count and the proportion of chocolate timbits (more about this in Data Wrangling).
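A tiny hand-checkable case may make this clearer. The sketch below assumes a made-up sample of five timbits; mini_sample and mini_summary are hypothetical names used only here.

```r
library(dplyr)

# Hypothetical mini-sample of five timbits, three of them chocolate
mini_sample <- tibble(color = c("chocolate", "old fashioned", "chocolate",
                                "chocolate", "old fashioned"))

# Five rows go in; a single row with two summary columns comes out
mini_summary <- summarize(mini_sample,
                          n = sum(color == "chocolate"),
                          prop = mean(color == "chocolate"))
mini_summary
```

Three of the five timbits are chocolate, so n is 3 and prop is 0.6. Taking the mean of a TRUE/FALSE comparison gives the same proportion as dividing the count by the sample size.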

Add our calculated proportion to the shared Google sheet

Google sheet

Now it's your turn! Go!

- Collect a random sample of 40 timbits and calculate the count and proportion of chocolate timbits (code given below).
- Add your proportion to this shared Google sheet to build a distribution of this sample statistic.

In [30]:

set.seed(1234) # so that we all have the same population
virtual_box <- tibble(timbit_id = seq(from = 1, to = 10000, by = 1),
                      color = factor(rbinom(10000, 1, 0.63), 
                                     levels = c(1, 0),
                                     labels = c("chocolate", "old fashioned")))
set.seed(NULL) # so that we each collect a different sample
choc_sample <- rep_sample_n(virtual_box, size = 40) %>% 
    summarize(n = sum(color == "chocolate"),
              prop = sum(color == "chocolate") / 40)
choc_sample

Out[30]:

Discussion time

How well do our samples represent the population parameter we are interested in (proportion of chocolate timbits)?
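If you want to explore this question on your own computer, one option is to let R play the whole class. The sketch below assumes 100 "students", each drawing one sample of 40 timbits via the reps argument of rep_sample_n; the number 100 and the names class_samples and class_props are choices made for this illustration, not part of the class activity.

```r
library(dplyr)
library(infer)

# Same population as in the lecture
set.seed(1234)
virtual_box <- tibble(timbit_id = seq(1, 10000, by = 1),
                      color = factor(rbinom(10000, 1, 0.63),
                                     levels = c(1, 0),
                                     labels = c("chocolate", "old fashioned")))

# Assumption: 100 students, each taking one sample of size 40
class_samples <- rep_sample_n(virtual_box, size = 40, reps = 100)

# One proportion per sample; the replicate column identifies each sample
class_props <- class_samples %>%
    group_by(replicate) %>%
    summarize(prop = mean(color == "chocolate"))

# Look at how the sample proportions spread around the population value of about 0.63
summary(class_props$prop)
hist(class_props$prop,
     main = "Proportion of chocolate in samples of 40",
     xlab = "Sample proportion")
```

The histogram plays the role of the shared Google sheet: each bar counts how many simulated students reported a proportion in that range.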

Back to our marketing example

Is randomly selecting a subset of the students (taking a single sample) and asking them whether they have an iPhone a good way to estimate the true proportion of all undergraduates who have iPhones (the population parameter we are interested in)?

Wrap-up

What did we learn so far today? Let's make a list here!

Questions that we will try to answer next?

Usually we only have one sample. So what can we do?

Acknowledgements

Data Science in a box by Mine Cetinkaya-Rundel
Inference in 3 hours by Allen Downey
Modern Dive: An Introduction to Statistical and Data Sciences via R by Chester Ismay and Albert Y. Kim
