A/B Testing Hypothesis tests

Scenario

You are a data scientist working for a web-based company and you are planning to roll out a new site design soon. For random samples of users one of two competing designs were presented and the ultimate purchase total was recorded (if any).

Your task is to determine which of the two designs yields higher total purchases, and if the result is statistically significant.

In [ ]:

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns

%matplotlib inline

In [ ]:

# Generate some data and randomize

# Some people bought nothing, the others bought with some distribution
data1 = [0] * 50
data1.extend(stats.norm(12, 4).rvs(150))
np.random.shuffle(data1)

# The second design hooked less people, but those that were hooked bought more stuff
data2 = [0] * 100
data2.extend(stats.norm(20, 5).rvs(100))
np.random.shuffle(data2)

# Make a data frame
df = pd.DataFrame()
df["A"] = data1
df["B"] = data2

df.head()

In [ ]:

# Let's plot the data first
plt.hist(df["A"], bins=50, label="A")
plt.ylabel("A counts")
plt.xlabel("Total Purchase")

# Exercise: Make the same plot for dataset "B"

In [ ]:

# Seaborn is also useful

sns.violinplot(data=df)
plt.ylabel("Total Purchase")
sns.plt.show()

Exercises

In the next cell, use pandas to make box plots from the data frame. Feel free to Google around if you can't remember how to do this.

Do you like the boxplots or violin plots better?

Are our datasets normal, or approximately so? Use what we learned in the previous lesson to decide. Hint: You need to compute a statistical value for each of the two data sets.

In [ ]:

# Work out the exercises here

Statistical Tests

There are a few good statistical tests for A/B testing:

Each test makes various assumptions:

ANOVA assumes normal distributions and equal variances in the two data sets
The Welch t-test assumes normal distributions but not necessarily equal variances, and accounts for small sample sizes better
The Mann-Whitney test assumes nothing about the distributions but requires at least 20 data points in each set, and produces a weaker p-value

Typically you need to choose the most appropriate test. Tests that make more assumptions are more discriminating (stronger p-values) but can be misleading on data sets that don't satisfy the assumptions.

Which test is most appropriate for our data?

In [ ]:

All of these tests are available in the scipy library, a stats library for python:

In [ ]:

# For example, to use ANOVA
statistic, pvalue = stats.f_oneway(df["A"], df["B"])
print pvalue

In [ ]:

# Use the Mann-Whitney test on our data
# look up the function in scipy from the link above

Interpret your results

Is there a significant difference in the mean total purchases in the two designs?
Which design do you recommend and why? Write two sentences explaining your results and your recommendation.

Your explanation here.

Final Exercise

Your results may or may not have been significant depending on the randomness of the data set. Try running the notebook a few more times to see if anything changes. You can do this in one step by using the Kernal menu and selecting "Restart & Run All".

If you can't get the opposite outcome, try altering the distributions of the data sets (change the variance in the second argument) until you get the other outcome.

In [ ]:

A/B Testing Hypothesis tests

Scenario

Exercises

Statistical Tests

Interpret your results

Final Exercise

Product

Resources

Company