GitHub Repository: ethen8181/machine-learning
Path: blob/master/ab_tests/causal_inference/diff_in_diff.ipynb
Kernel: Python 3 (ipykernel)
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))

from formats import load_style
load_style(plot_style=False)
os.chdir(path)

# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

%watermark -a 'Ethen' -d -u -v -iv
Author: Ethen
Last updated: 2022-06-05
Python implementation: CPython
Python version : 3.7.11
IPython version : 7.27.0
matplotlib : 3.4.3
pandas     : 1.3.5
statsmodels: 0.13.1
numpy      : 1.21.6

Difference in Difference

In the advertising and marketing domain, advertisers are keenly interested in how ad campaigns or marketing efforts affect product sales. A baseline approach for this type of analysis is a pre/post comparison: we take a time window before launching the treatment (the pre-period) and a time window after launching the treatment (the post-period), compute our metric of interest in each window, and take the difference. One drawback of this baseline approach is that it cannot tease out the effect of our treatment from other potential factors, such as marketplace trends. In other words, how do we know whether an increase or decrease in product sales is due to our ad/marketing campaign, rather than some seasonal trend influencing sales?

Difference in difference, or diff in diff for short, computes the average treatment effect on the treated by comparing the treatment group's difference to the control group's difference, where each difference is taken between our two time windows.

\begin{align}
(E[Y(1)|T=1] - E[Y(1)|T=0]) - (E[Y(0)|T=1] - E[Y(0)|T=0])
\end{align}

Where $T$ denotes the treatment indicator and $Y(W)$ denotes the outcome at time window $W$.

Using the example from Notebook: Python Causality Handbook - Difference in Difference, say we wish to estimate whether billboard marketing increases deposits into our savings accounts. In our dataset, deposits is our outcome variable of interest, poa is a dummy indicator for the city of Porto Alegre (zero for Florianopolis), and jul is a dummy indicator for the month of July. We launched a billboard marketing campaign in the city of Porto Alegre, so poa is our treatment indicator and jul is the pre/post time period indicator.

data = pd.read_csv("data/billboard_impact.csv")
print(data.shape)
data.head()
(4600, 3)

We can calculate difference in difference by slicing our dataset into 4 segments and computing their mean differences. The setup here is also referred to as a 2x2 difference in difference, meaning there are two comparison groups and our treatment occurs at a single point in time, splitting time into a pre and post period.

treatment_pre = data.loc[(data['poa'] == 1) & (data['jul'] == 0), 'deposits'].mean()
treatment_post = data.loc[(data['poa'] == 1) & (data['jul'] == 1), 'deposits'].mean()
control_pre = data.loc[(data['poa'] == 0) & (data['jul'] == 0), 'deposits'].mean()
control_post = data.loc[(data['poa'] == 0) & (data['jul'] == 1), 'deposits'].mean()
(treatment_post - control_post) - (treatment_pre - control_pre)
6.524557692307695

This result tells us that we should expect deposits to increase by $6.52 per customer as a result of the marketing campaign.
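Another way to read this number: the control group's pre-to-post change stands in for what the treated group would have experienced without the campaign, and the estimate is the treated group's actual post-period mean minus that counterfactual. A minimal sketch of this reading, using hypothetical group means (the numbers below are illustrative, not from the dataset):

```python
def diff_in_diff(treatment_pre, treatment_post, control_pre, control_post):
    """2x2 diff in diff: treated outcome minus its parallel-trend counterfactual."""
    # counterfactual: what the treated group's post-period mean would have been
    # had it followed the control group's trend
    counterfactual_post = treatment_pre + (control_post - control_pre)
    return treatment_post - counterfactual_post

# illustrative numbers, not from billboard_impact.csv
effect = diff_in_diff(treatment_pre=50.0, treatment_post=60.0,
                      control_pre=40.0, control_post=44.0)
print(effect)  # 6.0 = 60 - (50 + 4)
```

This is algebraically identical to the segment-based calculation above, just grouped to make the counterfactual explicit.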

Linear Regression

We can re-frame this difference in difference estimator as a linear regression with an interaction term.

\begin{align}
Y_i = \beta_0 + \beta_1 T_i + \beta_2 W_i + \beta_3 T_i * W_i + e_i
\end{align}

Where $\beta_1$ is the increment we get from going from control to treatment, $\beta_2$ is the increment from going from the pre to the post period, and $\beta_3$ is the increment for both effects combined, i.e. it is the difference in difference estimator. All 4 scenarios are listed below:

  • Pre, Control. $T_i = 0$, $W_i = 0$, $T_i * W_i = 0$

  • Pre, Treatment. $T_i = 1$, $W_i = 0$, $T_i * W_i = 0$

  • Post, Control. $T_i = 0$, $W_i = 1$, $T_i * W_i = 0$

  • Post, Treatment. $T_i = 1$, $W_i = 1$, $T_i * W_i = 1$
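Plugging these four scenarios into the regression gives the four cell means, and taking the double difference makes $\beta_3$ fall out as the diff in diff estimator:

\begin{align}
E[Y|T=0, W=0] &= \beta_0 \\
E[Y|T=1, W=0] &= \beta_0 + \beta_1 \\
E[Y|T=0, W=1] &= \beta_0 + \beta_2 \\
E[Y|T=1, W=1] &= \beta_0 + \beta_1 + \beta_2 + \beta_3 \\
(E[Y|T=1, W=1] - E[Y|T=1, W=0]) - (E[Y|T=0, W=1] &- E[Y|T=0, W=0]) = (\beta_2 + \beta_3) - \beta_2 = \beta_3
\end{align}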

smf.ols('deposits ~ poa * jul', data=data).fit().summary().tables[1]

Our result, the coefficient in the poa:jul row, should match the hand-rolled calculation from the previous section.
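To convince ourselves that the interaction coefficient is numerically identical to the segment-based calculation, we can check it on synthetic data with the same 2x2 structure (the data-generating process below is made up purely for illustration; the actual analysis uses billboard_impact.csv):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic 2x2 data with a true interaction (diff in diff) effect of 6
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    'poa': rng.integers(0, 2, size=n),
    'jul': rng.integers(0, 2, size=n),
})
df['deposits'] = (
    40 + 2 * df['poa'] + 5 * df['jul'] + 6 * df['poa'] * df['jul']
    + rng.normal(0, 1, size=n)
)

# regression-based estimate: the poa:jul interaction coefficient
ols_estimate = smf.ols('deposits ~ poa * jul', data=df).fit().params['poa:jul']

# hand-rolled 2x2 estimate from the four cell means
cell = lambda p, j: df.loc[(df['poa'] == p) & (df['jul'] == j), 'deposits'].mean()
manual_estimate = (cell(1, 1) - cell(0, 1)) - (cell(1, 0) - cell(0, 0))

# the saturated OLS model reproduces the cell means, so the two agree
assert np.isclose(ols_estimate, manual_estimate)
```

The agreement is not a coincidence: with both dummies and their interaction, the regression is saturated, so its fitted values are exactly the four cell means.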

Pros: difference in difference has an intuitive interpretation. By converting the problem into a linear regression, we also get: 1) additional statistics such as p-values and confidence intervals reported for free; 2) no need to segment the data ourselves to perform the calculation. This framing also makes it straightforward to extend the vanilla framework to include other potential confounders lurking in our observational data: it is just a matter of adding another feature/field to our linear regression model.

Caveats: the vanilla version of this method requires the parallel trend assumption to hold. Meaning, in the absence of treatment, the difference between our treatment and control groups should stay the same across the two time periods. If this assumption holds, we can use difference in difference to tease out the treatment effect. If it doesn't, the method will be biased. Violations are very common when treatment allocation is determined by the baseline outcome, e.g. we decided to run our marketing campaign in one group precisely because it wasn't performing well in the first place.
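With only two periods we cannot test this assumption directly, but when a longer pre-treatment history is available, a common sanity check is to compare pre-period trends between the two groups. A sketch with hypothetical monthly series (the arrays below are illustrative, not from the dataset):

```python
import numpy as np

# hypothetical pre-campaign monthly deposit averages (illustrative values)
months = np.arange(6)  # six months before the July campaign
control_deposits = 40.0 + 1.5 * months    # control city trend
treatment_deposits = 45.0 + 1.5 * months  # treated city trend (parallel here)

# fit a linear trend to each group's pre-period series
control_slope = np.polyfit(months, control_deposits, deg=1)[0]
treatment_slope = np.polyfit(months, treatment_deposits, deg=1)[0]

# similar slopes support (but do not prove) the parallel trend assumption
print(treatment_slope - control_slope)
```

A level gap between the two groups is fine, since diff in diff differences it away; what matters is that the slopes move together before treatment.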

The quickest way to check this parallel trend assumption is a visual inspection.

plt.figure(figsize=(8, 6))
plt.plot(["May", "Jul"], [control_pre, control_post], label="control", lw=2)
plt.plot(["May", "Jul"], [treatment_pre, treatment_post], label="treatment", lw=2)
plt.legend()
plt.show()
[Plot: mean deposits for the control and treatment groups in May (pre) and July (post).]

Reference

  • Notebook: Python Causality Handbook - Difference in Difference