HYPOTHESIS TESTING
1. Introduction
Data science is a rapidly developing field that draws people from different industries, with backgrounds in computer science, physics, statistics, and biology. Depending on their background, some practitioners are less familiar with statistical tools. Working with data carries risk: a misleading finding can lead to the wrong decision. It is therefore always a good idea to check your results for statistical significance, which is what hypothesis testing does.
Defining Hypothesis
A hypothesis is a statement to be tested. It expresses a belief with propositional content, that is, a claim that can turn out to be true or false, and statements are made on the basis of it.
Initially, the following assumptions are made:
Null Hypothesis (H0, read "H zero"): states that all things remain equal. No phenomenon is observed, or there is no relationship between the things you are comparing.
Alternative Hypothesis (H1, read "H one"): states the opposite of the null hypothesis: there was some change, or there is an observed relationship between the things you are comparing.
A/B testing (also known as split testing or bucket testing) is a case of hypothesis testing: a method of comparing two versions of a webpage or app against each other to determine which one performs better.
Yuva Limited (a fictional company) wants to find out which of its two website designs, old or new, users like more.
We will use hypothesis testing to understand the results of an A/B test that the company ran on its landing page. The test compares the old and new versions of the landing page, known as variations A and B, to see which performs better.
The objective is to work through this notebook to help the company understand if they should implement the new page, keep the old page, or perhaps run the experiment longer to make their decision.
So let's understand the result of the A/B test through a statistical analysis (hypothesis testing) and choose the better design.
2.1 Hypothesis Statements
These questions are the difficult parts associated with A/B tests in general.
For now, we need to make the decision based only on the data provided. If we want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should the null and alternative hypotheses be? We can state the hypotheses in words or in terms of p_old and p_new, which are the conversion rates for the old and new pages.
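Taking the wording of the question at face value (the old page is assumed to be at least as good until the data show otherwise), one way to write the two hypotheses is the following. The symbols are the conversion rates named in the question; the one-sided form is a reading of that wording, not the only possible formulation.

$$H_0: p_{new} - p_{old} \le 0$$

$$H_1: p_{new} - p_{old} > 0$$

The Type I error rate of 5% corresponds to a significance level of $\alpha = 0.05$: the null is rejected only if the p-value falls below 0.05.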
Importing Required Packages
Read in the dataset and take a look at the top few rows here:
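The import and loading steps might look like the sketch below. The file name of the experiment data is not given here, so the example reads a small in-memory CSV instead; the sample rows and the column names `group` and `landing_page` are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the experiment file. In the real notebook,
# pass the path of the actual CSV to pd.read_csv instead.
sample_csv = io.StringIO(
    "user_id,group,landing_page,converted\n"
    "101,control,old_page,0\n"
    "102,treatment,new_page,1\n"
    "103,treatment,new_page,0\n"
    "104,control,old_page,1\n"
)

df = pd.read_csv(sample_csv)

# Show the top few rows
print(df.head())
```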
The number of unique users in the dataset
The proportion of users converted
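A minimal sketch of these two summaries, using a toy frame rather than the experiment data. It assumes a `converted` column coded as 0/1, so its mean is the share of rows that converted.

```python
import pandas as pd

# Toy data for illustration only; user 103 appears twice on purpose
df = pd.DataFrame({
    "user_id": [101, 102, 103, 104, 103],
    "converted": [0, 1, 0, 1, 0],
})

# Number of distinct users
n_unique_users = df["user_id"].nunique()

# The mean of a 0/1 column is the proportion of rows that converted
prop_converted = df["converted"].mean()

print(n_unique_users, prop_converted)
```

Note that the proportion is computed over rows, so a repeated user_id counts once per row until duplicates are removed.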
The number of times the new_page and treatment don't line up
For the rows where treatment is not aligned with new_page or control is not aligned with old_page, we cannot be sure if this row truly received the new or old page.
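One way to count and drop the inconsistent rows is sketched below. It assumes hypothetical column names `group` (values `control`/`treatment`) and `landing_page` (values `old_page`/`new_page`), and that no other values occur in either column.

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({
    "user_id": [101, 102, 103, 104, 105],
    "group": ["control", "treatment", "treatment", "control", "treatment"],
    "landing_page": ["old_page", "new_page", "old_page", "new_page", "new_page"],
    "converted": [0, 1, 0, 1, 0],
})

in_treatment = df["group"] == "treatment"
saw_new_page = df["landing_page"] == "new_page"

# A row is inconsistent when exactly one of the two conditions holds
mismatch = in_treatment != saw_new_page
n_mismatch = int(mismatch.sum())

# Keep only the rows where group and page agree
df_clean = df[~mismatch].reset_index(drop=True)

print(n_mismatch, len(df_clean))
```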
Do any of the rows have missing values?
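A quick check for missing values could look like this. The toy frame deliberately contains gaps so that the check has something to report; it does not describe the experiment data.

```python
import numpy as np
import pandas as pd

# Toy data with two deliberate gaps
df = pd.DataFrame({
    "user_id": [101, 102, 103],
    "landing_page": ["old_page", None, "new_page"],
    "converted": [0, 1, np.nan],
})

# Count of missing entries in each column
missing_per_column = df.isna().sum()

# Single yes/no answer for the whole frame
has_missing = bool(df.isna().any().any())

print(missing_per_column)
print(has_missing)
```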
How many unique user_ids received the new landing page?
Check for a duplicated user_id among the new-page users and show its row information.
Remove one of the rows with a duplicate user_id
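The three steps above (counting unique new-page users, inspecting the duplicate, and removing it) can be sketched as follows. The data and the `group`/`landing_page` column names are hypothetical, and keeping the first occurrence is an arbitrary choice.

```python
import pandas as pd

# Toy data for illustration only; user 103 is duplicated
df = pd.DataFrame({
    "user_id": [101, 102, 103, 103, 104],
    "group": ["control", "treatment", "treatment", "treatment", "control"],
    "landing_page": ["old_page", "new_page", "new_page", "new_page", "old_page"],
    "converted": [0, 1, 0, 0, 1],
})

# Unique users who received the new page
new_page_rows = df[df["landing_page"] == "new_page"]
n_unique_new = new_page_rows["user_id"].nunique()

# keep=False flags every row that shares a user_id, so all copies are shown
duplicate_rows = df[df["user_id"].duplicated(keep=False)]

# Keep the first row for each user_id and drop the rest
df_dedup = df.drop_duplicates(subset="user_id", keep="first")

print(n_unique_new)
print(duplicate_rows)
```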
Probability
What is the probability of an individual converting regardless of the page they receive?
Given that an individual was in the treatment group, what is the probability they converted?
What is the probability that an individual received the new page?
Given that an individual was in the control group, what is the probability they converted?
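These four probabilities reduce to means of 0/1 columns, overall and per group. The sketch below uses toy data and the hypothetical `group` and `landing_page` column names, so its numbers are not the experiment's.

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({
    "group": ["control"] * 4 + ["treatment"] * 4,
    "landing_page": ["old_page"] * 4 + ["new_page"] * 4,
    "converted": [1, 0, 0, 0, 1, 1, 0, 0],
})

# P(converted), regardless of page
p_overall = df["converted"].mean()

# P(converted | group), one value per group
p_by_group = df.groupby("group")["converted"].mean()
p_control = p_by_group["control"]
p_treatment = p_by_group["treatment"]

# P(received the new page)
p_new_page = (df["landing_page"] == "new_page").mean()

print(p_overall, p_control, p_treatment, p_new_page)
```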
When we compare the conversion rate of the new page with that of the old page, we see that the conversion rate falls from 0.1204 to 0.1188.
Insights
The probability of an individual converting regardless of the page they receive is 11.96%.
Given that an individual was in the control group, the probability they converted is 12.04%
Given that an individual was in the treatment group, the probability they converted is 11.88%.
The probabilities of conversion in the control and treatment groups are very similar to each other and to the probability of an individual converting regardless of the page they receive.
We now consider a null hypothesis under which there is no difference in conversion between the pages, which means the conversion rate is the same for each page.
a. What is the conversion rate p_new (new page) under the null?
b. What is the conversion rate p_old (old page) under the null?
c. What is n_new (unique users on the new page)?
d. What is n_old (unique users on the old page)?
e. Simulate n_new transactions with a conversion rate of p_new under the null. Store these 1's and 0's in new_page_converted.
f. Simulate n_old transactions with a conversion rate of p_old under the null. Store these 1's and 0's in old_page_converted.
g. Find p_new - p_old for your simulated values from above.
h. Simulate 10,000 p_new - p_old values using the same process you followed in parts a through g above. Store all 10,000 values in p_diffs.
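The steps above can be sketched with NumPy as follows. The null conversion rate uses the overall rate of 11.96% reported earlier; the group sizes and the random seed are placeholders, since the actual counts are not stated here.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for repeatability only

# Parts a-b: under the null both pages share one conversion rate
p_null = 0.1196

# Parts c-d: hypothetical group sizes, not the experiment's counts
n_new = 1000
n_old = 1000

# Parts e-f: one simulated 0/1 outcome per user
new_page_converted = rng.binomial(1, p_null, size=n_new)
old_page_converted = rng.binomial(1, p_null, size=n_old)

# Part g: difference in simulated conversion rates
sim_diff = new_page_converted.mean() - old_page_converted.mean()

# Part h: 10,000 simulated differences. Drawing the number of conversions
# directly and dividing by the group size gives the same distribution as
# averaging individual 0/1 draws, without a Python loop.
p_diffs = (rng.binomial(n_new, p_null, size=10_000) / n_new
           - rng.binomial(n_old, p_null, size=10_000) / n_old)

print(sim_diff, p_diffs.mean())
```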
Plot a histogram of the p_diffs. Does this plot look like what you expected? Use the matching problem in the classroom to ensure you fully understand what was computed here.
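As a rough sketch of what the histogram summarizes and how a p-value follows from it, the example below bins simulated differences with `np.histogram` (the counts can be handed to any plotting library) and then measures how often a simulated difference exceeds an observed one. The group sizes and `obs_diff` are hypothetical placeholders, so the printed p-value is not the experiment's result.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Null conversion rate from the text; group sizes are hypothetical
p_null, n_new, n_old = 0.1196, 1000, 1000

# Simulated differences in conversion rate under the null
p_diffs = (rng.binomial(n_new, p_null, size=10_000) / n_new
           - rng.binomial(n_old, p_null, size=10_000) / n_old)

# Histogram data: the distribution should be roughly bell-shaped around 0
counts, bin_edges = np.histogram(p_diffs, bins=20)

# Hypothetical observed difference (new minus old), for illustration only
obs_diff = 0.01

# One-sided p-value: share of simulated differences above the observed one
p_value = (p_diffs > obs_diff).mean()

print(counts)
print(p_value)
```

A p-value above 0.05 here would mean the observed difference is not unusual under the null at the 5% level.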
Let n_old and n_new refer to the number of rows associated with the old and new pages, respectively.
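To make the z-test concrete, here is a hand-rolled pooled two-proportion z-test using only the standard library. It is a sketch of the textbook formula, not the built-in routine used in the notebook; the function name and the conversion counts passed to it are hypothetical.

```python
import math


def two_proportion_ztest_larger(conv_new, n_new, conv_old, n_old):
    """One-sided z-test of H1: p_new > p_old using the pooled proportion."""
    p_new = conv_new / n_new
    p_old = conv_old / n_old

    # Under the null both groups share one rate, estimated by pooling
    p_pool = (conv_new + conv_old) / (n_new + n_old)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_new + 1 / n_old))

    z = (p_new - p_old) / se

    # Upper-tail probability of a standard normal: P(Z > z)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: identical conversion rates in both groups
z_equal, p_equal = two_proportion_ztest_larger(120, 1000, 120, 1000)

# Hypothetical counts: the new page converts noticeably more often
z_better, p_better = two_proportion_ztest_larger(150, 1000, 120, 1000)

print(z_equal, p_equal)
print(z_better, p_better)
```

With identical rates the statistic is 0 and the one-sided p-value is 0.5, which is a convenient sanity check for the direction of the test.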
The p-value of 0.09 is greater than the Type I error rate of 5%, so we fail to reject the null hypothesis in favor of the alternative.
Earlier we chose as our alternative hypothesis that the new page drives more conversions.
Our evaluation of the p-value then suggested that we cannot reject the null hypothesis.
We then used a built-in z-test to check statistical significance. The p-value from the z-test also suggested that there is not enough evidence to reject the null.
Inference: We cannot reject the null hypothesis.