GitHub Repository: ycchen00/Introduction-to-Data-Science-in-Python
Path: blob/main/resources/week-4/BasicStatisticalTesting.ipynb
Kernel: Python 3

In this lecture we're going to review some of the basics of statistical testing in Python. We're going to talk about hypothesis testing, statistical significance, and using SciPy to run Student's t-tests.

# We use statistics in a lot of different ways in data science, and in this lecture, I want to refresh your
# knowledge of hypothesis testing, which is a core data analysis activity behind experimentation. The goal of
# hypothesis testing is to determine if, for instance, the two different conditions we have in an experiment
# have resulted in different impacts.

# Let's import our usual numpy and pandas libraries
import numpy as np
import pandas as pd

# Now let's bring in some new libraries from scipy
from scipy import stats
# Now, scipy is an interesting collection of libraries for data science and you'll use most or perhaps all of
# these libraries. The broader SciPy ecosystem includes numpy and pandas, but also plotting libraries such as
# matplotlib, and a number of scientific library functions as well
# When we do hypothesis testing, we actually have two statements of interest: the first is our actual
# explanation, which we call the alternative hypothesis, and the second is that the explanation we have is not
# sufficient, and we call this the null hypothesis. Our actual testing method is to determine whether the null
# hypothesis is true or not. If we find that there is a difference between groups, then we can reject the null
# hypothesis and accept our alternative.

# Let's see an example of this; we're going to use some grade data
df=pd.read_csv('datasets/grades.csv')
df.head()
# If we take a look at the data frame inside, we see we have six different assignments. Let's look at some
# summary statistics for this DataFrame
print("There are {} rows and {} columns".format(df.shape[0], df.shape[1]))
There are 2315 rows and 13 columns
# For the purpose of this lecture, let's segment this population into two pieces. Let's say those who finish
# the first assignment by the end of December 2015, we'll call them early finishers, and those who finish it
# sometime after that, we'll call them late finishers.
early_finishers=df[pd.to_datetime(df['assignment1_submission']) < '2016']
early_finishers.head()
# So, you have lots of skills now with pandas; how would you go about getting the late_finishers dataframe?
# Why don't you pause the video and give it a try.
# Here's my solution. First, the dataframe df and the early_finishers share index values, so I really just
# want everything in the df which is not in early_finishers
late_finishers=df[~df.index.isin(early_finishers.index)]
late_finishers.head()
# There are lots of other ways to do this. For instance, you could just copy and paste the first projection
# and change the sign from less than to greater than or equal to. This is ok, but if you decide you want to
# change the date down the road you have to remember to change it in two places. You could also do a merge of
# the dataframe df with early_finishers: a left merge keeps all of the items in the left dataframe, and
# dropping the rows that also matched the right one (an anti-join) gives the late finishers, so this would
# have been a good answer. You also could have written a function that determines if someone is early or
# late, and then called .apply() on the dataframe and added a new column to the dataframe. This is a pretty
# reasonable answer as well. A couple of these alternatives are sketched below.
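For reference, here is a minimal sketch of two of the alternatives just described: the reversed comparison and the .apply() approach. The names late_finishers_alt, is_early, and early_mask are my own and are not part of the original lecture.

# Alternative 1: repeat the projection with the comparison reversed
# (remember the caveat above: the date now lives in two places)
late_finishers_alt=df[pd.to_datetime(df['assignment1_submission']) >= '2016']

# Alternative 2: write a function that labels a row as early, and use .apply()
def is_early(row):
    return pd.to_datetime(row['assignment1_submission']) < pd.Timestamp('2016')

early_mask=df.apply(is_early, axis=1)
late_finishers_alt2=df[~early_mask]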
# As you've seen, the pandas data frame object has a variety of statistical functions associated with it. If
# we call the mean function directly on the data frame, we see that each of the means for the assignments is
# calculated. Let's compare the means for our two populations
print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())
74.94728457024303
74.0450648477065
# Ok, these look pretty similar. But, are they the same? What do we mean by similar? This is where the
# Student's t-test comes in. It allows us to form the alternative hypothesis ("These are different") as well
# as the null hypothesis ("These are the same") and then test that null hypothesis.

# When doing hypothesis testing, we have to choose a significance level as a threshold for how much of a
# chance we're willing to accept. This significance level is typically called alpha. For this example, let's
# use a threshold of 0.05 for our alpha, or 5%. Now this is a commonly used number but it's really quite
# arbitrary.

# The SciPy library contains a number of different statistical tests and forms a basis for hypothesis testing
# in Python, and we're going to use the ttest_ind() function which does an independent t-test (meaning the
# populations are not related to one another). The result of ttest_ind() is the t-statistic and a p-value.
# It's this latter value, the probability, which is most important to us, as it indicates the chance (between
# 0 and 1) of seeing a difference at least this large if the null hypothesis were true.

# Let's bring in our ttest_ind function
from scipy.stats import ttest_ind

# Let's run this function with our two populations, looking at the assignment 1 grades
ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])
Ttest_indResult(statistic=1.322354085372139, pvalue=0.1861810110171455)
# So here we see that the probability is 0.18, and this is above our alpha value of 0.05. This means that we
# cannot reject the null hypothesis. The null hypothesis was that the two populations are the same, and we
# don't have enough certainty in our evidence (because the p-value is greater than alpha) to come to a
# conclusion to the contrary. This doesn't mean that we have proven the populations are the same.
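As a small illustration (my own sketch, not a cell from the original lecture), the decision rule described above can be written out explicitly by unpacking the statistic and p-value and comparing against alpha:

# A sketch of the decision rule: compare the p-value against our chosen alpha
alpha=0.05
statistic, pvalue = ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])
if pvalue <= alpha:
    print("Reject the null hypothesis: the two groups appear to differ")
else:
    print("Fail to reject the null hypothesis: not enough evidence of a difference")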
# Why don't we check the other assignment grades?
print(ttest_ind(early_finishers['assignment2_grade'], late_finishers['assignment2_grade']))
print(ttest_ind(early_finishers['assignment3_grade'], late_finishers['assignment3_grade']))
print(ttest_ind(early_finishers['assignment4_grade'], late_finishers['assignment4_grade']))
print(ttest_ind(early_finishers['assignment5_grade'], late_finishers['assignment5_grade']))
print(ttest_ind(early_finishers['assignment6_grade'], late_finishers['assignment6_grade']))
Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)
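The repeated print() calls above could also be written as a loop over the assignment grade columns. This is just an equivalent sketch, not the code used in the lecture:

# An equivalent loop over the six assignment grade columns
# (including assignment 1, which we already tested above)
for i in range(1,7):
    col='assignment{}_grade'.format(i)
    print(col, ttest_ind(early_finishers[col], late_finishers[col]))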
# Ok, so it looks like in this data we do not have enough evidence to suggest the populations differ with
# respect to grade. Let's take a look at those p-values for a moment though, because they are saying things
# that can inform experimental design down the road. For instance, one of the assignments, assignment 3, has a
# p-value around 0.1. This means that if we had accepted a level of chance similarity of 11%, this would have
# been considered statistically significant. As a researcher, this would suggest to me that there is something
# here worth following up on. For instance, if we had a small number of participants (we don't) or if there
# was something unique about this assignment as it relates to our experiment (whatever it was), then there may
# be followup experiments we could run.
# P-values have come under fire recently for being insufficient for telling us enough about the interactions
# which are happening, and two other techniques, confidence intervals and Bayesian analyses, are being used
# more regularly. One issue with p-values is that as you run more tests you are likely to get a value which
# is statistically significant just by chance.

# Let's see a simulation of this. First, let's create a data frame of 100 columns, each with 100 numbers
df1=pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()
# Pause this and reflect -- do you understand the list comprehension and how I created this DataFrame? You
# don't have to use a list comprehension to do this, but you should be able to read this and figure out how it
# works, as this is a commonly used approach on web forums. An equivalent version without a comprehension is
# sketched below.
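If the comprehension is unfamiliar, here is an equivalent version without it, plus a purely numpy version; the variable names df1_alt and df1_alt2 are my own and are only for illustration.

# The same kind of DataFrame built with a plain for loop instead of a list comprehension
rows=[]
for x in range(100):
    rows.append(np.random.random(100))
df1_alt=pd.DataFrame(rows)

# Or let numpy generate the whole 100x100 matrix in one call
df1_alt2=pd.DataFrame(np.random.random((100,100)))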
# Ok, let's create a second dataframe
df2=pd.DataFrame([np.random.random(100) for x in range(100)])
# Are these two DataFrames the same? Maybe a better question is, for a given row inside of df1, is it the same
# as the row inside df2?

# Let's take a look. Let's say our critical value is 0.1, or an alpha of 10%. And we're going to compare each
# column in df1 to the same numbered column in df2. And we'll report when the p-value is less than or equal to
# 10%, which means that we have sufficient evidence to say that the columns are different.

# Let's write this in a function called test_columns
def test_columns(alpha=0.1):
    # I want to keep track of how many differ
    num_diff=0
    # And now we can just iterate over the columns
    for col in df1.columns:
        # we can run our ttest_ind between the two dataframes
        teststat,pval=ttest_ind(df1[col],df2[col])
        # and we check the pvalue versus the alpha
        if pval<=alpha:
            # And now we'll just print out if they are different and increment the num_diff
            print("Col {} is statistically significantly different at alpha={}, pval={}".format(col,alpha,pval))
            num_diff=num_diff+1
    # and let's print out some summary stats
    print("Total number different was {}, which is {}%".format(num_diff,float(num_diff)/len(df1.columns)*100))

# And now let's actually run this
test_columns()
Col 4 is statistically significantly different at alpha=0.1, pval=0.03171904508748956
Col 9 is statistically significantly different at alpha=0.1, pval=0.05535264960034999
Col 13 is statistically significantly different at alpha=0.1, pval=0.06411026546897207
Col 25 is statistically significantly different at alpha=0.1, pval=0.06474837389971169
Col 46 is statistically significantly different at alpha=0.1, pval=0.0766378775191977
Col 73 is statistically significantly different at alpha=0.1, pval=0.021821747279110435
Col 76 is statistically significantly different at alpha=0.1, pval=0.04067259983835964
Col 85 is statistically significantly different at alpha=0.1, pval=0.0029530262143926777
Col 86 is statistically significantly different at alpha=0.1, pval=0.010704932840894129
Total number different was 9, which is 9.0%
# Interesting, so we see that there are a bunch of columns that are different! In fact, that number looks a
# lot like the alpha value we chose. So what's going on - shouldn't all of the columns be the same? Remember
# that all the ttest does is check if two sets are similar given some level of confidence, in our case, 10%.
# The more random comparisons you do, the more will just happen to appear different purely by chance. In this
# example, we checked 100 columns, so we would expect roughly 10 of them to show up as different if our alpha
# was 0.1.

# We can test some other alpha values as well
test_columns(0.05)
Col 4 is statistically significantly different at alpha=0.05, pval=0.03171904508748956
Col 73 is statistically significantly different at alpha=0.05, pval=0.021821747279110435
Col 76 is statistically significantly different at alpha=0.05, pval=0.04067259983835964
Col 85 is statistically significantly different at alpha=0.05, pval=0.0029530262143926777
Col 86 is statistically significantly different at alpha=0.05, pval=0.010704932840894129
Total number different was 5, which is 5.0%
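One common, if conservative, way to guard against this multiple-comparisons effect is a Bonferroni-style correction, which divides alpha by the number of tests being run. The sketch below is my own addition, not part of the lecture, and test_columns_bonferroni is an illustrative name:

# A sketch of a Bonferroni correction: divide alpha by the number of tests run
def test_columns_bonferroni(alpha=0.1):
    corrected_alpha=alpha/len(df1.columns)  # 0.1/100 = 0.001 here
    num_diff=0
    for col in df1.columns:
        teststat,pval=ttest_ind(df1[col],df2[col])
        if pval<=corrected_alpha:
            num_diff=num_diff+1
    print("Total number different was {} at corrected alpha={}".format(num_diff,corrected_alpha))

test_columns_bonferroni()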
# So, keep this in mind when you are doing statistical tests like the t-test which has a p-value. Understand
# that this p-value isn't magic, that it's a threshold for you when reporting results and trying to answer
# your hypothesis. What's a reasonable threshold? It depends on your question, and you need to engage domain
# experts to better understand what they would consider significant.

# Just for fun, let's recreate that second dataframe using a non-normal distribution; I'll arbitrarily choose
# chi squared
df2=pd.DataFrame([np.random.chisquare(df=1,size=100) for x in range(100)])
test_columns()
Col 0 is statistically significantly different at alpha=0.1, pval=0.0006359131132062404
Col 1 is statistically significantly different at alpha=0.1, pval=1.6254596067473314e-05
Col 2 is statistically significantly different at alpha=0.1, pval=0.032780760709076824
Col 3 is statistically significantly different at alpha=0.1, pval=0.0032178974546995965
Col 4 is statistically significantly different at alpha=0.1, pval=0.0433143837061582
Col 5 is statistically significantly different at alpha=0.1, pval=7.931565914508593e-05
Col 6 is statistically significantly different at alpha=0.1, pval=0.0017013488226864455
Col 7 is statistically significantly different at alpha=0.1, pval=0.00035664966978204754
Col 8 is statistically significantly different at alpha=0.1, pval=0.0008750262277182237
Col 9 is statistically significantly different at alpha=0.1, pval=3.9005870973813224e-05
Col 10 is statistically significantly different at alpha=0.1, pval=0.0001340016519264235
Col 11 is statistically significantly different at alpha=0.1, pval=2.9387732963990273e-06
Col 12 is statistically significantly different at alpha=0.1, pval=0.0002593817469470156
Col 13 is statistically significantly different at alpha=0.1, pval=0.0001639697026470583
Col 14 is statistically significantly different at alpha=0.1, pval=0.00015526373271627754
Col 15 is statistically significantly different at alpha=0.1, pval=4.1837854872688554e-05
Col 16 is statistically significantly different at alpha=0.1, pval=0.00139904683631816
Col 17 is statistically significantly different at alpha=0.1, pval=0.008340935008253801
Col 18 is statistically significantly different at alpha=0.1, pval=0.0005255784024108116
Col 19 is statistically significantly different at alpha=0.1, pval=0.0010508383648962462
Col 20 is statistically significantly different at alpha=0.1, pval=9.250384193688497e-05
Col 21 is statistically significantly different at alpha=0.1, pval=0.0004139078346825869
Col 22 is statistically significantly different at alpha=0.1, pval=0.000239886090894122
Col 23 is statistically significantly different at alpha=0.1, pval=0.005114973931480402
Col 24 is statistically significantly different at alpha=0.1, pval=3.9612287309026586e-06
Col 25 is statistically significantly different at alpha=0.1, pval=0.001183662110645545
Col 26 is statistically significantly different at alpha=0.1, pval=0.0013250467524673893
Col 27 is statistically significantly different at alpha=0.1, pval=0.0071413662391291675
Col 28 is statistically significantly different at alpha=0.1, pval=0.0018050525585905151
Col 29 is statistically significantly different at alpha=0.1, pval=1.8443168650897388e-07
Col 30 is statistically significantly different at alpha=0.1, pval=0.00018610780116814143
Col 31 is statistically significantly different at alpha=0.1, pval=0.003288231435508567
Col 32 is statistically significantly different at alpha=0.1, pval=0.00022788735453261414
Col 33 is statistically significantly different at alpha=0.1, pval=0.0005420802253194416
Col 34 is statistically significantly different at alpha=0.1, pval=0.00043337835781774326
Col 35 is statistically significantly different at alpha=0.1, pval=0.0009296626942443332
Col 36 is statistically significantly different at alpha=0.1, pval=0.015486530171592511
Col 37 is statistically significantly different at alpha=0.1, pval=0.08968808146980235
Col 38 is statistically significantly different at alpha=0.1, pval=0.0003984738782135047
Col 39 is statistically significantly different at alpha=0.1, pval=0.00029812833313821313
Col 40 is statistically significantly different at alpha=0.1, pval=0.00020580570179961344
Col 41 is statistically significantly different at alpha=0.1, pval=0.016524830474479658
Col 42 is statistically significantly different at alpha=0.1, pval=0.004414785565347055
Col 43 is statistically significantly different at alpha=0.1, pval=0.0006514221896553882
Col 44 is statistically significantly different at alpha=0.1, pval=0.0009246080966774699
Col 45 is statistically significantly different at alpha=0.1, pval=3.947719385039122e-05
Col 46 is statistically significantly different at alpha=0.1, pval=0.003936803652126271
Col 47 is statistically significantly different at alpha=0.1, pval=0.00021471514569468722
Col 48 is statistically significantly different at alpha=0.1, pval=0.00021533428873488688
Col 49 is statistically significantly different at alpha=0.1, pval=0.08198381088074698
Col 50 is statistically significantly different at alpha=0.1, pval=0.0005833899282268582
Col 51 is statistically significantly different at alpha=0.1, pval=0.0008976227172661998
Col 52 is statistically significantly different at alpha=0.1, pval=0.002183481627605226
Col 53 is statistically significantly different at alpha=0.1, pval=0.00011517331466172489
Col 54 is statistically significantly different at alpha=0.1, pval=0.0004539298416324909
Col 55 is statistically significantly different at alpha=0.1, pval=0.0005972354479418133
Col 56 is statistically significantly different at alpha=0.1, pval=0.00039537371324602574
Col 57 is statistically significantly different at alpha=0.1, pval=1.5992684064779836e-05
Col 58 is statistically significantly different at alpha=0.1, pval=0.0035581119921681737
Col 59 is statistically significantly different at alpha=0.1, pval=0.0009069465410284678
Col 60 is statistically significantly different at alpha=0.1, pval=0.001490617627902622
Col 61 is statistically significantly different at alpha=0.1, pval=9.982665694645061e-05
Col 62 is statistically significantly different at alpha=0.1, pval=0.0001855201502830867
Col 63 is statistically significantly different at alpha=0.1, pval=0.0717892199552874
Col 64 is statistically significantly different at alpha=0.1, pval=0.00011111316914952978
Col 65 is statistically significantly different at alpha=0.1, pval=0.0007326247241706332
Col 66 is statistically significantly different at alpha=0.1, pval=0.0006985304647035251
Col 67 is statistically significantly different at alpha=0.1, pval=4.043055942025079e-05
Col 68 is statistically significantly different at alpha=0.1, pval=0.0030620341209672865
Col 69 is statistically significantly different at alpha=0.1, pval=0.0002566485047083988
Col 70 is statistically significantly different at alpha=0.1, pval=1.5265122338601147e-05
Col 71 is statistically significantly different at alpha=0.1, pval=0.0025159633542264875
Col 72 is statistically significantly different at alpha=0.1, pval=0.006683482780335058
Col 73 is statistically significantly different at alpha=0.1, pval=0.0004996694763506213
Col 74 is statistically significantly different at alpha=0.1, pval=0.00024394144210986108
Col 75 is statistically significantly different at alpha=0.1, pval=7.077435724052419e-05
Col 76 is statistically significantly different at alpha=0.1, pval=0.0074093508129584405
Col 77 is statistically significantly different at alpha=0.1, pval=0.006900999886679228
Col 78 is statistically significantly different at alpha=0.1, pval=0.0010484064032203217
Col 79 is statistically significantly different at alpha=0.1, pval=0.0023846343634806146
Col 80 is statistically significantly different at alpha=0.1, pval=8.180792170754975e-05
Col 81 is statistically significantly different at alpha=0.1, pval=5.49471394287242e-05
Col 82 is statistically significantly different at alpha=0.1, pval=0.00033420295079797965
Col 83 is statistically significantly different at alpha=0.1, pval=0.003652590090290096
Col 84 is statistically significantly different at alpha=0.1, pval=0.0005087442921717129
Col 85 is statistically significantly different at alpha=0.1, pval=0.010877718451872447
Col 86 is statistically significantly different at alpha=0.1, pval=0.006487775025921745
Col 87 is statistically significantly different at alpha=0.1, pval=0.00012983278069724697
Col 88 is statistically significantly different at alpha=0.1, pval=0.0003217680289398193
Col 89 is statistically significantly different at alpha=0.1, pval=0.0007595005579977705
Col 90 is statistically significantly different at alpha=0.1, pval=0.0009007314132232909
Col 91 is statistically significantly different at alpha=0.1, pval=0.0041773223441325755
Col 92 is statistically significantly different at alpha=0.1, pval=0.00013412252503388242
Col 93 is statistically significantly different at alpha=0.1, pval=0.0038256024065565237
Col 94 is statistically significantly different at alpha=0.1, pval=0.0025884060593381543
Col 95 is statistically significantly different at alpha=0.1, pval=0.0008645653623897333
Col 96 is statistically significantly different at alpha=0.1, pval=0.00033563002443083584
Col 97 is statistically significantly different at alpha=0.1, pval=9.000223928968477e-06
Col 98 is statistically significantly different at alpha=0.1, pval=0.007777985999968882
Col 99 is statistically significantly different at alpha=0.1, pval=0.00024422963411248024
Total number different was 100, which is 100.0%
# Now we see that all or most columns test to be statistically significant at the 10% level.
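As an aside the lecture doesn't cover in detail: the t-test assumes the data are roughly normal with similar variances, which the chi-squared samples above are not. Two commonly used alternatives are Welch's t-test (ttest_ind with equal_var=False) and the non-parametric Mann-Whitney U test; the sketch below just demonstrates them on the first columns of df1 and df2.

# Welch's t-test drops the equal-variance assumption
print(ttest_ind(df1[0], df2[0], equal_var=False))
# The Mann-Whitney U test is a non-parametric alternative that does not assume normality
print(stats.mannwhitneyu(df1[0], df2[0], alternative='two-sided'))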

In this lecture, we've discussed just some of the basics of hypothesis testing in Python. I introduced you to the SciPy library, which you can use for the Student's t-test. We've discussed some of the practical issues which arise from looking for statistical significance. There's much more to learn about hypothesis testing; for instance, there are different tests to use depending on the shape of your data, and different ways to report results instead of just p-values, such as confidence intervals or Bayesian analyses. But this should give you a basic idea of where to start when comparing two populations for differences, which is a common task for data scientists.