GitHub Repository: ycchen00/Introduction-to-Data-Science-in-Python
Path: blob/main/resources/week-4/BasicStatisticalTesting.ipynb
Kernel: Python 3

In this lecture we're going to review some of the basics of statistical testing in Python. We're going to talk about hypothesis testing, statistical significance, and using SciPy to run Student's t-tests.

# We use statistics in a lot of different ways in data science, and in this lecture, I want to refresh your
# knowledge of hypothesis testing, which is a core data analysis activity behind experimentation. The goal of
# hypothesis testing is to determine if, for instance, the two different conditions we have in an experiment
# have resulted in different impacts.

# Let's import our usual numpy and pandas libraries
import numpy as np
import pandas as pd

# Now let's bring in some new libraries from scipy
from scipy import stats
# Now, scipy is an interesting collection of libraries for data science and you'll use most or perhaps all of
# these libraries. The broader SciPy ecosystem includes numpy and pandas, but also plotting libraries such as
# matplotlib, and a number of scientific library functions as well
# When we do hypothesis testing, we actually have two statements of interest: the first is our actual
# explanation, which we call the alternative hypothesis, and the second is that the explanation we have is not
# sufficient, and we call this the null hypothesis. Our actual testing method is to determine whether the null
# hypothesis is true or not. If we find that there is a difference between groups, then we can reject the null
# hypothesis and accept our alternative.

# Let's see an example of this; we're going to use some grade data
df=pd.read_csv('datasets/grades.csv')
df.head()
# If we take a look at the data frame inside, we see we have six different assignments. Let's look at some
# summary statistics for this DataFrame
print("There are {} rows and {} columns".format(df.shape[0], df.shape[1]))
There are 2315 rows and 13 columns
# For the purpose of this lecture, let's segment this population into two pieces. Let's say those who finish
# the first assignment by the end of December 2015, we'll call them early finishers, and those who finish it
# sometime after that, we'll call them late finishers.
early_finishers=df[pd.to_datetime(df['assignment1_submission']) < '2016']
early_finishers.head()
# So, you have lots of skills now with pandas; how would you go about getting the late_finishers dataframe?
# Why don't you pause the video and give it a try.
# Here's my solution. First, the dataframe df and the early_finishers share index values, so I really just
# want everything in the df which is not in early_finishers
late_finishers=df[~df.index.isin(early_finishers.index)]
late_finishers.head()
# There are lots of other ways to do this. For instance, you could just copy and paste the first projection
# and change the sign from less than to greater than or equal to. This is ok, but if you decide you want to
# change the date down the road you have to remember to change it in two places. You could also do a merge of
# the dataframe df with early_finishers: a left merge keeps all of the items in the left dataframe, and
# dropping the rows that also matched the right one (an anti-join) gives the late finishers, so this would
# have been a good answer. You also could have written a function that determines if someone is early or
# late, and then called .apply() on the dataframe and added a new column to the dataframe. This is a pretty
# reasonable answer as well. A couple of these alternatives are sketched below.
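For reference, here is a minimal sketch of two of the alternatives just described: the reversed comparison and the .apply() approach. The names late_finishers_alt, is_early, and early_mask are my own and are not part of the original lecture.

# Alternative 1: repeat the projection with the comparison reversed
# (remember the caveat above: the date now lives in two places)
late_finishers_alt=df[pd.to_datetime(df['assignment1_submission']) >= '2016']

# Alternative 2: write a function that labels a row as early, and use .apply()
def is_early(row):
    return pd.to_datetime(row['assignment1_submission']) < pd.Timestamp('2016')

early_mask=df.apply(is_early, axis=1)
late_finishers_alt2=df[~early_mask]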
# As you've seen, the pandas data frame object has a variety of statistical functions associated with it. If
# we call the mean function directly on the data frame, we see that each of the means for the assignments is
# calculated. Let's compare the means for our two populations
print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())
74.94728457024303
74.0450648477065
# Ok, these look pretty similar. But, are they the same? What do we mean by similar? This is where the
# Student's t-test comes in. It allows us to form the alternative hypothesis ("These are different") as well
# as the null hypothesis ("These are the same") and then test that null hypothesis.

# When doing hypothesis testing, we have to choose a significance level as a threshold for how much of a
# chance we're willing to accept. This significance level is typically called alpha. For this example, let's
# use a threshold of 0.05 for our alpha, or 5%. Now this is a commonly used number but it's really quite
# arbitrary.

# The SciPy library contains a number of different statistical tests and forms a basis for hypothesis testing
# in Python, and we're going to use the ttest_ind() function which does an independent t-test (meaning the
# populations are not related to one another). The result of ttest_ind() is the t-statistic and a p-value.
# It's this latter value, the probability, which is most important to us, as it indicates the chance (between
# 0 and 1) of seeing a difference at least this large if the null hypothesis were true.

# Let's bring in our ttest_ind function
from scipy.stats import ttest_ind

# Let's run this function with our two populations, looking at the assignment 1 grades
ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])
Ttest_indResult(statistic=1.322354085372139, pvalue=0.1861810110171455)
# So here we see that the probability is 0.18, and this is above our alpha value of 0.05. This means that we
# cannot reject the null hypothesis. The null hypothesis was that the two populations are the same, and we
# don't have enough certainty in our evidence (because the p-value is greater than alpha) to come to a
# conclusion to the contrary. This doesn't mean that we have proven the populations are the same.
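As a small illustration (my own sketch, not a cell from the original lecture), the decision rule described above can be written out explicitly by unpacking the statistic and p-value and comparing against alpha:

# A sketch of the decision rule: compare the p-value against our chosen alpha
alpha=0.05
statistic, pvalue = ttest_ind(early_finishers['assignment1_grade'], late_finishers['assignment1_grade'])
if pvalue <= alpha:
    print("Reject the null hypothesis: the two groups appear to differ")
else:
    print("Fail to reject the null hypothesis: not enough evidence of a difference")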
# Why don't we check the other assignment grades?
print(ttest_ind(early_finishers['assignment2_grade'], late_finishers['assignment2_grade']))
print(ttest_ind(early_finishers['assignment3_grade'], late_finishers['assignment3_grade']))
print(ttest_ind(early_finishers['assignment4_grade'], late_finishers['assignment4_grade']))
print(ttest_ind(early_finishers['assignment5_grade'], late_finishers['assignment5_grade']))
print(ttest_ind(early_finishers['assignment6_grade'], late_finishers['assignment6_grade']))
Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)
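The repeated print() calls above could also be written as a loop over the assignment grade columns. This is just an equivalent sketch, not the code used in the lecture:

# An equivalent loop over the six assignment grade columns
# (including assignment 1, which we already tested above)
for i in range(1,7):
    col='assignment{}_grade'.format(i)
    print(col, ttest_ind(early_finishers[col], late_finishers[col]))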
# Ok, so it looks like in this data we do not have enough evidence to suggest the populations differ with
# respect to grade. Let's take a look at those p-values for a moment though, because they are saying things
# that can inform experimental design down the road. For instance, one of the assignments, assignment 3, has a
# p-value around 0.1. This means that if we had accepted a level of chance similarity of 11%, this would have
# been considered statistically significant. As a researcher, this would suggest to me that there is something
# here worth following up on. For instance, if we had a small number of participants (we don't) or if there
# was something unique about this assignment as it relates to our experiment (whatever it was), then there may
# be followup experiments we could run.
# P-values have come under fire recently for being insufficient for telling us enough about the interactions
# which are happening, and two other techniques, confidence intervals and Bayesian analyses, are being used
# more regularly. One issue with p-values is that as you run more tests you are likely to get a value which
# is statistically significant just by chance.

# Let's see a simulation of this. First, let's create a data frame of 100 columns, each with 100 numbers
df1=pd.DataFrame([np.random.random(100) for x in range(100)])
df1.head()
# Pause this and reflect -- do you understand the list comprehension and how I created this DataFrame? You
# don't have to use a list comprehension to do this, but you should be able to read this and figure out how it
# works, as this is a commonly used approach on web forums. An equivalent version without a comprehension is
# sketched below.
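If the comprehension is unfamiliar, here is an equivalent version without it, plus a purely numpy version; the variable names df1_alt and df1_alt2 are my own and are only for illustration.

# The same kind of DataFrame built with a plain for loop instead of a list comprehension
rows=[]
for x in range(100):
    rows.append(np.random.random(100))
df1_alt=pd.DataFrame(rows)

# Or let numpy generate the whole 100x100 matrix in one call
df1_alt2=pd.DataFrame(np.random.random((100,100)))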
# Ok, let's create a second dataframe
df2=pd.DataFrame([np.random.random(100) for x in range(100)])
# Are these two DataFrames the same? Maybe a better question is, for a given row inside of df1, is it the same
# as the row inside df2?

# Let's take a look. Let's say our critical value is 0.1, or an alpha of 10%. And we're going to compare each
# column in df1 to the same numbered column in df2. And we'll report when the p-value is less than or equal to
# 10%, which means that we have sufficient evidence to say that the columns are different.

# Let's write this in a function called test_columns
def test_columns(alpha=0.1):
    # I want to keep track of how many differ
    num_diff=0
    # And now we can just iterate over the columns
    for col in df1.columns:
        # we can run our ttest_ind between the two dataframes
        teststat,pval=ttest_ind(df1[col],df2[col])
        # and we check the pvalue versus the alpha
        if pval<=alpha:
            # And now we'll just print out if they are different and increment the num_diff
            print("Col {} is statistically significantly different at alpha={}, pval={}".format(col,alpha,pval))
            num_diff=num_diff+1
    # and let's print out some summary stats
    print("Total number different was {}, which is {}%".format(num_diff,float(num_diff)/len(df1.columns)*100))

# And now let's actually run this
test_columns()
Col 4 is statistically significantly different at alpha=0.1, pval=0.03171904508748956
Col 9 is statistically significantly different at alpha=0.1, pval=0.05535264960034999
Col 13 is statistically significantly different at alpha=0.1, pval=0.06411026546897207
Col 25 is statistically significantly different at alpha=0.1, pval=0.06474837389971169
Col 46 is statistically significantly different at alpha=0.1, pval=0.0766378775191977
Col 73 is statistically significantly different at alpha=0.1, pval=0.021821747279110435
Col 76 is statistically significantly different at alpha=0.1, pval=0.04067259983835964
Col 85 is statistically significantly different at alpha=0.1, pval=0.0029530262143926777
Col 86 is statistically significantly different at alpha=0.1, pval=0.010704932840894129
Total number different was 9, which is 9.0%
# Interesting, so we see that there are a bunch of columns that are different! In fact, that number looks a
# lot like the alpha value we chose. So what's going on - shouldn't all of the columns be the same? Remember
# that all the ttest does is check if two sets are similar given some level of confidence, in our case, 10%.
# The more random comparisons you do, the more will just happen to appear different purely by chance. In this
# example, we checked 100 columns, so we would expect roughly 10 of them to show up as different if our alpha
# was 0.1.

# We can test some other alpha values as well
test_columns(0.05)
Col 4 is statistically significantly different at alpha=0.05, pval=0.03171904508748956
Col 73 is statistically significantly different at alpha=0.05, pval=0.021821747279110435
Col 76 is statistically significantly different at alpha=0.05, pval=0.04067259983835964
Col 85 is statistically significantly different at alpha=0.05, pval=0.0029530262143926777
Col 86 is statistically significantly different at alpha=0.05, pval=0.010704932840894129
Total number different was 5, which is 5.0%
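One common, if conservative, way to guard against this multiple-comparisons effect is a Bonferroni-style correction, which divides alpha by the number of tests being run. The sketch below is my own addition, not part of the lecture, and test_columns_bonferroni is an illustrative name:

# A sketch of a Bonferroni correction: divide alpha by the number of tests run
def test_columns_bonferroni(alpha=0.1):
    corrected_alpha=alpha/len(df1.columns)  # 0.1/100 = 0.001 here
    num_diff=0
    for col in df1.columns:
        teststat,pval=ttest_ind(df1[col],df2[col])
        if pval<=corrected_alpha:
            num_diff=num_diff+1
    print("Total number different was {} at corrected alpha={}".format(num_diff,corrected_alpha))

test_columns_bonferroni()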
# So, keep this in mind when you are doing statistical tests like the t-test which has a p-value. Understand
# that this p-value isn't magic, that it's a threshold for you when reporting results and trying to answer
# your hypothesis. What's a reasonable threshold? It depends on your question, and you need to engage domain
# experts to better understand what they would consider significant.

# Just for fun, let's recreate that second dataframe using a non-normal distribution; I'll arbitrarily choose
# chi squared
df2=pd.DataFrame([np.random.chisquare(df=1,size=100) for x in range(100)])
test_columns()
Col 0 is statistically significantly different at alpha=0.1, pval=0.0006359131132062404
Col 1 is statistically significantly different at alpha=0.1, pval=1.6254596067473314e-05
Col 2 is statistically significantly different at alpha=0.1, pval=0.032780760709076824
Col 3 is statistically significantly different at alpha=0.1, pval=0.0032178974546995965
Col 4 is statistically significantly different at alpha=0.1, pval=0.0433143837061582
Col 5 is statistically significantly different at alpha=0.1, pval=7.931565914508593e-05
Col 6 is statistically significantly different at alpha=0.1, pval=0.0017013488226864455
Col 7 is statistically significantly different at alpha=0.1, pval=0.00035664966978204754
Col 8 is statistically significantly different at alpha=0.1, pval=0.0008750262277182237
Col 9 is statistically significantly different at alpha=0.1, pval=3.9005870973813224e-05
Col 10 is statistically significantly different at alpha=0.1, pval=0.0001340016519264235
Col 11 is statistically significantly different at alpha=0.1, pval=2.9387732963990273e-06
Col 12 is statistically significantly different at alpha=0.1, pval=0.0002593817469470156
Col 13 is statistically significantly different at alpha=0.1, pval=0.0001639697026470583
Col 14 is statistically significantly different at alpha=0.1, pval=0.00015526373271627754
Col 15 is statistically significantly different at alpha=0.1, pval=4.1837854872688554e-05
Col 16 is statistically significantly different at alpha=0.1, pval=0.00139904683631816
Col 17 is statistically significantly different at alpha=0.1, pval=0.008340935008253801
Col 18 is statistically significantly different at alpha=0.1, pval=0.0005255784024108116
Col 19 is statistically significantly different at alpha=0.1, pval=0.0010508383648962462
Col 20 is statistically significantly different at alpha=0.1, pval=9.250384193688497e-05
Col 21 is statistically significantly different at alpha=0.1, pval=0.0004139078346825869
Col 22 is statistically significantly different at alpha=0.1, pval=0.000239886090894122
Col 23 is statistically significantly different at alpha=0.1, pval=0.005114973931480402
Col 24 is statistically significantly different at alpha=0.1, pval=3.9612287309026586e-06
Col 25 is statistically significantly different at alpha=0.1, pval=0.001183662110645545
Col 26 is statistically significantly different at alpha=0.1, pval=0.0013250467524673893
Col 27 is statistically significantly different at alpha=0.1, pval=0.0071413662391291675
Col 28 is statistically significantly different at alpha=0.1, pval=0.0018050525585905151
Col 29 is statistically significantly different at alpha=0.1, pval=1.8443168650897388e-07
Col 30 is statistically significantly different at alpha=0.1, pval=0.00018610780116814143
Col 31 is statistically significantly different at alpha=0.1, pval=0.003288231435508567
Col 32 is statistically significantly different at alpha=0.1, pval=0.00022788735453261414
Col 33 is statistically significantly different at alpha=0.1, pval=0.0005420802253194416
Col 34 is statistically significantly different at alpha=0.1, pval=0.00043337835781774326
Col 35 is statistically significantly different at alpha=0.1, pval=0.0009296626942443332
Col 36 is statistically significantly different at alpha=0.1, pval=0.015486530171592511
Col 37 is statistically significantly different at alpha=0.1, pval=0.08968808146980235
Col 38 is statistically significantly different at alpha=0.1, pval=0.0003984738782135047
Col 39 is statistically significantly different at alpha=0.1, pval=0.00029812833313821313
Col 40 is statistically significantly different at alpha=0.1, pval=0.00020580570179961344
Col 41 is statistically significantly different at alpha=0.1, pval=0.016524830474479658
Col 42 is statistically significantly different at alpha=0.1, pval=0.004414785565347055
Col 43 is statistically significantly different at alpha=0.1, pval=0.0006514221896553882
Col 44 is statistically significantly different at alpha=0.1, pval=0.0009246080966774699
Col 45 is statistically significantly different at alpha=0.1, pval=3.947719385039122e-05
Col 46 is statistically significantly different at alpha=0.1, pval=0.003936803652126271
Col 47 is statistically significantly different at alpha=0.1, pval=0.00021471514569468722
Col 48 is statistically significantly different at alpha=0.1, pval=0.00021533428873488688
Col 49 is statistically significantly different at alpha=0.1, pval=0.08198381088074698
Col 50 is statistically significantly different at alpha=0.1, pval=0.0005833899282268582
Col 51 is statistically significantly different at alpha=0.1, pval=0.0008976227172661998
Col 52 is statistically significantly different at alpha=0.1, pval=0.002183481627605226
Col 53 is statistically significantly different at alpha=0.1, pval=0.00011517331466172489
Col 54 is statistically significantly different at alpha=0.1, pval=0.0004539298416324909
Col 55 is statistically significantly different at alpha=0.1, pval=0.0005972354479418133
Col 56 is statistically significantly different at alpha=0.1, pval=0.00039537371324602574
Col 57 is statistically significantly different at alpha=0.1, pval=1.5992684064779836e-05
Col 58 is statistically significantly different at alpha=0.1, pval=0.0035581119921681737
Col 59 is statistically significantly different at alpha=0.1, pval=0.0009069465410284678
Col 60 is statistically significantly different at alpha=0.1, pval=0.001490617627902622
Col 61 is statistically significantly different at alpha=0.1, pval=9.982665694645061e-05
Col 62 is statistically significantly different at alpha=0.1, pval=0.0001855201502830867
Col 63 is statistically significantly different at alpha=0.1, pval=0.0717892199552874
Col 64 is statistically significantly different at alpha=0.1, pval=0.00011111316914952978
Col 65 is statistically significantly different at alpha=0.1, pval=0.0007326247241706332
Col 66 is statistically significantly different at alpha=0.1, pval=0.0006985304647035251
Col 67 is statistically significantly different at alpha=0.1, pval=4.043055942025079e-05
Col 68 is statistically significantly different at alpha=0.1, pval=0.0030620341209672865
Col 69 is statistically significantly different at alpha=0.1, pval=0.0002566485047083988
Col 70 is statistically significantly different at alpha=0.1, pval=1.5265122338601147e-05
Col 71 is statistically significantly different at alpha=0.1, pval=0.0025159633542264875
Col 72 is statistically significantly different at alpha=0.1, pval=0.006683482780335058
Col 73 is statistically significantly different at alpha=0.1, pval=0.0004996694763506213
Col 74 is statistically significantly different at alpha=0.1, pval=0.00024394144210986108
Col 75 is statistically significantly different at alpha=0.1, pval=7.077435724052419e-05
Col 76 is statistically significantly different at alpha=0.1, pval=0.0074093508129584405
Col 77 is statistically significantly different at alpha=0.1, pval=0.006900999886679228
Col 78 is statistically significantly different at alpha=0.1, pval=0.0010484064032203217
Col 79 is statistically significantly different at alpha=0.1, pval=0.0023846343634806146
Col 80 is statistically significantly different at alpha=0.1, pval=8.180792170754975e-05
Col 81 is statistically significantly different at alpha=0.1, pval=5.49471394287242e-05
Col 82 is statistically significantly different at alpha=0.1, pval=0.00033420295079797965
Col 83 is statistically significantly different at alpha=0.1, pval=0.003652590090290096
Col 84 is statistically significantly different at alpha=0.1, pval=0.0005087442921717129
Col 85 is statistically significantly different at alpha=0.1, pval=0.010877718451872447
Col 86 is statistically significantly different at alpha=0.1, pval=0.006487775025921745
Col 87 is statistically significantly different at alpha=0.1, pval=0.00012983278069724697
Col 88 is statistically significantly different at alpha=0.1, pval=0.0003217680289398193
Col 89 is statistically significantly different at alpha=0.1, pval=0.0007595005579977705
Col 90 is statistically significantly different at alpha=0.1, pval=0.0009007314132232909
Col 91 is statistically significantly different at alpha=0.1, pval=0.0041773223441325755
Col 92 is statistically significantly different at alpha=0.1, pval=0.00013412252503388242
Col 93 is statistically significantly different at alpha=0.1, pval=0.0038256024065565237
Col 94 is statistically significantly different at alpha=0.1, pval=0.0025884060593381543
Col 95 is statistically significantly different at alpha=0.1, pval=0.0008645653623897333
Col 96 is statistically significantly different at alpha=0.1, pval=0.00033563002443083584
Col 97 is statistically significantly different at alpha=0.1, pval=9.000223928968477e-06
Col 98 is statistically significantly different at alpha=0.1, pval=0.007777985999968882
Col 99 is statistically significantly different at alpha=0.1, pval=0.00024422963411248024
Total number different was 100, which is 100.0%
# Now we see that all or most columns test to be statistically significant at the 10% level.
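As an aside the lecture doesn't cover in detail: the t-test assumes the data are roughly normal with similar variances, which the chi-squared samples above are not. Two commonly used alternatives are Welch's t-test (ttest_ind with equal_var=False) and the non-parametric Mann-Whitney U test; the sketch below just demonstrates them on the first columns of df1 and df2.

# Welch's t-test drops the equal-variance assumption
print(ttest_ind(df1[0], df2[0], equal_var=False))
# The Mann-Whitney U test is a non-parametric alternative that does not assume normality
print(stats.mannwhitneyu(df1[0], df2[0], alternative='two-sided'))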

In this lecture, we've discussed just some of the basics of hypothesis testing in Python. I introduced you to the SciPy library, which you can use for the Student's t-test. We've discussed some of the practical issues which arise from looking for statistical significance. There's much more to learn about hypothesis testing; for instance, there are different tests to use depending on the shape of your data, and different ways to report results instead of just p-values, such as confidence intervals or Bayesian analyses. But this should give you a basic idea of where to start when comparing two populations for differences, which is a common task for data scientists.