GitHub Repository: suyashi29/python-su
Path: blob/master/T-Test.ipynb
Kernel: Python 3 (ipykernel)

1. Introduction to T-Test

The T-test is a statistical significance test used to determine whether a numeric data sample differs significantly from the population or whether two groups have different average values (for example, whether men and women have different average heights). Statistical significance is determined by the size of the difference between the group averages, the sample size, and the standard deviations of the groups.

  • It is used to determine whether there is a significant difference between the means of two groups, which may be related in certain features.

  • If the t-value is large (in absolute value), the two samples likely come from different groups.
  • If the t-value is small, the two samples likely belong to the same group.

Terminologies in T-Test

  • Degree of freedom (df) – It tells us the number of independent values that are free to vary when calculating the estimate between 2 sample groups

    • df = Σ (size of sample S − 1), summed over the samples (Eq-2)

    • For two samples A and B, the df would be calculated as

    df = (nA − 1) + (nB − 1)
  • Significance level (α) – It is the probability of rejecting the null hypothesis when it is true. In simpler terms, it tells us about the percentage of risk involved in saying that a difference exists between two groups, when in reality it does not.

There are three types of t-tests, categorized as dependent or independent t-tests.

  • Independent samples t-test: compares the means for two groups.

  • Paired sample t-test: compares means from the same group at different times (say, one year apart).

  • One sample t-test: tests the mean of a single group against a known mean.

Why do we use a t-test?

  • To make inference about population beyond our data

  • To check whether the difference between means of two Samples is reliable

  • Comparing Machine Learning Algorithms

1. Independent sample t-test

Formula (Eq-3):

[Image: independent sample t-test formula]

where,

  • t = t-value

  • A = Sample of A

  • B = Sample of B

  • μA = Mean of sample A

  • μB = Mean of sample B

  • nA = sample size of A

  • nB = sample size of B

  • df = degree of freedom
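
The formula image did not survive, so the following is a reconstruction and an assumption rather than the author's exact notation. It is the pooled-variance form of the statistic, written with the symbols listed above and with the sums that the steps of this section ask for:

```latex
t = \frac{\mu_A - \mu_B}
         {\sqrt{\dfrac{\left(\sum A^2 - \frac{(\sum A)^2}{n_A}\right)
                     + \left(\sum B^2 - \frac{(\sum B)^2}{n_B}\right)}
                    {n_A + n_B - 2}
                \left(\dfrac{1}{n_A} + \dfrac{1}{n_B}\right)}}
```

Here the denominator of the inner fraction, nA + nB − 2, is the degree of freedom df = (nA − 1) + (nB − 1).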

Steps involved:

  • Step 1 - Find the sum of all values in each sample.

  • Step 2 - Square the sum values found in step 1.

  • Step 3 - Find the sum of square of individual values in each sample.

  • Step 4 - Calculate the mean of each sample.

  • Step 5 - Find the degree of freedom (df) using Eq-2.

  • Step 6 - Insert all the values found in Steps 1-4 into Eq-3 and find the calculated t-value.

  • Step 7 - Use the values of df and α (take α = 0.05 if not given) to look up the critical value (ttable) in the two-tailed t-table

  • Step 8 - Compare the values of t found in Step-6 (tcal) and Step-7 (ttable)

If |tcal| > ttable => p < α (0.05) => a significant difference between the two groups is found.

If |tcal| < ttable => p > α (0.05) => no significant difference between the two groups.
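
The steps above can be sketched in Python as follows. This is a minimal sketch, assuming Eq-3 is the pooled-variance (equal variance) form of the statistic; the function name `independent_t` and the two small samples are made up for illustration, and the t-table lookup is replaced by `scipy.stats.t.ppf`.

```python
import math
from scipy import stats

def independent_t(a, b, alpha=0.05):
    """Follow Steps 1-7 for two independent samples (pooled variance assumed)."""
    n_a, n_b = len(a), len(b)
    sum_a, sum_b = sum(a), sum(b)                      # Step 1: sum of values
    sq_sum_a, sq_sum_b = sum_a ** 2, sum_b ** 2        # Step 2: square of the sums
    sum_sq_a = sum(x * x for x in a)                   # Step 3: sum of squares
    sum_sq_b = sum(x * x for x in b)
    mean_a, mean_b = sum_a / n_a, sum_b / n_b          # Step 4: sample means
    df = (n_a - 1) + (n_b - 1)                         # Step 5: degree of freedom
    # Step 6: pooled variance, then the calculated t-value
    pooled_var = ((sum_sq_a - sq_sum_a / n_a) + (sum_sq_b - sq_sum_b / n_b)) / df
    t_cal = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    # Step 7: two-tailed critical value instead of a printed t-table
    t_table = stats.t.ppf(1 - alpha / 2, df)
    return t_cal, df, t_table

# Illustrative data only (not from the employee dataset)
a = [20, 22, 19, 24, 25, 21]
b = [28, 27, 30, 26, 29, 31]
t_cal, df, t_table = independent_t(a, b)
# Step 8: compare the calculated and the critical value
significant = abs(t_cal) > t_table
print(t_cal, df, t_table, significant)
```

The calculated value should agree with `scipy.stats.ttest_ind(a, b)`, which uses the same equal-variance assumption by default.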

2. Paired sample t-test

  • The paired sample t-test, commonly known as the dependent sample t-test, is used to find out whether the mean difference between two paired samples is 0.

  • The test is done on dependent samples, usually focusing on a particular group of people or things. Each entity is measured twice, resulting in a pair of observations.

We can use this when:

  • Two similar (twin like) samples are given.

  • The dependent variable (data) is continuous.

  • The pairs of observations are independent of one another.

  • The differences between the paired values are approximately normally distributed.

[Image: paired sample t-test formula]
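
As a hedged illustration of the paired test, the sketch below computes the statistic from the per-pair differences, using the standard form t = mean(d) / (sd(d) / sqrt(n)) with df = n − 1. The function name `paired_t` and the before/after values are invented for the example.

```python
import math
from scipy import stats

def paired_t(before, after):
    """Paired t statistic computed from the per-pair differences."""
    diffs = [x - y for x, y in zip(before, after)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample standard deviation of the differences (n - 1 in the denominator)
    sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    t_cal = mean_d / (sd_d / math.sqrt(n))
    return t_cal, n - 1

# Illustrative measurements of the same six entities at two points in time
before = [72, 68, 75, 80, 66, 71]
after = [70, 65, 76, 74, 64, 69]

t_manual, df = paired_t(before, after)
t_scipy, p_value = stats.ttest_rel(before, after)
print(t_manual, t_scipy, df, p_value)
```

Both values of t should match, since `scipy.stats.ttest_rel` also works on the differences between the paired values.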

3. One sample t-test

The one sample t-test is widely used to compare the sample mean of the data to a given value, typically the true/population mean.

Use it when:

  • the sample size is small (under 30);

  • the data is collected randomly;

  • the data is approximately normally distributed.

  • Formula (Eq-5)

[Image: one sample t-test formula]

Steps:

  • Step 1 - Define the null (H0) and alternative (H1) hypotheses.

  • Step 2 - Calculate the sample mean if it is not given (the population mean, standard deviation, and n are given).

  • Step 3 - Put the values found in Step 2 into Eq-5 and calculate the t-value (tcal).

  • Step 4 - Calculate the degree of freedom (df = n − 1, as in the paired sample t-test).

  • Step 5 - Take α = 0.05 if not given. Use the values of df and α to find ttable from the one-tailed t-table.

  • Step 6 - Compare values of t found in Step-3 and Step-5.
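
The steps above can be sketched as follows, assuming Eq-5 is the usual one-sample form t = (sample mean − population mean) / (s / sqrt(n)). The sample values and the hypothesized mean are invented for illustration, and the one-tailed t-table lookup is replaced by `scipy.stats.t.ppf`.

```python
import math
from scipy import stats

# Illustrative data: H0 says the population mean is 48, H1 says it is larger
sample = [48, 52, 51, 47, 53, 50, 49, 54]
pop_mean = 48
alpha = 0.05

n = len(sample)
mean = sum(sample) / n                                         # Step 2
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))  # sample std. dev.
t_cal = (mean - pop_mean) / (s / math.sqrt(n))                 # Step 3
df = n - 1                                                     # Step 4
t_table = stats.t.ppf(1 - alpha, df)                           # Step 5, one-tailed
reject = t_cal > t_table                                       # Step 6

# Cross-check of the statistic; note that the p-value here is two-sided
t_scipy, p_two_sided = stats.ttest_1samp(sample, popmean=pop_mean)
print(t_cal, t_scipy, t_table, reject)
```

The comparison in Step 6 is one-tailed here because the alternative hypothesis is directional; for a two-sided alternative the critical value would come from `1 - alpha / 2` instead.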

Analysis

Two Independent Sample T-test

The independent t-test is used to test whether population means are significantly different from each other, using the means from randomly drawn samples.

What do you need to run an independent t-test?

In order to run an independent t-test, you need the following: One independent, categorical variable that has two levels/groups. One continuous dependent variable.

Hypothesis Statement:

The null hypothesis for the independent t-test is that the population means from the two unrelated groups are equal: H0: μ1 = μ2

From a sample of employee data (fictitious data), let's test the significance of the relationship between Gender, Performance and Work Life Balance in the company.

  • H0 : mean1 = mean2

  • HA : mean1 ≠ mean2

Which error would you say is more serious?

  • A false positive (type I error): when you reject a true null hypothesis

  • A false negative (type II error): when you fail to reject a false null hypothesis
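
To make the type I error concrete, here is a small simulation sketch (the sample sizes, the number of trials and the seed are arbitrary choices, not part of the original notebook). Both groups are drawn from the same distribution, so the null hypothesis is true and every rejection is a false positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
alpha = 0.05
n_trials = 2000

false_positives = 0
for _ in range(n_trials):
    # Same mean and spread in both groups: H0 is true by construction
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

type_i_rate = false_positives / n_trials
print(type_i_rate)
```

The observed rate should land close to α, which is what the significance level means: the share of true null hypotheses that get rejected.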


Test statistic.

  • The test statistic is a t statistic (t) defined by the following equations:

    $t = \dfrac{\bar{x} - \mu}{SE}$, $\quad SE = s \sqrt{\dfrac{1}{n} \cdot \dfrac{N - n}{N - 1}}$

    where $\bar{x}$ is the sample mean, μ is the hypothesized population mean in the null hypothesis, s is the sample standard deviation, n is the sample size, N is the population size, and SE is the standard error.
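
A short sketch of the standard error as written above, including the finite population factor (N − n) / (N − 1). The function name `standard_error` and the numbers are illustrative assumptions.

```python
import math

def standard_error(s, n, N):
    """SE = s * sqrt((1/n) * (N - n) / (N - 1)) for a sample of n out of N."""
    return s * math.sqrt((1 / n) * ((N - n) / (N - 1)))

def t_statistic(sample_mean, mu, s, n, N):
    """t = (sample mean - hypothesized mean) / SE."""
    return (sample_mean - mu) / standard_error(s, n, N)

# Illustrative values: 25 employees sampled from a company of 1000
se = standard_error(s=2.0, n=25, N=1000)
t = t_statistic(sample_mean=51.0, mu=50.0, s=2.0, n=25, N=1000)
print(se, t)
```

When N is much larger than n, the factor is close to 1 and SE reduces to the familiar s / sqrt(n).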

Importing Required Packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt; plt.rcdefaults()
import csv
import seaborn as sns
from scipy.stats import ttest_ind  # to run the t-test for independent samples
from scipy import stats
from scipy.stats import spearmanr  # to run Spearman's rank correlation
%matplotlib inline

emp_data = pd.read_excel("Emp_data.xlsx")
emp_data.head()
# first divide the data into two groups of cases based on gender
F = emp_data[emp_data['Gender']=='Female']
M = emp_data[emp_data['Gender']=='Male']
sns.violinplot(x="Gender", y="Performance Rating", data=emp_data)
<matplotlib.axes._subplots.AxesSubplot at 0x20b1410bf88>
[Figure: violin plot of Performance Rating by Gender]

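t = (difference between males and females) / (variation within males and females) in ratings. t = 0 means the two group means are equal; t = 1 means the difference equals its standard error; t > 1 means the difference exceeds its standard error.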

ttest_ind(M['Performance Rating'], F['Performance Rating'], nan_policy='omit')
Ttest_indResult(statistic=10.767072782578031, pvalue=4.5014310952740897e-26)

Insight

  • The t score (statistic) is the ratio of the difference between the two groups to the variation within the groups. A high t value means the difference between the group means is large relative to the spread of ratings inside each group.

  • As the p value is less than 0.05, there is a significant difference in Performance Rating between genders.

Reject the null hypothesis: the difference is significant.

Let's check whether there is a significant difference in Work Life Balance between males and females.

# let's see if there are differences in Work Life Balance
ttest_ind(M['Work Life Balance'], F['Work Life Balance'], nan_policy='omit')
Ttest_indResult(statistic=2.796309621675735, pvalue=0.005236371539080133)

Insight

As the p value (0.005) is less than 0.05, there is a significant difference in Work Life Balance between the two gender groups.

Let's visualize the distribution of men according to Performance rating since this was significant

M_Performance = M['Performance Rating'].value_counts()
print(M_Performance)
3    540
4    253
5     77
2     11
1      1
Name: Performance Rating, dtype: int64
# ratings grouped as: 4 = Good, 5 = Excellent, 3 = Average, 1 and 2 = Bad
objects = ('Good', 'Excellent', 'Average', 'Bad')
y_pos = np.arange(len(objects))
performance = [253, 77, 540, 12]
plt.bar(y_pos, performance, align='center', color='orange')
plt.xticks(y_pos, objects)
plt.ylabel('Frequency of Performance')
plt.title('Males Performance in Company')
plt.show()
[Figure: bar chart, Males Performance in Company]

Let's visualize the distribution of women according to Performance rating since this was significant

F_Performance = F['Performance Rating'].value_counts()
print(F_Performance)
3    399
4     92
2     77
5     16
1      4
Name: Performance Rating, dtype: int64
# ratings grouped as: 4 = Good, 5 = Excellent, 3 = Average, 1 and 2 = Bad
objects = ('Good', 'Excellent', 'Average', 'Bad')
y_pos = np.arange(len(objects))
performance = [92, 16, 399, 81]
plt.bar(y_pos, performance, align='center', color='green')
plt.xticks(y_pos, objects)
plt.ylabel('Frequency of Performance')
plt.title('Females Performance in Company')
plt.show()
[Figure: bar chart, Females Performance in Company]
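
The bar heights in the two charts above are typed in by hand from the printed counts. As a hedged alternative, the grouping can be derived directly from the ratings. The sketch below uses a tiny made-up Series instead of the employee data, and the rating-to-label mapping (1 and 2 as Bad, 3 as Average, 4 as Good, 5 as Excellent) is inferred from those hand-typed numbers.

```python
import pandas as pd

# Made-up ratings standing in for a 'Performance Rating' column
ratings = pd.Series([3, 3, 4, 5, 2, 1, 3, 4])

# Assumed mapping, inferred from the hand-typed bar heights
labels = {1: 'Bad', 2: 'Bad', 3: 'Average', 4: 'Good', 5: 'Excellent'}
order = ['Good', 'Excellent', 'Average', 'Bad']

# Count each label and fix the bar order; labels with no rows get 0
counts = ratings.map(labels).value_counts().reindex(order, fill_value=0)
print(counts)
```

The resulting `counts` can then be passed to the plotting call in place of the literal list, which avoids transcription slips when the data changes.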

Non-Parametric Test

A non-parametric test is one in which, even after looking at the data, we make no assumption about the population parameters.

Spearman

Measures the strength of association between two variables and direction of relationship (positive , negative or no-relation).

Why Spearman?

  • We run Spearman's rank correlation to find the relation between variables

  • This test is applicable to ordinal (ranked) data and to continuous data that can be ranked; nominal categories have no order, so they are not suitable.

  • The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of the two variables.

  • The correlation coefficient ranges from -1.0 to 1.0, so its absolute value is bounded by 1.0.
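
As a small illustration of what the coefficient measures, the sketch below computes Spearman's correlation with `scipy.stats.spearmanr` and again as the Pearson correlation of the ranks. The two lists are made-up ordinal scores, not the employee data.

```python
import numpy as np
from scipy import stats

# Made-up ordinal scores (1-5) for eight respondents
x = [1, 2, 2, 3, 4, 5, 5, 3]
y = [2, 1, 3, 3, 5, 4, 5, 2]

rho, p_value = stats.spearmanr(x, y)

# Same quantity by hand: rank both variables (ties get average ranks),
# then take the ordinary Pearson correlation of the ranks
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)
rho_manual = np.corrcoef(rank_x, rank_y)[0, 1]

print(rho, rho_manual, p_value)
```

Because only ranks are used, the result does not depend on the spacing between the score values, only on their order.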

H0 : There is no association between WorkLife Balance & Performance Rating

Let's use the Spearman test to check the relation between Work Life Balance & Performance Rating

x = emp_data['Work Life Balance']
y = emp_data['Performance Rating']
spearmanr(x, y)
SpearmanrResult(correlation=-0.01154962807338392, pvalue=0.6581581853284915)

Insight

  • No significant association between Work Life Balance & Performance Rating (p = 0.658 is greater than 0.05)

  • Fail to reject the null hypothesis

x = emp_data['Skills']
y = emp_data['MonthlyIncome(June-2018)']
spearmanr(x, y)
SpearmanrResult(correlation=0.0651222786562654, pvalue=0.012512430077477259)

Insight

  • Very weak positive correlation (about 0.065) between Skills & Monthly Income

  • As the p value (0.0125) is less than 0.05, reject the null hypothesis of no association

One-Sample T-Test

A One sample t-test tests the mean of a single group against a known mean.

Hypothesis Statements

  • H0 : Average salary of employees matches population mean

pop = emp_data['MonthlyIncome(June-2018)']  # June 2018 incomes; their mean is used as the population mean
# use the name ax for the Axes so that the matplotlib module plt is not overwritten
ax = emp_data['MonthlyIncome(June-2018)'].hist(bins=20)
ax.set_ylabel('Value')
ax.set_xlabel('Salary')
ax.set_title('Salary Distribution for 2018 june month', size=17, y=.05, x=1.05, alpha=0.05)
Text(1.05, 0.05, 'Salary Distribution for 2018 june month')
[Figure: histogram, Salary Distribution for June 2018]
samp = emp_data['MonthlyIncome(June-2019)']
ax = emp_data['MonthlyIncome(June-2019)'].hist(bins=20)
ax.set_ylabel('Value')
ax.set_xlabel('Salary')
ax.set_title('Salary Distribution for 2019 june month', size=17, y=1.08)
Text(0.5, 1.08, 'Salary Distribution for 2019 june month')
[Figure: histogram, Salary Distribution for June 2019]

Let's check the significance of the difference between the given means

stats.ttest_1samp(a=samp,popmean=pop.mean())
Ttest_1sampResult(statistic=14.199894918280686, pvalue=5.5717678659282984e-43)

Insight

  • As the p value is less than 0.05, reject the null hypothesis

There is a significant difference.

Hence we can conclude that the average salary of employees does not match the population mean.

Two-Sample T-Test


Employee skill set across the company

F_Skills = F['Skills'].value_counts()
print(F_Skills)
Sales                     242
Research                  114
Content Writing            85
Employee relations         51
Workforce Management       47
Delivery                   33
Performance Management     16
Name: Skills, dtype: int64
objects = labels = ['Sales', 'Research', 'C_W', 'E_R', 'W_M', 'Delivery', 'Per_M']
y_pos = np.arange(len(objects))
performance = [242, 114, 85, 51, 47, 33, 16]
colors = ['green', 'brown', 'gray', 'yellow', 'blue', 'red', 'pink']
plt.bar(y_pos, performance, align='center', color=colors)
plt.xticks(y_pos, objects)
plt.ylabel('Number of Employees')
plt.title('Skills across company')
plt.show()

Unpaired T-Test

A two sample (unpaired) t-test compares the means of two independent groups with each other.

  • H0: Mean of monthly incomes is the same for Sales & Delivery skills

Sales = emp_data.loc[emp_data['Skills'] == 'Sales', 'MonthlyIncome(June-2018)']
Delivery = emp_data.loc[emp_data['Skills'] == 'Delivery', 'MonthlyIncome(June-2018)']

Two-sample t-test to check whether the mean salary for Sales skills differs from that for Delivery skills

stats.ttest_ind(a=Sales, b=Delivery, equal_var=False)
Ttest_indResult(statistic=0.8593087292757945, pvalue=0.39216156367042443)

Insight

  • As the p value (0.39) is greater than 0.05, we fail to reject the null hypothesis

The mean salaries of employees in Sales & Delivery do not differ significantly.
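
The call above passes `equal_var=False`, which makes SciPy run Welch's t-test instead of the pooled-variance version. The sketch below shows, under that reading, how the statistic, the Welch-Satterthwaite degrees of freedom and the two-sided p value can be computed by hand. The function name `welch_t` and the two income lists are made up for illustration.

```python
import math
from scipy import stats

def welch_t(a, b):
    """Welch's t-test: no assumption that the two groups share one variance."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    se2_a, se2_b = var_a / n_a, var_b / n_b        # squared standard errors
    t_cal = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    p_value = 2 * stats.t.sf(abs(t_cal), df)       # two-sided p value
    return t_cal, df, p_value

# Illustrative incomes for two skill groups of different size and spread
group_1 = [4200, 3900, 5100, 4700, 4400, 6100, 3800, 4500]
group_2 = [4300, 4100, 4600, 4250, 4400]

t_cal, df, p_value = welch_t(group_1, group_2)
t_scipy, p_scipy = stats.ttest_ind(a=group_1, b=group_2, equal_var=False)
print(t_cal, df, p_value)
```

Welch's version is a reasonable choice here because the Sales and Delivery groups differ a lot in size, so equal variances are hard to take for granted.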

Paired T-Test

A paired t-test compares two samples taken from the same group at different points in time.

H0: The mean income for females is the same as in the previous year (2018).

  • μ1: previous year's income mean

  • μ2: this year's income mean

  • H0: μ1 - μ2 = 0

f2018_income = emp_data.loc[emp_data['Gender'] == 'Female', 'MonthlyIncome(June-2018)']
f2019_income = emp_data.loc[emp_data['Gender'] == 'Female', 'MonthlyIncome(June-2019)']
stats.ttest_rel( a=f2018_income, b=f2019_income )
Ttest_relResult(statistic=-19.671096009441541, pvalue=1.5006282761149835e-66)

Insight

  • As the p value (1.5e-66) is less than 0.05, reject the null hypothesis

The mean income for females differs from the previous year (the difference is significant).