CoCalc -- T-Test.ipynb

GitHub Repository: suyashi29/python-su
Path: blob/master/T-Test.ipynb

Kernel: Python 3 (ipykernel)

Introduction to T-Test
Analysis
- 2.1 Two Independent Sample T-Test
  - 2.1.1 Hypothesis Statement
  - 2.1.2 Implementation
- 2.2 Non-Parametric Test
  - 2.2.1 Spearman Coefficient
  - 2.2.2 Implementation
- 2.3 One Sample Test
  - 2.3.1 Hypothesis
  - 2.3.2 Implementation
- 2.4 Two Sample Test
  - 2.4.1 Unpaired
  - 2.4.2 Paired

1.Introduction to T-Test

The T-test is a statistical significance test used to determine whether a numeric data sample differs significantly from the population or whether two groups have different average values (for example, whether men and women have different average heights). Statistical significance is determined by the size of the difference between the group averages, the sample size, and the standard deviations of the groups.

It is used to determine whether there is a significant difference between the means of two groups, which may be related in certain features.

 - If the t-value is large (in absolute value) => the two groups are likely to come from populations with different means.

 - If the t-value is small => the two groups are likely to come from the same population.

Terminologies in T-Test

Degree of freedom (df) – It tells us the number of independent values that are free to vary when calculating an estimate from the sample groups.
- df = Σ (nS - 1), summed over every sample S (Equation 2)
- for two samples:
The df would be calculated as
```
   df = (nA-1) + (nB -1)
```
Significance level (α) – It is the probability of rejecting the null hypothesis when it is true. In simpler terms, it tells us about the percentage of risk involved in saying that a difference exists between two groups, when in reality it does not.
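
The meaning of α can be checked with a small simulation. The sketch below is illustrative only: the population mean, spread, group size, and number of repetitions are assumed values, not taken from the employee data. Both groups are drawn from the same population, so every "significant" result is a false positive.

```python
import numpy as np
from scipy import stats

# Assumed simulation settings (hypothetical, not from the notebook's data).
rng = np.random.default_rng(0)
n_experiments, n_per_group, alpha = 2000, 30, 0.05

# Both groups come from the SAME normal population, so the null hypothesis is true.
group_a = rng.normal(loc=50, scale=10, size=(n_experiments, n_per_group))
group_b = rng.normal(loc=50, scale=10, size=(n_experiments, n_per_group))

# One independent t-test per row (per simulated experiment).
_, p_values = stats.ttest_ind(group_a, group_b, axis=1)

# Share of experiments that wrongly report a significant difference.
false_positive_rate = np.mean(p_values < alpha)
print(f"False positive rate: {false_positive_rate:.3f} (alpha = {alpha})")
```

The printed rate should land close to α, which is the risk of claiming a difference that does not exist.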

Types of t-tests (categorized as dependent and independent t-tests)

Independent samples t-test: compares the means for two groups.
Paired sample t-test: compares means from the same group at different times (say, one year apart).
One sample t-test: tests the mean of a single group against a known mean.

Why do we do a t-test?

To make inferences about the population beyond our data
To check whether the difference between the means of two samples is reliable
To compare machine learning algorithms

1. Independent sample t-test

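Formula (Eq-3):

$$ t = \frac{\mu_A - \mu_B}{\sqrt{\dfrac{\left(\sum A^2 - \frac{(\sum A)^2}{n_A}\right) + \left(\sum B^2 - \frac{(\sum B)^2}{n_B}\right)}{n_A + n_B - 2} \cdot \left(\dfrac{1}{n_A} + \dfrac{1}{n_B}\right)}} $$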

where,

t = t-value
A = Sample of A
B = Sample of B
μA = Mean of sample A
μB = Mean of sample B
nA = sample size of A
nB = sample size of B
df = degree of freedom

Steps involved:

Step 1 - Find the sum of all values in each sample.
Step 2 - Square the sum values found in step 1.
Step 3 - Find the sum of square of individual values in each sample.
Step 4 - Calculate the mean of each sample.
Step 5 - Find the degree of freedom (df) using Eq-2.
Step 6 - Insert all the values found in Steps 1-4 into Eq-3 and find the calculated t-value.
Step 7 - Use the values of df and α (take α = 0.05 if not given) in the two-tailed t-table to find the critical t-value (ttable).
Step 8 - Compare the values of t found in Step 6 and Step 7.

If |tcal| > ttable => p < (α=0.05) => significant difference between the two groups found.

If |tcal| < ttable => p > (α=0.05) => no significant difference between the two groups.
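
The steps above can be sketched in Python as follows. The sample values are hypothetical, and Step 6 assumes the usual pooled-variance form of the statistic. The final lines cross-check the hand calculation against `scipy.stats.ttest_ind`, which assumes equal variances by default.

```python
import numpy as np
from scipy import stats

# Hypothetical sample values, used only for illustration.
A = np.array([12.0, 15.0, 11.0, 14.0, 13.0, 16.0])
B = np.array([10.0, 9.0, 12.0, 11.0, 8.0, 10.0])
n_a, n_b = len(A), len(B)

# Steps 1-3: sums, squared sums, and sums of squared values.
sum_a, sum_b = A.sum(), B.sum()
sq_sum_a, sq_sum_b = sum_a ** 2, sum_b ** 2
sum_sq_a, sum_sq_b = (A ** 2).sum(), (B ** 2).sum()

# Step 4: sample means.
mean_a, mean_b = sum_a / n_a, sum_b / n_b

# Step 5: degrees of freedom.
df = (n_a - 1) + (n_b - 1)

# Step 6: pooled-variance t statistic (assumed standard form).
ss_a = sum_sq_a - sq_sum_a / n_a  # sum of squared deviations in A
ss_b = sum_sq_b - sq_sum_b / n_b  # sum of squared deviations in B
pooled_var = (ss_a + ss_b) / df
t_cal = (mean_a - mean_b) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Step 7: two-tailed critical value for alpha = 0.05.
alpha = 0.05
t_table = stats.t.ppf(1 - alpha / 2, df)

# Step 8: compare the calculated and tabulated values.
significant = abs(t_cal) > t_table

# Cross-check against SciPy (equal_var=True is the default).
t_scipy, p_scipy = stats.ttest_ind(A, B)
print(f"t_cal={t_cal:.4f}, t_table={t_table:.4f}, significant={significant}")
print(f"scipy: t={t_scipy:.4f}, p={p_scipy:.4f}")
```

Comparing |tcal| with ttable and comparing the p-value with α lead to the same decision, so the table lookup is only needed when working by hand.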

2. Paired sample t-test

The paired sample t-test, also known as the dependent sample t-test, is used to find out whether the mean difference between two sets of paired observations is 0.
The test is done on dependent samples, usually focusing on a particular group of people or things. Each entity is measured twice, resulting in a pair of observations.

We can use this when:

Two similar (twin like) samples are given.
The dependent variable (data) is continuous.
The pairs of observations are independent of one another.
The differences between the paired values are approximately normally distributed.
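
Because each entity contributes a pair of observations, a paired t-test is equivalent to a one sample t-test on the pairwise differences against 0. A minimal sketch with hypothetical before/after measurements:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements of the same eight entities at two points in time.
before = np.array([72.0, 65.0, 80.0, 77.0, 69.0, 74.0, 71.0, 68.0])
after = np.array([75.0, 66.0, 84.0, 79.0, 70.0, 78.0, 70.0, 72.0])

# Paired t-test on the two sets of measurements.
t_paired, p_paired = stats.ttest_rel(after, before)

# One sample t-test on the pairwise differences, tested against 0.
diff = after - before
t_diff, p_diff = stats.ttest_1samp(diff, popmean=0.0)

print(f"paired:      t={t_paired:.4f}, p={p_paired:.4f}")
print(f"differences: t={t_diff:.4f}, p={p_diff:.4f}")
```

Both calls report the same statistic and p-value, which is why the normality assumption applies to the differences rather than to each measurement separately.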

3. One sample t-test

The one sample t-test is one of the most widely used t-tests. It compares the sample mean of the data to a given value, typically the true/population mean.

Use it when:
the sample size is small (under 30).
the data is collected randomly.
the data is approximately normally distributed.

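Formula (Eq-5):

$$ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} $$

where $\bar{x}$ is the sample mean, μ is the population mean, s is the standard deviation, and n is the sample size.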

Steps:

Step 1 - Define the null (h0) and alternative (h1) hypotheses.
Step 2 - Calculate the sample mean if it is not given (the population mean, standard deviation, and n are given).
Step 3 - Put the values found in Step 2 into Eq-5 and calculate the t-value (tcal).
Step 4 - Calculate the degrees of freedom: df = n - 1 (the same as in the paired sample t-test).
Step 5 - Take α = 0.05 if not given. Use the values of df and α to find ttable from the one-tailed t-table.
Step 6 - Compare the values of t found in Step 3 and Step 5.
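
These steps can be sketched as below. The sample and the hypothesized mean `mu_0` are hypothetical, the alternative hypothesis is assumed to be "the true mean is greater than `mu_0`", and the statistic is assumed to take the usual form t = (x̄ − μ) / (s / √n).

```python
import numpy as np
from scipy import stats

# Hypothetical sample and hypothesized population mean.
sample = np.array([52.1, 49.8, 53.4, 51.0, 50.6, 54.2, 48.9, 52.7, 51.5, 53.0])
mu_0 = 50.0  # Step 1: H0: mean == mu_0, H1: mean > mu_0 (assumed direction)

# Step 2: sample statistics.
n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation

# Step 3: calculated t-value.
t_cal = (x_bar - mu_0) / (s / np.sqrt(n))

# Step 4: degrees of freedom.
df = n - 1

# Step 5: one-tailed critical value for alpha = 0.05.
alpha = 0.05
t_table = stats.t.ppf(1 - alpha, df)

# Step 6: compare.
reject_null = t_cal > t_table

# Cross-check the statistic with SciPy (its default p-value is two-sided).
t_scipy, p_two_sided = stats.ttest_1samp(sample, popmean=mu_0)
print(f"t_cal={t_cal:.4f}, t_table={t_table:.4f}, reject H0: {reject_null}")
```

Note that `ttest_1samp` reports a two-sided p-value unless told otherwise, so it matches the one-tailed table only after the direction of the alternative is taken into account.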

Analysis

Two Independent Sample T-test

The independent t-test is used to test whether population means are significantly different from each other, using the means from randomly drawn samples.

What do you need to run an independent t-test?

In order to run an independent t-test, you need the following:
One independent, categorical variable that has two levels/groups.
One continuous dependent variable.

Hypothesis Statement:

The null hypothesis for the independent t-test is that the population means from the two unrelated groups are equal: H₀: μ1 = μ2

From a sample of employee data (fictitious data), let's find the significance of the relationship between Gender, Performance & Work life balance in the company.

H₀ : mean1 = mean2
H_A : mean1 ≠ mean2

Which error would you say is more serious?

A false positive (type I error): when you reject a true null hypothesis
A false negative (type II error): when you fail to reject a false null hypothesis

Test statistic.

The test statistic is a t statistic (t) defined by the following equations:

$$ t = \frac{\bar{x} - \mu}{SE}, \qquad SE = s \sqrt{\frac{1}{n} \cdot \frac{N - n}{N - 1}} $$

where $\bar{x}$ is the sample mean, μ is the hypothesized population mean in the null hypothesis, s is the sample standard deviation, n is the sample size, N is the population size, and SE is the standard error.
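
A minimal sketch of this statistic follows. The function names `standard_error` and `t_statistic` and the ratings are hypothetical. The factor (N - n) / (N - 1) is a finite population correction; when the population size is left out, the sketch falls back to the plain s / √n.

```python
import numpy as np

def standard_error(sample, population_size=None):
    """Standard error of the mean, with an optional finite population correction."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    s = sample.std(ddof=1)  # sample standard deviation
    se = s * np.sqrt(1.0 / n)
    if population_size is not None:
        # Correction factor sqrt((N - n) / (N - 1)) from the formula above.
        se *= np.sqrt((population_size - n) / (population_size - 1))
    return se

def t_statistic(sample, mu, population_size=None):
    """t = (sample mean - hypothesized mean) / standard error."""
    return (np.mean(sample) - mu) / standard_error(sample, population_size)

# Hypothetical ratings drawn from a small population of 50 employees.
ratings = [3, 4, 3, 5, 4, 3, 4, 4, 2, 4]
print("without correction:", t_statistic(ratings, mu=3.0))
print("with N = 50:       ", t_statistic(ratings, mu=3.0, population_size=50))
```

The correction shrinks the standard error when the sample is a sizeable share of the population, and it becomes negligible as N grows.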

Importing Required Packages

In [3]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcdefaults()  # reset matplotlib settings to their defaults
import seaborn as sns
from scipy.stats import ttest_ind #to run the t-test for independent samples
from scipy import stats
from scipy.stats import spearmanr #to run spearman
%matplotlib inline
emp_data = pd.read_excel("Emp_data.xlsx")

In [19]:

emp_data.head()

Out[19]:

In [20]:

# first, split the data into two groups based on gender
F = emp_data[emp_data['Gender']=='Female']
M = emp_data[emp_data['Gender']=='Male']

In [21]:

sns.violinplot(x="Gender", y="Performance Rating", data=emp_data)

Out[21]:

<matplotlib.axes._subplots.AxesSubplot at 0x20b1410bf88>

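t = (difference between males and females) / (difference within males and females) for the ratings. A t near 0 means the two groups barely differ; the further |t| is from 0, the larger the between-group difference relative to the within-group variation.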

In [22]:

ttest_ind(M['Performance Rating'], F['Performance Rating'], nan_policy='omit')

Out[22]:

Ttest_indResult(statistic=10.767072782578031, pvalue=4.5014310952740897e-26)

Insight

The t score (statistic) is the ratio of the difference between the two groups to the variation within the groups. A high t value means the difference between the group means is large relative to the spread of scores inside each group.
As the p value is less than 0.05, the mean Performance Rating differs significantly between genders.

Reject the null hypothesis: the difference is significant
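
To see the "between versus within" ratio at work, the sketch below keeps the difference in means fixed at 0.3 and changes only the spread inside each group. All values are constructed for illustration and are not taken from the employee data.

```python
import numpy as np
from scipy import stats

# A fixed, symmetric pattern of deviations around each group mean.
shape = np.linspace(-1.0, 1.0, 50)

# Same mean difference (0.3) in both scenarios; only the within-group spread changes.
tight_a, tight_b = 3.0 + 0.5 * shape, 3.3 + 0.5 * shape
wide_a, wide_b = 3.0 + 3.0 * shape, 3.3 + 3.0 * shape

t_tight, p_tight = stats.ttest_ind(tight_b, tight_a)
t_wide, p_wide = stats.ttest_ind(wide_b, wide_a)

print(f"small within-group spread: t={t_tight:.3f}, p={p_tight:.4g}")
print(f"large within-group spread: t={t_wide:.3f}, p={p_wide:.4g}")
```

With six times the within-group spread, the t value is six times smaller, and the same 0.3 gap is no longer significant at α = 0.05.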

Let's check whether Work Life Balance differs significantly between males and females

In [23]:

#let's see if there are differences in Work Life Balance 
ttest_ind(M['Work Life Balance'], F['Work Life Balance'], nan_policy='omit')

Out[23]:

Ttest_indResult(statistic=2.796309621675735, pvalue=0.005236371539080133)

Insight

As the p value (0.005) is less than 0.05, there is a significant difference in Work Life Balance between the two gender groups.

Let's visualize the distribution of men according to Performance rating since this was significant

In [24]:


M_Performance = M['Performance Rating'].value_counts()
print(M_Performance)

Out[24]:

  540
  253
   77
   11
    1
Name: Performance Rating, dtype: int64

In [25]:

objects = ('Good', 'Excellent', 'Average','Bad',)
y_pos = np.arange(len(objects))
performance = [253,77,540,12]
 
plt.bar(y_pos, performance, align='center', color = 'orange')
plt.xticks(y_pos, objects)
plt.ylabel('Frequency of Performance')
plt.title('Males Performance in Company')
 
plt.show()

Out[25]:

Let's visualize the distribution of women according to Performance rating since this was significant

In [26]:


F_Performance = F['Performance Rating'].value_counts()
print(F_Performance)

Out[26]:

  399
   92
   77
   16
    4
Name: Performance Rating, dtype: int64

In [27]:

objects = ('Good', 'Excellent', 'Average','Bad',)
y_pos = np.arange(len(objects))
performance = [92,16,399,81]
 
plt.bar(y_pos, performance, align='center', color = 'green')
plt.xticks(y_pos, objects)
plt.ylabel('Frequency of Performance')
plt.title('Females Performance in Company')
 
plt.show()

Out[27]:
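
The two bar charts above type the counts in by hand. As a hedged alternative, the counts can be taken straight from `value_counts()` and put in a fixed order with `reindex`. The ratings below are hypothetical, and the category labels are assumed to be stored as text such as 'Good'; the real column may be coded differently.

```python
import pandas as pd

# Hypothetical ratings; the labels are assumptions about how the column is coded.
ratings = pd.Series(
    ["Average", "Good", "Average", "Excellent", "Bad",
     "Good", "Average", "Average", "Good", "Excellent"],
    name="Performance Rating",
)

# Desired order of the bars, matching the charts above.
order = ["Good", "Excellent", "Average", "Bad"]

# Count each category, then reorder; categories with no rows get 0.
counts = ratings.value_counts().reindex(order, fill_value=0)
print(counts)
```

The resulting `counts.index` and `counts.values` can be passed to `plt.bar` in place of the hard-coded `objects` and `performance` lists, so the chart stays in step with the data.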

Non-Parametric Test

A non-parametric test is one that makes no assumption about the population parameters or the shape of the population distribution.

Spearman

Measures the strength of association between two variables and direction of relationship (positive , negative or no-relation).

Why Spearman?

We run Spearman's rank correlation to find the relation between variables.
This test is applicable to ordinal (ranked) data and to continuous data; it is not meaningful for purely nominal categories.
The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of the two variables.
The correlation coefficient ranges from -1.0 to 1.0, so its absolute value is at most 1.0.
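
Spearman's coefficient is the Pearson correlation computed on the ranks of the data, which is why it suits ordinal scores. A minimal sketch with hypothetical 1-4 scores:

```python
import numpy as np
from scipy import stats

# Hypothetical ordinal scores on a 1-4 scale.
work_life = np.array([1, 3, 2, 4, 3, 2, 4, 1, 3, 2])
performance = np.array([2, 3, 3, 4, 2, 1, 4, 2, 3, 3])

# Spearman's rank correlation and its p-value.
rho, p_value = stats.spearmanr(work_life, performance)

# The same coefficient by hand: rank both variables (ties get average ranks),
# then take the ordinary Pearson correlation of the ranks.
rank_x = stats.rankdata(work_life)
rank_y = stats.rankdata(performance)
rho_from_ranks = np.corrcoef(rank_x, rank_y)[0, 1]

print(f"spearmanr: rho={rho:.4f}, p={p_value:.4f}")
print(f"from ranks: rho={rho_from_ranks:.4f}")
```

The sign of rho gives the direction of the relationship and its magnitude gives the strength; the p-value tests the null hypothesis of no association.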

H₀ : There is no association between WorkLife Balance & Performance Rating

Let's use the Spearman test to check the relation between Work Life Balance & Performance Rating

In [28]:

x = (emp_data['Work Life Balance'])
y = (emp_data['Performance Rating'])
spearmanr(x,y)

Out[28]:

SpearmanrResult(correlation=-0.01154962807338392, pvalue=0.6581581853284915)

Insight

No significant association between Work Life Balance & Performance Rating (p ≈ 0.66 > 0.05)
Fail to reject the null hypothesis

In [29]:

x = (emp_data['Skills'])
y = (emp_data['MonthlyIncome(June-2018)'])
spearmanr(x,y)

Out[29]:

SpearmanrResult(correlation=0.0651222786562654, pvalue=0.012512430077477259)

Insight

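Weak positive correlation between Skills & Monthly Income (ρ ≈ 0.065, p ≈ 0.0125 < 0.05)
Reject the null hypothesis of no association. Skills is a nominal variable, so this rank correlation should be read with caution.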

One-Sample T-Test

A One sample t-test tests the mean of a single group against a known mean.

Hypothesis Statements

H₀ : Average salary of employees matches population mean

In [30]:

pop = emp_data['MonthlyIncome(June-2018)'] # June 2018 incomes, used as the reference population

In [31]:

ax = emp_data['MonthlyIncome(June-2018)'].hist(bins=20)  # use ax so the plt module is not overwritten
ax.set_ylabel('Value')
ax.set_xlabel('Salary')
ax.set_title('Salary Distribution for 2018 june month',size=17, y=.05,x=1.05,alpha=0.05)

Out[31]:

Text(1.05, 0.05, 'Salary Distribution for 2018 june month')

In [32]:

samp=emp_data['MonthlyIncome(June-2019)']

In [33]:

ax = emp_data['MonthlyIncome(June-2019)'].hist(bins=20)  # use ax so the plt module is not overwritten
ax.set_ylabel('Value')
ax.set_xlabel('Salary')
ax.set_title('Salary Distribution for 2019 june month',size=17, y=1.08)

Out[33]:

Text(0.5, 1.08, 'Salary Distribution for 2019 june month')

Let's check the significance of the difference between the given means

In [40]:

stats.ttest_1samp(a=samp,popmean=pop.mean())

Out[40]:

Ttest_1sampResult(statistic=14.199894918280686, pvalue=5.5717678659282984e-43)

Insight

As the p-value is less than 0.05, reject the null hypothesis.

There is a significant difference

Hence we can conclude that the average salary of employees does not match the population mean.

Two-Sample T-Test

Employee skill set across the company

In [35]:

F_Skills = F['Skills'].value_counts()
print(F_Skills)

Out[35]:

Sales                     242
Research                  114
Content Writing            85
Employee relations         51
Workforce Management       47
Delivery                   33
Performance Management     16
Name: Skills, dtype: int64

objects = ['Sales', 'Research', 'C_W', 'E_R','W_M','Delivery','Per_M']  # short labels for the seven skills
y_pos = np.arange(len(objects))
performance = [242,114,85,51,47,33,16]
colors = ['green', 'brown', 'gray', 'yellow','blue','red','pink']
plt.bar(y_pos, performance, align='center', color = colors)
plt.xticks(y_pos, objects)
plt.ylabel('Number of Employees')
plt.title('Skills across company')
 
plt.show()

Unpaired T-Test

An unpaired two sample t-test compares the means of two independent groups with each other.

H₀ : Mean of monthly incomes is the same for Sales & Delivery skills

In [41]:

Sales = emp_data.loc[emp_data['Skills'] == 'Sales','MonthlyIncome(June-2018)']
Delivery = emp_data.loc[emp_data['Skills'] == 'Delivery','MonthlyIncome(June-2018)']

Two-sample t-test to check whether the salary mean for Sales Skills is different from Delivery Skills

In [42]:

stats.ttest_ind(a=Sales,
                b=Delivery,
                equal_var=False)

Out[42]:

Ttest_indResult(statistic=0.8593087292757945, pvalue=0.39216156367042443)

Insight

As the p value (0.39) is greater than 0.05, we fail to reject the null hypothesis

The mean salary of employees in Sales & Delivery does not differ significantly
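
Passing `equal_var=False` makes `ttest_ind` run Welch's t-test, which does not pool the two variances and adjusts the degrees of freedom instead. The sketch below reproduces that calculation by hand; the income values and the helper name `welch_t` are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical monthly incomes for two skill groups of different sizes.
sales = np.array([4200.0, 5100.0, 3900.0, 6100.0, 4800.0, 5500.0, 4300.0, 5900.0])
delivery = np.array([4600.0, 4400.0, 5000.0, 4700.0, 4500.0])

def welch_t(a, b):
    """Welch's t statistic, Welch-Satterthwaite df, and two-sided p-value."""
    va = a.var(ddof=1) / a.size  # squared standard error of mean(a)
    vb = b.var(ddof=1) / b.size  # squared standard error of mean(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

t_manual, df_manual, p_manual = welch_t(sales, delivery)
t_scipy, p_scipy = stats.ttest_ind(sales, delivery, equal_var=False)

print(f"manual: t={t_manual:.4f}, df={df_manual:.2f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
```

Welch's version is a reasonable default when the groups differ in size or spread, as two skill groups of unequal size may.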

Paired T-Test

A paired two sample t-test compares the means of the same group at different points in time.

H₀ : The mean income for females is the same as in the previous year (2018).

μ₁ = previous year income mean
μ₂ = this year income mean
H₀ : μ₁ - μ₂ = 0

In [4]:

f2018_income = emp_data.loc[emp_data['Gender'] == 'Female','MonthlyIncome(June-2018)']
f2019_income = emp_data.loc[emp_data['Gender'] == 'Female','MonthlyIncome(June-2019)']

In [7]:

stats.ttest_rel( a=f2018_income,
                 b=f2019_income
                )

Out[7]:

Ttest_relResult(statistic=-19.671096009441541, pvalue=1.5006282761149835e-66)

Insight

As the p value (≈ 1.5e-66) is less than 0.05, reject the null hypothesis

The mean income for females differs from the previous year (the difference is significant)