CoCalc -- T-Test.ipynb

GitHub Repository: suyashi29/python-su
Path: blob/master/Stats/T-Test.ipynb

Kernel: Python 3 (ipykernel)

T-Test

1. Introduction to T-Test
2. Analysis
- 2.1 Two Independent Sample T-Test
  - 2.1.1 Hypothesis Statement
  - 2.1.2 Implementation
- 2.2 Non-Parametric Test
  - 2.2.1 Spearman Coefficient
  - 2.2.2 Implementation
- 2.3 One Sample Test
  - 2.3.1 Hypothesis
  - 2.3.2 Implementation
- 2.4 Two Sample Test
  - 2.4.1 Unpaired
  - 2.4.2 Paired

1. Introduction to T-Test

The T-test is a statistical significance test used to determine whether a numeric data sample differs significantly from the population or whether two groups have different average values (for example, whether men and women have different average heights). Statistical significance is determined by the size of the difference between the group averages, the sample size, and the standard deviations of the groups.

The goal is to determine whether there is a significant difference between the means of two groups, which may be related in certain features.

 - If the t-value is large (in absolute value) => the two groups likely come from populations with different means.

 - If the t-value is small => the data are consistent with the two groups sharing the same mean.

Terminologies in T-Test

Degrees of freedom (df) – the number of independent values that are free to vary when a statistic is estimated from the sample.
- for one sample: df = (size of the sample) - 1
  For example, if we have 1000 data points in the sample, then df = 1000 - 1 = 999.
- for two samples:
The df is calculated as follows (nA is the size of sample A and nB is the size of sample B):
```
   df = (nA-1) + (nB -1)
```
Significance level (α) – It is the probability of rejecting the null hypothesis when it is true. In simpler terms, it tells us about the percentage of risk involved in saying that a difference exists between two groups, when in reality it does not.

Types of t-tests: they are categorized as dependent and independent t-tests.

Independent samples t-test: compares the means of two groups.
Paired sample t-test: compares means from the same group at different times (say, one year apart).
One sample t-test: tests the mean of a single group against a known mean.

Why do we use a t-test?

To make inferences about the population beyond our data
To check whether the difference between the means of two samples is reliable
To compare machine learning algorithms

1. Independent sample t-test

Formula:

where,

t = t-value
A = Sample of A
B = Sample of B
μA = Mean of sample A
μB = Mean of sample B
nA = sample size of A
nB = sample size of B
df = degree of freedom
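
A common way to write this statistic, assuming equal variances in the two groups (the pooled-variance form) and using the symbols listed above, is sketched below. The pooled form is an assumption here; it is the one that matches the sums and sums of squares computed in the steps that follow.

```latex
t = \frac{\mu_A - \mu_B}
         {\sqrt{\left[\dfrac{\left(\sum A^2 - \dfrac{(\sum A)^2}{n_A}\right)
                           + \left(\sum B^2 - \dfrac{(\sum B)^2}{n_B}\right)}
                          {n_A + n_B - 2}\right]
                \left[\dfrac{1}{n_A} + \dfrac{1}{n_B}\right]}}
```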

Steps involved:

Step 1 - Find the sum of all values in each sample.
Step 2 - Square the sum values found in step 1.
Step 3 - Find the sum of square of individual values in each sample.
Step 4 - Calculate the mean of each sample.
Step 5 - Find the degree of freedom (df) using Eq-2.
Step 6 - Insert all the values found in Steps 1-4 into Eq-3 and find the calculated t-value.
Step 7 - Use the values of df and α (take α = 0.05 if not given) to look up the critical value (ttable) in the two-tailed t-table.
Step 8 - Compare the values of t found in Step 6 and Step 7.

If |tcal| > ttable => p < (α=0.05) => significant difference between two groups found.

If |tcal| < ttable => p > (α=0.05) => no significant difference between two groups.
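
The steps above can be sketched in Python. This is a minimal illustration that assumes the t-value is computed with the pooled-variance (equal-variance) form, which is what the sums and sums of squares in Steps 1-3 suggest. The two small samples `A` and `B` are made-up values, not taken from the employee data.

```python
import numpy as np
from scipy import stats

# Hypothetical samples, for illustration only
A = np.array([12.0, 15.0, 11.0, 14.0, 13.0, 16.0])
B = np.array([10.0, 9.0, 12.0, 11.0, 8.0, 10.0])
nA, nB = len(A), len(B)

# Steps 1-3: sum, squared sum, and sum of squares for each sample
sumA, sumB = A.sum(), B.sum()
sum_sqA, sum_sqB = (A ** 2).sum(), (B ** 2).sum()

# Step 4: sample means
meanA, meanB = sumA / nA, sumB / nB

# Step 5: degrees of freedom, df = (nA - 1) + (nB - 1)
df = (nA - 1) + (nB - 1)

# Step 6: pooled variance from the sums, then the t-value
ssA = sum_sqA - sumA ** 2 / nA  # sum of squared deviations in A
ssB = sum_sqB - sumB ** 2 / nB  # sum of squared deviations in B
pooled_var = (ssA + ssB) / df
t_cal = (meanA - meanB) / np.sqrt(pooled_var * (1 / nA + 1 / nB))

# Step 7: two-tailed critical value at alpha = 0.05
alpha = 0.05
t_table = stats.t.ppf(1 - alpha / 2, df)

# Step 8: compare |t_cal| with t_table
significant = abs(t_cal) > t_table

# Cross-check against SciPy's equal-variance test
t_scipy, p_scipy = stats.ttest_ind(A, B, equal_var=True)
print(t_cal, t_table, significant, p_scipy)
```

The manual t-value should agree with `stats.ttest_ind(..., equal_var=True)`, and comparing `|t_cal|` with `t_table` gives the same decision as comparing the p-value with α.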

2. Paired sample t-test

The paired sample t-test, also known as the dependent sample t-test, is used to find out whether the mean of the differences between two paired samples is 0.
The test is done on dependent samples, usually focusing on a particular group of people or things. Each entity is measured twice, resulting in a pair of observations.

We can use this when:

Two matched (twin-like) samples are given, with each observation in one sample paired with one in the other.
The dependent variable (data) is continuous.
The pairs are independent of one another.
The differences between the paired values are approximately normally distributed.
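
A paired t-test is equivalent to a one-sample t-test on the per-pair differences against 0. The sketch below illustrates this with hypothetical before/after measurements; the arrays `before` and `after` are invented for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements of the same six subjects at two points in time
before = np.array([72.0, 80.0, 65.0, 90.0, 77.0, 84.0])
after = np.array([75.0, 82.0, 64.0, 95.0, 80.0, 85.0])

# Per-pair differences; the null hypothesis is that their mean is 0
diff = after - before
n = len(diff)
df = n - 1
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

# SciPy's paired test and the equivalent one-sample test on the differences
t_rel, p_rel = stats.ttest_rel(after, before)
t_one, p_one = stats.ttest_1samp(diff, popmean=0)
print(t_manual, t_rel, p_rel, df)
```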

3. One sample t-test

The one sample t-test is widely used to compare the sample mean of the data to a specific given value, typically the true/population mean.

Use it when:

The sample size is small (under 30).
The data is collected randomly.
The data is approximately normally distributed.

Formula
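
The one-sample statistic is commonly written as follows, where x̄ is the sample mean, μ is the known (population) mean being tested against, s is the sample standard deviation, and n is the sample size. The symbol names are chosen here for illustration.

```latex
t = \frac{\bar{x} - \mu}{s / \sqrt{n}}, \qquad df = n - 1
```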

Steps:

Step 1 - Define the null (H0) and alternative (H1) hypotheses.
Step 2 - Calculate the sample mean if it is not given. (The population mean, standard deviation, and n are given.)
Step 3 - Put the values from Step 2 into Eq-5 and calculate the t-value (tcal).
Step 4 - Calculate the degrees of freedom (df = n - 1, as in the paired sample t-test).
Step 5 - Take α = 0.05 if not given. Use the values of df and α to find ttable from the one-tailed t-table.
Step 6 - Compare the values of t found in Step 3 and Step 5.
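
The steps above can be sketched as follows, assuming the usual one-sample statistic t = (sample mean - population mean) / (s / sqrt(n)). The sample values and `pop_mean` are hypothetical, and the alternative hypothesis is taken to be "mean greater than `pop_mean`" so that the one-tailed table from Step 5 applies.

```python
import numpy as np
from scipy import stats

# Hypothetical sample and hypothesised population mean (assumptions)
sample = np.array([5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.7])
pop_mean = 5.0

# Steps 2-3: sample mean, sample standard deviation, and t-value
n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)
t_cal = (x_bar - pop_mean) / (s / np.sqrt(n))

# Step 4: degrees of freedom
df = n - 1

# Step 5: one-tailed critical value at alpha = 0.05
alpha = 0.05
t_table = stats.t.ppf(1 - alpha, df)

# Step 6: compare t_cal with t_table (H1: mean > pop_mean)
reject_h0 = t_cal > t_table

# Cross-check: SciPy reports a two-sided p-value by default
t_scipy, p_two_sided = stats.ttest_1samp(sample, popmean=pop_mean)
print(t_cal, t_table, reject_h0, p_two_sided)
```

Because `t_cal` is positive here, the one-tailed p-value is half of SciPy's two-sided p-value.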

Analysis

Two Independent Sample T-test

The independent t-test is used to test whether the means of two groups are significantly different from each other, using the means from randomly drawn samples.

What do you need to run an independent t-test?

In order to run an independent t-test, you need the following:

One independent, categorical variable that has two levels/groups.
One continuous dependent variable.

Hypothesis Statement:

The null hypothesis for the independent t-test is that the population means of the two unrelated groups are equal: H₀ : μ1 = μ2

From a sample of employee data (fictitious data), let's find the significance of the relationship between Gender, Performance & Work Life Balance in the company.

H₀ : mean1 = mean2 # Means of the two groups are the same
H_A : mean1 != mean2 # Means of the two groups are significantly different

Which error would you say is more serious?

A false positive (type I error): when you reject a true null hypothesis
A false negative (type II error): when you fail to reject a false null hypothesis
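
The significance level α is the type I error rate: when the null hypothesis is true, roughly a fraction α of tests will still reject it. The simulation below is a small sketch of this, drawing both groups from the same normal distribution; the distribution parameters, group size, and number of trials are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_trials = 2000

# Both groups come from the same distribution, so H0 is true in every trial
a = rng.normal(loc=3.0, scale=1.0, size=(n_trials, 30))
b = rng.normal(loc=3.0, scale=1.0, size=(n_trials, 30))

# One independent t-test per row
_, p_values = stats.ttest_ind(a, b, axis=1)

# Fraction of trials that (wrongly) reject H0: should be close to alpha
false_positive_rate = np.mean(p_values < alpha)
print(false_positive_rate)
```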

Importing Required Packages

In [17]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcdefaults()  # reset Matplotlib settings to their defaults
import seaborn as sns
from scipy.stats import ttest_ind #to run the t-test for independent samples
from scipy import stats
from scipy.stats import spearmanr #to run spearman
%matplotlib inline
emp_data = pd.read_excel("Emp_data.xlsx")

In [18]:

emp_data.head()

Out[18]:

In [19]:

emp_data.shape

Out[19]:

(1599, 9)

In [20]:

emp_data.isnull().sum()

Out[20]:

Gender                      0
Employee Number             0
Skills                      0
Total Working Years         0
Work Life Balance           0
Performance Rating          0
Years At Company            0
MonthlyIncome(June-2018)    0
MonthlyIncome(June-2019)    0
dtype: int64

In [21]:

# First, divide the data into two groups based on gender
G1_Female = emp_data[emp_data['Gender'] == 'Female']
G2_Male = emp_data[emp_data['Gender'] == 'Male']

In [22]:

sns.violinplot(x="Gender", y="Performance Rating", data=emp_data)

Out[22]:

<AxesSubplot:xlabel='Gender', ylabel='Performance Rating'>

Let us run an independent two-sample t-test to check whether performance differs significantly between G1 (Female) and G2 (Male)

H0: Performance is the same for G1 and G2
HA: There is a difference in performance

In [23]:

G1_Female['Performance Rating'].mean()

Out[23]:

3.4730822873082143

In [24]:

G2_Male['Performance Rating'].mean()

Out[24]:

3.4546485260770976

In [25]:

ttest_ind(G1_Female['Performance Rating'], G2_Male['Performance Rating'], nan_policy='omit')

Out[25]:

Ttest_indResult(statistic=0.5599188436198852, pvalue=0.5756133206144642)

Insight

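As the t-statistic is small (0.56) and the p-value (0.576) is greater than 0.05, we fail to reject the null hypothesis: there is no significant difference in Performance Rating between the two groups.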
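
The tcal versus ttable comparison described earlier can be reproduced for this result. The sketch below assumes the default equal-variance test, so df = (nA - 1) + (nB - 1); the group sizes 717 and 882 are the ones implied by the outputs elsewhere in this notebook (1599 rows in total).

```python
from scipy import stats

t_cal = 0.5599188436198852    # statistic reported by ttest_ind above
n_female, n_male = 717, 882   # group sizes implied by the notebook outputs
df = (n_female - 1) + (n_male - 1)

alpha = 0.05
t_table = stats.t.ppf(1 - alpha / 2, df)   # two-tailed critical value
p_value = 2 * stats.t.sf(abs(t_cal), df)   # two-tailed p-value

# |t_cal| is well below t_table, so the null hypothesis is not rejected
print(df, t_table, p_value, abs(t_cal) > t_table)
```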

Let's check whether there is a significant difference in Work Life Balance between males and females

In [26]:

#let's see if there are differences in Work Life Balance 
ttest_ind(G1_Female['Work Life Balance'], G2_Male['Work Life Balance'], nan_policy='omit')

Out[26]:

Ttest_indResult(statistic=2.647372675704662, pvalue=0.00819178413824507)

Insight

The t-value (2.65) and p-value (0.008 < 0.05) support rejecting the null hypothesis, which means there is a significant difference in Work Life Balance (WLB) between G1 and G2

In [29]:

G2_Male['Work Life Balance'].mean()

Out[29]:

2.878684807256236

In [28]:

## Let's check the mean Work Life Balance (WLB) for each group
G1_Female['Work Life Balance'].mean()

Out[28]:

2.9860529986052997

Let's visualize the distribution of men by WLB rating, since this difference was significant

In [31]:


M_WLB = G2_Male['Work Life Balance'].value_counts()
print(M_WLB)

Out[31]:

3.0    484
2.0    192
4.0    105
4.5     52
1.0     49
Name: Work Life Balance, dtype: int64

In [44]:

objects = ('Bad', 'Average', 'Good','Excellent',)
y_pos = np.arange(len(objects))
performance = [241,484,105,52]
 
plt.bar(y_pos, performance, align='center', color = 'green')
plt.xticks(y_pos, objects)
plt.ylabel('Frequency of WLB')
plt.title('Males WLB in Company')
 
plt.show()

Out[44]:

In [41]:

F_WLB = G1_Female['Work Life Balance'].value_counts()
print(F_WLB)

Out[41]:

3.0    365
4.0    186
2.0    136
1.0     30
Name: Work Life Balance, dtype: int64

Let's visualize the distribution of women by WLB rating, since this difference was significant

In [46]:

objects = ('Bad', 'Average', 'Good','Excellent',)
y_pos = np.arange(len(objects))
performance = [166,365,186,0]
 
plt.bar(y_pos, performance, align='center', color = 'pink')
plt.xticks(y_pos, objects)
plt.ylabel('Frequency of WLB')
plt.title('Females WLB in Company')
 
plt.show()

Out[46]:

Non-Parametric Test

A non-parametric test is one that makes no assumption about the population parameters or the distribution the data come from.

Spearman

Measures the strength and direction (positive, negative, or no relation) of the association between two variables.

Why Spearman?

We run Spearman's rank correlation to find the relationship between variables.
This test is applicable to ordinal (ranked) categorical data and to continuous data; nominal categories have no natural order to rank.
The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of the two variables.
The correlation coefficient ranges from -1.0 to 1.0, so its absolute value is at most 1.0.
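
Spearman's coefficient is the Pearson correlation computed on the ranks of the data rather than on the raw values. The sketch below shows this on two small arrays of hypothetical ordinal ratings; the values are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical ordinal ratings for eight employees
work_life = np.array([1, 2, 2, 3, 3, 4, 4, 5])
performance = np.array([5, 4, 4, 3, 4, 2, 3, 1])

# Rank each variable (tied values receive their average rank)
rank_x = stats.rankdata(work_life)
rank_y = stats.rankdata(performance)

# Pearson correlation of the ranks equals Spearman's coefficient
rho_manual = np.corrcoef(rank_x, rank_y)[0, 1]
rho_scipy, p_value = stats.spearmanr(work_life, performance)
print(rho_manual, rho_scipy, p_value)
```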

H₀ : There is no association between Work Life Balance & Performance Rating

Let's use the Spearman test to examine the relationship between Work Life Balance & Performance Rating

Interpreting the coefficient: -1 = negative trend, 0 = no association, 1 = positive trend

In [50]:

x = (emp_data['Work Life Balance'])
y = (emp_data['Performance Rating'])
spearmanr(x,y)

Out[50]:

SignificanceResult(statistic=-0.2314514515335428, pvalue=6.883911870509208e-21)

Insight

Weak negative correlation (coefficient ≈ -0.23) between Work Life Balance & Performance Rating
The p-value (≈ 6.9e-21) is far below 0.05, so the result favours the alternative hypothesis

In [48]:

x = (emp_data['Skills'])
y = (emp_data['MonthlyIncome(June-2018)'])
spearmanr(x,y)

Out[48]:

SignificanceResult(statistic=0.03091689979127544, pvalue=0.2166005785416596)

Insight

No significant correlation between Skills & Monthly Income (coefficient ≈ 0.03, p ≈ 0.22 > 0.05)
Favours the null hypothesis

One-Sample T-Test

A One sample t-test tests the mean of a single group against a known mean.

Hypothesis Statements

H₀ : Average salary of employees matches population mean

In [51]:

pop = emp_data['MonthlyIncome(June-2018)'] # June-2018 incomes; their mean is used as the population mean

In [52]:

# Use a separate name for the Axes so the pyplot module `plt` is not overwritten
ax = emp_data['MonthlyIncome(June-2018)'].hist(bins=20)
ax.set_ylabel('Frequency')
ax.set_xlabel('Salary')
ax.set_title('Salary Distribution for 2018 june month',size=17, y=.05,x=1.05,alpha=0.05)

Out[52]:

Text(1.05, 0.05, 'Salary Distribution for 2018 june month')

In [53]:

samp=emp_data['MonthlyIncome(June-2019)']

In [54]:

# Use a separate name for the Axes so the pyplot module `plt` is not overwritten
ax = emp_data['MonthlyIncome(June-2019)'].hist(bins=20)
ax.set_ylabel('Frequency')
ax.set_xlabel('Salary')
ax.set_title('Salary Distribution for 2019 june month',size=17, y=1.08)

Out[54]:

Text(0.5, 1.08, 'Salary Distribution for 2019 june month')

Let's check the significance of the difference between the given means

In [55]:

stats.ttest_1samp(a=samp,popmean=pop.mean())

Out[55]:

TtestResult(statistic=10.437050076220812, pvalue=1.0251793002869577e-24, df=1598)

Insight

As the p-value is less than 0.05, we reject the null hypothesis.

There is a significant difference.

Hence we can conclude that the average salary of employees does not match the population mean.

Two-Sample T-Test

Employee skill set across the company

In [57]:

F_Skills = emp_data['Skills'].value_counts()
print(F_Skills)

Out[57]:

Sales                     606
Content Writing           310
Research                  292
Employee relations        157
Workforce Management      102
Delivery                   80
Performance Management     52
Name: Skills, dtype: int64

# Plot the skill counts computed above (F_Skills) rather than hard-coded values
colors = ['green', 'brown', 'gray', 'yellow', 'blue', 'red', 'pink']
y_pos = np.arange(len(F_Skills))
plt.bar(y_pos, F_Skills.values, align='center', color=colors)
plt.xticks(y_pos, F_Skills.index, rotation=45, ha='right')
plt.ylabel('Number of Employees')
plt.title('Skills across company')

plt.show()

Unpaired T-Test

An unpaired two-sample t-test compares the means of two independent groups.

H₀ : Mean monthly income is the same for Sales & Content Writing skills

In [61]:

Sales = emp_data.loc[emp_data['Skills'] == 'Sales','MonthlyIncome(June-2018)']
Content_Writing = emp_data.loc[emp_data['Skills'] == 'Content Writing','MonthlyIncome(June-2018)']

Two-sample t-test to check whether the mean salary for Sales skills is different from that for Content Writing skills

In [63]:

stats.ttest_ind(a=Sales,
                b=Content_Writing,
                equal_var=False)

Out[63]:

Ttest_indResult(statistic=0.8498583140880666, pvalue=0.39570737102948617)

Insight

As the p-value (0.396) is **greater** than 0.05, we fail to reject the null hypothesis.

There is no significant difference in salaries between the two groups.
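
Passing `equal_var=False` to `ttest_ind` runs Welch's t-test, which does not pool the two variances and uses the Welch-Satterthwaite approximation for the degrees of freedom. The sketch below computes both by hand on simulated data; the group sizes, means, and spreads are arbitrary assumptions, not the employee incomes.

```python
import numpy as np
from scipy import stats

# Simulated incomes for two groups with unequal sizes and variances
rng = np.random.default_rng(1)
group_a = rng.normal(loc=5000, scale=1500, size=60)
group_b = rng.normal(loc=5200, scale=600, size=25)

na, nb = len(group_a), len(group_b)
se2_a = group_a.var(ddof=1) / na  # squared standard error of mean A
se2_b = group_b.var(ddof=1) / nb  # squared standard error of mean B

# Welch's t-statistic: variances are kept separate, not pooled
t_welch = (group_a.mean() - group_b.mean()) / np.sqrt(se2_a + se2_b)

# Welch-Satterthwaite degrees of freedom (generally not an integer)
df_welch = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
p_welch = 2 * stats.t.sf(abs(t_welch), df_welch)

# Cross-check against SciPy
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_welch, df_welch, p_welch)
```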

Paired T-Test

A paired t-test is a two-sample t-test within one group at different points in time.

H₀ : Mean income for females is the same as in the previous year (2018).

μ₁ = previous year's mean income
μ₂ = this year's mean income
H₀ : μ₁ - μ₂ = 0

In [64]:

f2018_income = emp_data.loc[emp_data['Gender'] == 'Female','MonthlyIncome(June-2018)']
f2019_income = emp_data.loc[emp_data['Gender'] == 'Female','MonthlyIncome(June-2019)']

In [65]:

stats.ttest_rel( a=f2018_income,
                 b=f2019_income
                )

Out[65]:

TtestResult(statistic=-11.958870987033315, pvalue=3.526405752958248e-30, df=716)

Insight

As the p-value (≈ 3.5e-30) is less than 0.05, we reject the null hypothesis.

The mean income for females differs from the previous year (the difference is significant).