T-Test
2.1 Two Independent Sample T-Test
2.1.1 Hypothesis Statement
2.1.2 Implementation
2.2 Non-Parametric Test
2.2.1 Spearman Coefficient
2.2.2 Implementation
2.3 One Sample Test
2.3.1 Hypothesis
2.3.2 Implementation
2.4 Two Sample Test
1. Introduction to T-Test
The T-test is a statistical significance test used to determine whether a numeric data sample differs significantly from the population or whether two groups have different average values (for example, whether men and women have different average heights). Statistical significance is determined by the size of the difference between the group averages, the sample size, and the standard deviations of the groups.
It is used to determine whether there is a significant difference between the means of two groups, which may be related in certain features.
Terminologies in T-Test
Degree of freedom (df) – The number of independent values in a sample that are free to vary when an estimate is calculated. For a single sample of size n:
df = n - 1 (Eq-1)
For example, if a sample has 1000 data points, then df = 1000 - 1 = 999.
For two samples, the df is calculated as follows (nA is the size of sample A and nB is the size of sample B):
df = nA + nB - 2 (Eq-2)
Significance level (α) – It is the probability of rejecting the null hypothesis when it is true. In simpler terms, it tells us about the percentage of risk involved in saying that a difference exists between two groups, when in reality it does not.
There are three types of t-tests, and they are categorized as dependent and independent t-tests.
Independent samples t-test: compares the means for two groups.
Paired sample t-test: compares means from the same group at different times (say, one year apart).
One sample t-test: tests the mean of a single group against a known mean.
Why do we use a t-test?
To make inferences about the population beyond our data
To check whether the difference between the means of two samples is reliable
To compare machine learning algorithms
1. Independent sample t-test
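Formula (Eq-3):
t = (μA - μB) / sqrt( [ ((ΣA² - (ΣA)²/nA) + (ΣB² - (ΣB)²/nB)) / (nA + nB - 2) ] × (1/nA + 1/nB) )
where,
t = t-value
A = values of sample A (ΣA is their sum, ΣA² is the sum of their squares)
B = values of sample B (ΣB is their sum, ΣB² is the sum of their squares)
μA = Mean of sample A
μB = Mean of sample B
nA = sample size of A
nB = sample size of B
df = degree of freedom (nA + nB - 2)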
Steps involved:
Step 1 - Find the sum of all values in each sample.
Step 2 - Square the sum values found in step 1.
Step 3 - Find the sum of square of individual values in each sample.
Step 4 - Calculate the mean of each sample.
Step 5 - Find the degree of freedom (df) using Eq-2.
Step 6 - Insert the values found in Steps 1-5 into Eq-3 to get the calculated t-value (tcal).
Step 7 - Use the values of df and α (take α = 0.05 if not given) to look up the critical t-value (ttable) in the two-tailed t-table.
Step 8 - Compare the values of t found in Step 6 and Step 7:
If |tcal| > ttable => p < α (0.05) => there is a significant difference between the two groups.
If |tcal| < ttable => p > α (0.05) => there is no significant difference between the two groups.
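The steps above can be sketched in Python as follows. This is a minimal illustration, not the original notebook's code: the two samples are made-up numbers, `pooled_t` is a hypothetical helper name, and the critical value comes from `scipy.stats.t.ppf` instead of a printed t-table. The result is cross-checked against `scipy.stats.ttest_ind`, which assumes equal variances by default.

```python
import math
from scipy import stats

# Hypothetical measurements for two independent groups (illustrative only).
sample_a = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5, 5.3]
sample_b = [4.2, 4.8, 5.0, 4.4, 4.9, 4.1, 4.6]


def pooled_t(a, b):
    """Follow Steps 1-6: return the calculated t-value and the df."""
    n_a, n_b = len(a), len(b)
    sum_a, sum_b = sum(a), sum(b)                  # Step 1: sum of values
    sq_sum_a, sq_sum_b = sum_a ** 2, sum_b ** 2    # Step 2: square of the sums
    sum_sq_a = sum(x * x for x in a)               # Step 3: sum of squares
    sum_sq_b = sum(x * x for x in b)
    mean_a, mean_b = sum_a / n_a, sum_b / n_b      # Step 4: sample means
    df = n_a + n_b - 2                             # Step 5: degrees of freedom
    # Step 6: pooled variance, then the t-value
    pooled_var = ((sum_sq_a - sq_sum_a / n_a) + (sum_sq_b - sq_sum_b / n_b)) / df
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


alpha = 0.05
t_cal, df = pooled_t(sample_a, sample_b)
t_table = stats.t.ppf(1 - alpha / 2, df)   # Step 7: two-tailed critical value
significant = abs(t_cal) > t_table         # Step 8: compare |tcal| with ttable

# Cross-check with SciPy's pooled-variance independent t-test.
t_ref, p_ref = stats.ttest_ind(sample_a, sample_b)
print(f"tcal={t_cal:.3f}, ttable={t_table:.3f}, df={df}, significant={significant}")
print(f"scipy: t={t_ref:.3f}, p={p_ref:.4f}")
```

Comparing |tcal| with ttable and comparing the p-value with α lead to the same decision, so in practice the p-value returned by the library is usually enough.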
2. Paired sample t-test
The paired sample t-test, commonly known as the dependent sample t-test, is used to test whether the mean difference between two sets of paired observations is 0.
The test is done on dependent samples, usually focusing on a particular group of people or things. Each entity is measured twice, resulting in a pair of observations.
We can use this when:
Two matched (twin-like) samples are given.
The dependent variable (data) is continuous.
The pairs of observations are independent of one another.
The differences between the paired values are approximately normally distributed.
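A paired test can be sketched as a one-sample test on the per-entity differences, using t = mean(d) / (sd(d) / sqrt(n)) with df = n - 1. The `before` and `after` values below are invented for illustration, and the manual result is compared with `scipy.stats.ttest_rel`.

```python
import math
import statistics
from scipy import stats

# Hypothetical paired measurements: the same 8 entities measured twice.
before = [72.0, 68.5, 80.2, 75.1, 69.9, 77.3, 71.4, 74.0]
after = [70.1, 67.9, 78.0, 74.8, 68.2, 75.9, 71.0, 72.5]

# Work on the per-entity differences.
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)          # sample standard deviation (n - 1)

t_cal = mean_d / (sd_d / math.sqrt(n))  # paired t-value
df = n - 1

# Cross-check with SciPy's paired (related samples) t-test.
t_ref, p_ref = stats.ttest_rel(after, before)
print(f"tcal={t_cal:.3f}, df={df}, scipy t={t_ref:.3f}, p={p_ref:.4f}")
```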
3. One sample t-test
The one sample t-test is one of the most widely used t-tests. It compares the sample mean of the data to a given value, typically the true/population mean.
Use it when:
The sample size is small (under 30).
The data is collected randomly.
The data is approximately normally distributed.
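Formula (Eq-5):
t = (x̄ - μ) / (s / sqrt(n))
where x̄ is the sample mean, μ is the population mean, s is the standard deviation and n is the sample size.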
Steps:
Step 1 - Define the null (H0) and alternative (H1) hypotheses.
Step 2 - Calculate the sample mean if it is not given. (The population mean, standard deviation and n are given.)
Step 3 - Put the values found in Step 2 into Eq-5 and calculate the t-value (tcal).
Step 4 - Calculate the degree of freedom: df = n - 1 (the same as in the paired sample t-test).
Step 5 - Take α = 0.05 if not given. Use the values of df and α to find ttable from the one-tailed t-table.
Step 6 - Compare the values of t found in Step 3 and Step 5.
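The one-sample steps can be sketched as below. The sample values, the hypothesised mean `mu0` and the upper-tailed alternative (H1: mean > mu0) are assumptions made for illustration; the one-tailed critical value is taken from `scipy.stats.t.ppf` rather than a printed table.

```python
import math
import statistics
from scipy import stats

# Hypothetical sample and hypothesised population mean (illustrative only).
sample = [48.2, 51.5, 49.8, 52.1, 50.9, 47.6, 53.0, 50.2, 49.1, 51.7]
mu0 = 50.0
alpha = 0.05

# Step 1: H0: mean == mu0, H1: mean > mu0 (assumed upper-tailed alternative).
n = len(sample)
x_bar = statistics.mean(sample)              # Step 2: sample mean
s = statistics.stdev(sample)                 # sample standard deviation
t_cal = (x_bar - mu0) / (s / math.sqrt(n))   # Step 3: t-value
df = n - 1                                   # Step 4: degrees of freedom
t_table = stats.t.ppf(1 - alpha, df)         # Step 5: one-tailed critical value
reject_h0 = t_cal > t_table                  # Step 6: compare tcal and ttable

# Equivalent decision via the one-sided p-value, plus SciPy's t-value as a check.
p_one_sided = stats.t.sf(t_cal, df)
t_ref, p_two_sided = stats.ttest_1samp(sample, mu0)
print(f"tcal={t_cal:.3f}, ttable={t_table:.3f}, reject H0: {reject_h0}")
```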
Analysis
Two Independent Sample T-test
The independent t-test is used to test whether the means of two groups are significantly different from each other, using the means from randomly drawn samples.
What do you need to run an independent t-test?
In order to run an independent t-test, you need the following: One independent, categorical variable that has two levels/groups. One continuous dependent variable.
Hypothesis Statement:
The null hypothesis for the independent t-test is that the population means from the two unrelated groups are equal: H0: μ1 = μ2
From a sample of employee data (fictitious data), let's find the significance of the relationship between Gender, Performance and Work Life Balance in the company.
H0 : mean1 = mean2 # The means of the two groups are the same
HA : mean1 != mean2 # The two groups are significantly different
Which error would you say is more serious?
A false positive (type I error) — when you reject a true null hypothesis
A false negative (type II error) — when you fail to reject a false null hypothesis
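To make the type I error concrete, the sketch below simulates many experiments in which the null hypothesis is true (both groups are drawn from the same normal distribution) and counts how often the t-test still rejects it. The sample size, number of trials and seed are arbitrary choices for illustration; the observed rate should land close to α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
alpha = 0.05
n_trials = 2000

false_positives = 0
for _ in range(n_trials):
    # Both groups share the same population, so H0 is true by construction.
    group_1 = rng.normal(loc=0.0, scale=1.0, size=30)
    group_2 = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = stats.ttest_ind(group_1, group_2)
    if p_value < alpha:
        false_positives += 1     # rejected a true null: type I error

type_i_rate = false_positives / n_trials
print(f"Observed type I error rate: {type_i_rate:.3f} (alpha = {alpha})")
```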
Importing Required Packages
Let us run an independent two-sample t-test to check the significance of the difference in performance between G1 (Female) and G2 (Male).
H0 : Performance is the same for G1 and G2
HA : There is a difference in performance between G1 and G2
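A test like this might be run as sketched below. The DataFrame is a small made-up stand-in for the employee data, and the column names `Gender` and `PerformanceRating` are assumptions, not the names used in the original dataset.

```python
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the employee data (column names are assumptions).
employees = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female",
               "Male", "Female", "Male", "Female", "Male"],
    "PerformanceRating": [3, 3, 4, 3, 3, 4, 3, 3, 4, 4],
})

# Split the dependent variable by the two levels of the categorical variable.
g1 = employees.loc[employees["Gender"] == "Female", "PerformanceRating"]
g2 = employees.loc[employees["Gender"] == "Male", "PerformanceRating"]

# Independent two-sample t-test (two-sided, equal variances assumed by default).
t_stat, p_value = stats.ttest_ind(g1, g2)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t={t_stat:.3f}, p={p_value:.3f} -> {decision}")
```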
Insight
As the t-statistic is less than 1 and the p-value is greater than 0.05, we fail to reject the null hypothesis.
Let's check if there is a significant difference in Work Life Balance between males and females.
Insight
The t-value and p-value support rejecting the null hypothesis, which means there is a significant difference in Work Life Balance between G1 and G2.
Let's visualize the distribution of men according to Work Life Balance rating, since this difference was significant.
Let's visualize the distribution of women according to Work Life Balance rating, since this difference was significant.
Non-Parametric Test
A non-parametric test is one in which, even after looking at the data, we make no assumption about the population parameters.
Spearman
Measures the strength of association between two variables and the direction of the relationship (positive, negative or no relation).
Why Spearman?
We run Spearman's rank correlation to find the relation between variables.
This test is applicable to ordinal (ranked) data. Because it relies on ranks, it is not meaningful for nominal categories that have no natural order.
The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of the two variables.
The correlation coefficient is bounded by 1.0 in absolute value, so it ranges from -1.0 to 1.0.
H0 : There is no association between WorkLife Balance & Performance Rating
Let's use the Spearman test to check the relation between Work Life Balance & Performance Rating.
-1: negative trend, 0: no association, 1: positive trend
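A minimal sketch of the Spearman test with `scipy.stats.spearmanr` is shown below. The two rating lists are invented ordinal values, not the employee data. The sketch also checks that Spearman's coefficient equals the Pearson correlation computed on the ranks of the data.

```python
from scipy import stats

# Hypothetical ordinal ratings on a 1-4 scale (illustrative only).
work_life_balance = [1, 2, 2, 3, 3, 4, 4, 1, 2, 3]
performance_rating = [4, 3, 4, 3, 2, 2, 1, 4, 3, 3]

# Spearman's rank correlation and the p-value for H0: no association.
rho, p_value = stats.spearmanr(work_life_balance, performance_rating)

# Same coefficient obtained as the Pearson correlation of the ranks.
ranks_x = stats.rankdata(work_life_balance)
ranks_y = stats.rankdata(performance_rating)
r_ranks, _ = stats.pearsonr(ranks_x, ranks_y)

if rho > 0:
    trend = "positive trend"
elif rho < 0:
    trend = "negative trend"
else:
    trend = "no association"
print(f"rho={rho:.3f}, p={p_value:.4f}, {trend}")
```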
Insight
There is a negative correlation between Work Life Balance & Performance Rating.
This favours the alternative hypothesis.
Insight
There is no correlation between Skills & Monthly Income.
This favours the null hypothesis.
One-Sample T-Test
Let's check the significance of the difference between the sample mean and the given population mean.
Insight
As the p-value is less than 0.05, we reject the null hypothesis.
There is a significant difference.
Hence we can conclude that the average salary of employees does not match the population mean.
Two-Sample T-Test
Employee skill set across the company
A two-sample t-test to check whether the mean salary for Sales Skills is different from that for Delivery Skills
Insight
As the p-value (0.49) is **greater** than 0.05, we fail to reject the null hypothesis.
There is no significant difference in salaries.
Insight
As the p-value is less than 0.05 (8.09), we reject the null hypothesis.