Table of Contents
2.1 Two Independent Sample T-Test
2.1.1 Hypothesis Statement
2.1.2 Implementation
2.2.1 Spearman Coefficient
2.2.2 Implementation
2.3 One Sample Test
2.3.1 Hypothesis
2.3.2 Implementation
2.4 Two Sample Test
1. Introduction to T-Test
The T-test is a statistical significance test used to determine whether a numeric data sample differs significantly from the population or whether two groups have different average values (for example, whether men and women have different average heights). Statistical significance is determined by the size of the difference between the group averages, the sample size, and the standard deviations of the groups.
The goal is to determine whether there is a significant difference between the means of two groups, which may be related in certain features.
Terminologies in T-Test
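Degree of freedom (df) – It tells us the number of independent values that are free to vary when calculating an estimate, such as the difference between 2 sample groups. Each sample S of size n_S contributes n_S - 1:
$$df = \sum_{S} (n_S - 1) \qquad \text{(Eq-2)}$$
For two samples A and B, the df would be calculated as
$$df = (n_A - 1) + (n_B - 1) = n_A + n_B - 2$$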
Significance level (α) – It is the probability of rejecting the null hypothesis when it is true. In simpler terms, it tells us about the percentage of risk involved in saying that a difference exists between two groups, when in reality it does not.
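As a minimal sketch of how df and α work together, the snippet below looks up critical t-values with `scipy.stats.t.ppf` instead of a printed t-table. The sample sizes are hypothetical, and the two-sample df is assumed to be nA + nB - 2.

```python
from scipy import stats

alpha = 0.05          # significance level
n_a, n_b = 12, 15     # hypothetical sample sizes, for illustration only
df = n_a + n_b - 2    # degrees of freedom for two independent samples

# Two-tailed critical value: alpha is split equally across both tails.
t_crit_two_tailed = stats.t.ppf(1 - alpha / 2, df)
# One-tailed critical value: all of alpha sits in a single tail.
t_crit_one_tailed = stats.t.ppf(1 - alpha, df)

print(f"df = {df}")
print(f"two-tailed critical t = {t_crit_two_tailed:.3f}")
print(f"one-tailed critical t = {t_crit_one_tailed:.3f}")
```

A calculated t-value whose absolute value exceeds the two-tailed critical value corresponds to p < α.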
There are three types of t-tests, and they are categorized as dependent and independent t-tests.
Independent samples t-test: compares the means for two groups.
Paired sample t-test: compares means from the same group at different times (say, one year apart).
One sample t-test: tests the mean of a single group against a known mean.
Why do we use a t-test?
To make inferences about the population beyond our data
To check whether the difference between the means of two samples is reliable
To compare machine learning algorithms
1. Independent sample t-test
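Formula (Eq-3):
$$t = \frac{\mu_A - \mu_B}{\sqrt{\left[\dfrac{\left(\sum A^2 - \frac{(\sum A)^2}{n_A}\right) + \left(\sum B^2 - \frac{(\sum B)^2}{n_B}\right)}{n_A + n_B - 2}\right]\left(\dfrac{1}{n_A} + \dfrac{1}{n_B}\right)}}$$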
where,
t = t-value
A = Sample of A
B = Sample of B
μA = Mean of sample A
μB = Mean of sample B
nA = sample size of A
nB = sample size of B
df = degree of freedom
Steps involved:
Step 1 - Find the sum of all values in each sample.
Step 2 - Square the sum values found in step 1.
Step 3 - Find the sum of square of individual values in each sample.
Step 4 - Calculate the mean of each sample.
Step 5 - Find the degree of freedom (df) using Eq-2.
Step 6 - Insert all the values found in Steps 1-5 into Eq-3 and find the calculated t-value (tcal).
Step 7 - Use the values of df and α (take α = 0.05 if not given) to look up the critical t-value (ttable) in the two-tailed t-table.
Step 8 - Compare the values of t found in Step 6 and Step 7:
If |tcal| > ttable => p < (α=0.05) => significant difference between the two groups found.
If |tcal| < ttable => p > (α=0.05) => no significant difference between the two groups.
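The steps above can be sketched in Python as follows. This is an illustrative implementation that assumes the pooled-variance (equal variance) form of the t statistic; the function name `independent_t_test` and both data lists are hypothetical. The result is cross-checked against `scipy.stats.ttest_ind`, which uses the same pooled form by default.

```python
import math
from scipy import stats

def independent_t_test(a, b, alpha=0.05):
    """Pooled-variance two-sample t-test, following the manual steps."""
    n_a, n_b = len(a), len(b)
    # Steps 1-2: sum of the values in each sample, and the square of that sum.
    sum_a, sum_b = sum(a), sum(b)
    sq_sum_a, sq_sum_b = sum_a ** 2, sum_b ** 2
    # Step 3: sum of the squared individual values.
    sum_sq_a = sum(x * x for x in a)
    sum_sq_b = sum(x * x for x in b)
    # Step 4: sample means.
    mean_a, mean_b = sum_a / n_a, sum_b / n_b
    # Step 5: degrees of freedom.
    df = n_a + n_b - 2
    # Step 6: pooled variance, then the calculated t-value.
    ss_a = sum_sq_a - sq_sum_a / n_a
    ss_b = sum_sq_b - sq_sum_b / n_b
    pooled_var = (ss_a + ss_b) / df
    t_cal = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    # Step 7: two-tailed critical value (replaces the t-table lookup).
    t_table = stats.t.ppf(1 - alpha / 2, df)
    # Step 8: compare the two values.
    significant = abs(t_cal) > t_table
    return t_cal, t_table, df, significant

# Hypothetical measurements for two independent groups.
group_a = [12, 15, 11, 18, 14, 16]
group_b = [20, 22, 19, 24, 21, 23]

t_cal, t_table, df, significant = independent_t_test(group_a, group_b)
reference = stats.ttest_ind(group_a, group_b)  # equal_var=True by default

print(f"t_cal = {t_cal:.3f}, t_table = {t_table:.3f}, df = {df}")
print(f"scipy t = {reference.statistic:.3f}, p = {reference.pvalue:.5f}")
print("significant difference" if significant else "no significant difference")
```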
2. Paired sample t-test
Paired sample t-test, commonly known as dependent sample t-test, is used to find out whether the mean of the differences between two related samples is 0.
The test is done on dependent samples, usually focusing on a particular group of people or things. Each entity is measured twice, resulting in a pair of observations.
We can use this when:
Two similar (twin like) samples are given.
The dependent variable (data) is continuous.
The pairs of observations are independent of one another.
The differences between the paired values are approximately normally distributed.
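A minimal sketch of a paired test, assuming eight hypothetical subjects measured before and after some intervention. It uses `scipy.stats.ttest_rel` and also shows that the paired test is equivalent to a one sample t-test on the per-subject differences against 0.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for the same eight subjects, measured twice.
before = np.array([72.0, 80.0, 65.0, 90.0, 77.0, 84.0, 69.0, 75.0])
after = np.array([75.0, 83.0, 64.0, 94.0, 80.0, 88.0, 70.0, 79.0])

# Paired (dependent) t-test on the two measurements.
paired = stats.ttest_rel(after, before)

# Equivalent formulation: one sample t-test on the differences against 0.
diff = after - before
one_sample = stats.ttest_1samp(diff, popmean=0.0)

print(f"mean difference = {diff.mean():.3f}")
print(f"paired t = {paired.statistic:.3f}, p = {paired.pvalue:.4f}")
print(f"one sample t on differences = {one_sample.statistic:.3f}")
```

Because only the differences enter the calculation, the degree of freedom is the number of pairs minus 1.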
3. One sample t-test
One sample t-test is one of the most widely used t-tests. It compares the sample mean of the data to a given value, typically the true/population mean.
We can use this when:
The sample size is small (under 30).
The data is collected randomly.
The data is approximately normally distributed.
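Formula (Eq-5):
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
where x̄ = sample mean, μ = population mean, s = sample standard deviation, n = sample size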
Steps:
Step 1 - Define the null (H0) and alternative (H1) hypotheses.
Step 2 - Calculate the sample mean if it is not given (the population mean, standard deviation, and n are given).
Step 3 - Put the values found in Step 2 into Eq-5 and calculate the t-value (tcal).
Step 4 - Calculate the degree of freedom (df = n - 1, the same as in the paired sample t-test).
Step 5 - Take α = 0.05 if not given. Use the values of df and α to find ttable from the one-tailed t-table.
Step 6 - Compare the values of t found in Step 3 and Step 5.
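These steps can be sketched as below. The sample values and the population mean of 48.0 are hypothetical, and the sketch assumes the usual one sample statistic t = (sample mean - population mean) / (s / sqrt(n)) with an upper-tailed alternative (H1: mean > 48.0). The calculated t-value is checked against `scipy.stats.ttest_1samp`.

```python
import math
from scipy import stats

# Hypothetical sample and hypothesized population mean.
sample = [48.2, 51.5, 49.8, 52.1, 50.6, 47.9, 53.0, 50.2, 51.1, 49.4]
pop_mean = 48.0
alpha = 0.05

# Step 2: sample mean and sample standard deviation (n - 1 in the denominator).
n = len(sample)
mean = sum(sample) / n
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

# Step 3: calculated t-value.
t_cal = (mean - pop_mean) / (s / math.sqrt(n))

# Step 4: degrees of freedom.
df = n - 1

# Step 5: one-tailed critical value (replaces the t-table lookup).
t_table = stats.t.ppf(1 - alpha, df)

# Step 6: compare; the one-tailed p-value comes from the survival function.
reject_h0 = t_cal > t_table
p_one_tailed = stats.t.sf(t_cal, df)

check = stats.ttest_1samp(sample, popmean=pop_mean)  # two-sided by default

print(f"t_cal = {t_cal:.3f}, t_table = {t_table:.3f}, df = {df}")
print(f"one-tailed p = {p_one_tailed:.5f}, reject H0: {reject_h0}")
print(f"scipy t = {check.statistic:.3f}")
```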
Analysis
Two Independent Sample T-test
The independent t-test is used to test whether population means are significantly different from each other, using the means from randomly drawn samples.
What do you need to run an independent t-test?
In order to run an independent t-test, you need the following: One independent, categorical variable that has two levels/groups. One continuous dependent variable.
Hypothesis Statement:
The null hypothesis for the independent t-test is that the population means from the two unrelated groups are equal: H0: u1 = u2
From a sample of Employee data (fictitious data), let's find the significance of the relationship between Gender, Performance & Work Life Balance in the company.
H0 : mean1 = mean2
HA : mean1 != mean2
Which error would you say is more serious?
A false positive (type I error): when you reject a true null hypothesis
A false negative (type II error): when you fail to reject a false null hypothesis
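To make the two error types concrete, here is a small simulation sketch. All numbers (group size 30, standard deviation 10, a true difference of 5, 2000 repetitions) are assumptions chosen for illustration. When the null hypothesis is true, the share of rejections should be close to α; when it is false, the share of non-rejections estimates the type II error rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
alpha = 0.05
n_trials, n = 2000, 30

# Case 1: H0 is true (both groups share the same population mean).
a = rng.normal(loc=50.0, scale=10.0, size=(n_trials, n))
b = rng.normal(loc=50.0, scale=10.0, size=(n_trials, n))
p_null = stats.ttest_ind(a, b, axis=1).pvalue
type_1_rate = np.mean(p_null < alpha)    # rejected a true H0

# Case 2: H0 is false (the population means differ by 5).
c = rng.normal(loc=55.0, scale=10.0, size=(n_trials, n))
p_alt = stats.ttest_ind(a, c, axis=1).pvalue
type_2_rate = np.mean(p_alt >= alpha)    # failed to reject a false H0

print(f"estimated type I error rate  = {type_1_rate:.3f}")
print(f"estimated type II error rate = {type_2_rate:.3f}")
```

Lowering α reduces type I errors but raises type II errors for the same sample size, which is why the more serious error depends on the cost of each mistake in the application.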
Test statistic
The test statistic is a t statistic (t) defined by the following equations:
$$t = \frac{\bar{x} - \mu}{SE}, \qquad SE = s \sqrt{\frac{1}{n} \cdot \frac{N - n}{N - 1}}$$
where x̄ is the sample mean, μ is the hypothesized population mean in the null hypothesis, SE is the standard error, s is the sample standard deviation, n is the sample size, and N is the population size.
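As a quick sketch, the equation above translates directly into a helper function. The function name `t_statistic` and the input numbers are hypothetical. The factor (N - n) / (N - 1) is a finite population correction; when N is much larger than n it approaches 1 and SE reduces to s / sqrt(n).

```python
import math

def t_statistic(sample_mean, hypothesized_mean, s, n, N):
    """Return (t, SE) using the standard error with finite population correction."""
    se = s * math.sqrt((1 / n) * ((N - n) / (N - 1)))
    t = (sample_mean - hypothesized_mean) / se
    return t, se

# Hypothetical inputs: 25 employees sampled from a population of 500.
t_small, se_small = t_statistic(sample_mean=104.0, hypothesized_mean=100.0,
                                s=10.0, n=25, N=500)

# Same sample drawn from a very large population: the correction vanishes.
t_large, se_large = t_statistic(sample_mean=104.0, hypothesized_mean=100.0,
                                s=10.0, n=25, N=10**9)

print(f"N = 500:   SE = {se_small:.4f}, t = {t_small:.4f}")
print(f"N = 10^9:  SE = {se_large:.4f}, t = {t_large:.4f}")
```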
Importing Required Packages
t = (difference between males and females) / (difference within males and females), for the ratings: t > 1, t = 0, t = 1
Insight
The t score (statistic) is the ratio of the difference between the two groups to the difference within the groups. If the t value is high, the difference between the groups is large relative to the variation within them.
As the p-value is less than 0.05, there is a significant difference in Performance Rating between the genders.
Reject the null hypothesis: the difference is significant.
Let's check if there is a significant difference in Work Life Balance between males and females
Insight
As the p-value (0.005) is less than 0.05, there is a significant difference in Work Life Balance between the two gender groups.
Let's visualize the distribution of men according to Performance rating since this was significant
Let's visualize the distribution of women according to Performance rating since this was significant
Non-Parametric Test
This is a test used when, even after looking at the data, we have no idea about the population parameters, so it makes no assumptions about them.
Spearman
Measures the strength of association between two variables and direction of relationship (positive , negative or no-relation).
Why Spearman?
We run Spearman's rank correlation to find the relation between variables.
This test is applicable to ordinal data, and to continuous data that is not normally distributed; it is not suitable for nominal categories, which have no order.
The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables.
The correlation coefficient ranges from -1.0 to 1.0, so its absolute value is bounded by 1.0.
H0 : There is no association between WorkLife Balance & Performance Rating
Let's use the Spearman test to check the relation between Work Life Balance & Performance Rating
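A minimal sketch of such a check with `scipy.stats.spearmanr`. The two rating lists are hypothetical stand-ins for the employee columns, not the notebook's data. The second call illustrates that Spearman's coefficient measures monotonic association: a strictly increasing but non-linear relationship still gives a coefficient of 1.

```python
from scipy import stats

# Hypothetical ordinal ratings for twelve employees (not the original data).
work_life_balance = [1, 2, 2, 3, 3, 3, 4, 4, 2, 1, 3, 4]
performance_rating = [3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 3, 3]

alpha = 0.05
rho, p_value = stats.spearmanr(work_life_balance, performance_rating)

print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject H0: there is a significant association")
else:
    print("Fail to reject H0: no significant association")

# Sanity check: a monotonic but non-linear relationship has rho = 1.
rho_mono, _ = stats.spearmanr([1, 2, 3, 4, 5], [2, 4, 8, 16, 32])
print(f"rho for a strictly increasing relationship = {rho_mono:.1f}")
```

The sign of rho gives the direction of the relationship, and the p-value decides between H0 and the alternative.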
Insight
No significant association between Work Life Balance & Performance Rating
Favours Null hypothesis
Insight
Negative Correlation between Skills & Monthly Income
Favours Null hypothesis
One-Sample T-Test
Let's check the significance of the difference between the given means
Insight
As the p-value is less than 0.05, reject the null hypothesis.
There is a significant difference.
Hence we can conclude that the average salary of employees does not match the population mean.
Two-Sample T-Test
Employee skill set across the company
Two-sample t-test to check whether the mean salary for Sales Skills is different from Delivery Skills
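A sketch of this comparison on synthetic data. The salary arrays below are randomly generated placeholders (the group sizes, means, and spreads are assumptions), so the printed numbers will not match the notebook's results. The example also shows the `equal_var` parameter of `scipy.stats.ttest_ind`: the default assumes equal variances in both groups, while `equal_var=False` runs Welch's t-test, which does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic monthly salaries for two skill groups (placeholder data).
sales_salary = rng.normal(loc=52000, scale=9000, size=40)
delivery_salary = rng.normal(loc=53000, scale=15000, size=55)

alpha = 0.05

# Student's t-test: assumes both groups have the same variance.
student = stats.ttest_ind(sales_salary, delivery_salary)
# Welch's t-test: drops the equal-variance assumption.
welch = stats.ttest_ind(sales_salary, delivery_salary, equal_var=False)

for name, result in (("Student", student), ("Welch", welch)):
    decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
    print(f"{name}: t = {result.statistic:.3f}, "
          f"p = {result.pvalue:.3f} -> {decision}")
```

When the two groups differ clearly in size and spread, Welch's version is usually the safer choice.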
Insight
As the p-value (0.49) is greater than 0.05, we fail to reject the null hypothesis.
The mean salaries of employees in Sales & Delivery do not differ significantly.
Insight
As the p-value (8.09) is less than 0.05, reject the null hypothesis.