GitHub Repository: suyashi29/python-su
Path: blob/master/Data Science using Python/Examples Hypothesis Testing .ipynb

Two-sample t-test

A two-sample t-test lets us test whether there is a significant difference between the means of two independent groups. The example below uses simulated sample data and walks through the steps.

We will use the following Python libraries:

  • pandas (for data handling)

  • scipy.stats (for statistical tests)

  • seaborn and matplotlib (for visualizations)

Hypothesis:

Let’s assume we have two groups of students from different sections of a class, and we want to check if there is a significant difference in their test scores.

  • Null Hypothesis (H₀): The mean scores of both sections are equal.

  • Alternative Hypothesis (H₁): The mean scores of both sections are not equal.

    # Import necessary libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats

    # Create sample data
    np.random.seed(42)  # For reproducibility

    # Scores for Section A (normally distributed with mean=75, std=10)
    section_a_scores = np.random.normal(75, 10, 30)

    # Scores for Section B (normally distributed with mean=80, std=12)
    section_b_scores = np.random.normal(80, 12, 30)

    # Creating a DataFrame for easy handling
    data = pd.DataFrame({
        'Scores': np.concatenate([section_a_scores, section_b_scores]),
        'Section': ['A']*30 + ['B']*30
    })

    # Visualizing the data
    plt.figure(figsize=(8, 5))
    sns.boxplot(x='Section', y='Scores', data=data)
    plt.title('Test Scores by Section')
    plt.show()

    # Perform a two-sample t-test
    t_stat, p_value = stats.ttest_ind(section_a_scores, section_b_scores)

    # Output the result
    print(f"T-Statistic: {t_stat:.3f}")
    print(f"P-Value: {p_value:.3f}")

    # Set significance level
    alpha = 0.05

    # Hypothesis testing conclusion
    if p_value < alpha:
        print("We reject the null hypothesis (H₀). There is a significant difference between the two groups.")
    else:
        print("We fail to reject the null hypothesis (H₀). There is no significant difference between the two groups.")
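The two sections in the example are simulated with different standard deviations (10 and 12), while `stats.ttest_ind` pools the variances by default. The sketch below, which is an addition and not part of the original notebook, runs Welch's t-test (`equal_var=False`) on the same simulated data for comparison. The variable names `t_student`, `t_welch`, and so on are chosen here for illustration.

```python
import numpy as np
from scipy import stats

# Same simulated data as the two-sample example (seed and parameters copied from it)
np.random.seed(42)
section_a_scores = np.random.normal(75, 10, 30)
section_b_scores = np.random.normal(80, 12, 30)

# Student's t-test (the default): pools the two sample variances
t_student, p_student = stats.ttest_ind(section_a_scores, section_b_scores)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(section_a_scores, section_b_scores, equal_var=False)

print(f"Student's t-test: t = {t_student:.3f}, p = {p_student:.3f}")
print(f"Welch's t-test:   t = {t_welch:.3f}, p = {p_welch:.3f}")
```

Because both groups have the same size here, the two t-statistics coincide; only the degrees of freedom, and therefore the p-values, differ slightly. With unequal group sizes the statistics themselves can differ as well.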
The examples that follow cover three more tests:

  1. Paired t-test for dependent samples.

  2. ANOVA for comparing more than two groups.

  3. Chi-Square Test for categorical data.

Paired t-test (for dependent samples)

This is useful when you have two related groups (e.g., before and after measurements from the same subjects).

Hypothesis:

  • Null Hypothesis (H₀): The mean difference between the paired samples is 0.

  • Alternative Hypothesis (H₁): The mean difference between the paired samples is not 0.

    # Simulating paired data (e.g., before and after treatment scores)
    np.random.seed(42)

    # Scores before treatment (mean=70, std=8)
    before_treatment = np.random.normal(70, 8, 30)

    # Scores after treatment: each subject's "before" score plus an
    # individual improvement drawn from a normal distribution (mean=5, std=5)
    after_treatment = before_treatment + np.random.normal(5, 5, 30)

    # Paired t-test
    t_stat, p_value = stats.ttest_rel(before_treatment, after_treatment)

    # Output the result
    print(f"Paired T-Statistic: {t_stat:.3f}")
    print(f"Paired P-Value: {p_value:.3f}")

    # Set significance level
    alpha = 0.05

    # Hypothesis testing conclusion
    if p_value < alpha:
        print("We reject the null hypothesis (H₀). The treatment had a significant effect.")
    else:
        print("We fail to reject the null hypothesis (H₀). The treatment did not have a significant effect.")
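Since the null hypothesis is stated in terms of the mean difference, a paired t-test is equivalent to a one-sample t-test on the per-subject differences. The following sketch (an added illustration, using the same simulated data) checks this by comparing `stats.ttest_rel` with `stats.ttest_1samp` and with a hand-computed statistic. The names `differences` and `t_manual` are introduced here and do not appear in the original notebook.

```python
import numpy as np
from scipy import stats

# Same simulated paired data as the paired example
np.random.seed(42)
before_treatment = np.random.normal(70, 8, 30)
after_treatment = before_treatment + np.random.normal(5, 5, 30)

# Per-subject differences (before minus after, matching the argument order below)
differences = before_treatment - after_treatment

# Paired t-test on the two related samples
t_paired, p_paired = stats.ttest_rel(before_treatment, after_treatment)

# One-sample t-test of the differences against a mean of 0
t_one, p_one = stats.ttest_1samp(differences, 0.0)

# The same statistic by hand: mean difference divided by its standard error
n = len(differences)
t_manual = differences.mean() / (differences.std(ddof=1) / np.sqrt(n))

print(f"ttest_rel:   t = {t_paired:.3f}, p = {p_paired:.3g}")
print(f"ttest_1samp: t = {t_one:.3f}, p = {p_one:.3g}")
print(f"manual:      t = {t_manual:.3f}")
```

The t-statistic is negative here only because the differences are computed as before minus after; swapping the argument order flips the sign and leaves the two-sided p-value unchanged.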

ANOVA (Analysis of Variance) (for comparing more than two groups)

ANOVA is useful when you want to compare the means of three or more independent groups.

Hypothesis:

  • Null Hypothesis (H₀): The means of all groups are equal.

  • Alternative Hypothesis (H₁): At least one group mean is different from the others.

    # Simulating data for 3 groups (e.g., test scores from 3 different classes)
    np.random.seed(42)

    # Scores for Class A, B, and C (normally distributed with different means, same std)
    class_a_scores = np.random.normal(70, 10, 30)
    class_b_scores = np.random.normal(75, 10, 30)
    class_c_scores = np.random.normal(80, 10, 30)

    # Combine the data into a DataFrame
    data_anova = pd.DataFrame({
        'Scores': np.concatenate([class_a_scores, class_b_scores, class_c_scores]),
        'Class': ['A']*30 + ['B']*30 + ['C']*30
    })

    # Perform one-way ANOVA
    f_stat, p_value = stats.f_oneway(class_a_scores, class_b_scores, class_c_scores)

    # Output the result
    print(f"F-Statistic: {f_stat:.3f}")
    print(f"ANOVA P-Value: {p_value:.3f}")

    # Set significance level
    alpha = 0.05

    # Hypothesis testing conclusion
    if p_value < alpha:
        print("We reject the null hypothesis (H₀). At least one class has a significantly different mean score.")
    else:
        print("We fail to reject the null hypothesis (H₀). There is no significant evidence that the class means differ.")
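To make the F-statistic less of a black box, the sketch below recomputes it from the between-group and within-group sums of squares and compares the result with `stats.f_oneway`. This is an added illustration on the same simulated data; the helper names (`ss_between`, `ss_within`, `f_manual`) are chosen here and are not from the original notebook.

```python
import numpy as np
from scipy import stats

# Same simulated data as the ANOVA example
np.random.seed(42)
groups = [
    np.random.normal(70, 10, 30),  # Class A
    np.random.normal(75, 10, 30),  # Class B
    np.random.normal(80, 10, 30),  # Class C
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
k = len(groups)            # number of groups
n_total = all_scores.size  # total number of observations

# Between-group variation: how far each group mean lies from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group variation: spread of the scores around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares and the F ratio
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

# Upper-tail probability of the F distribution with (k-1, n_total-k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"Manual: F = {f_manual:.3f}, p = {p_manual:.3g}")
print(f"SciPy:  F = {f_scipy:.3f}, p = {p_scipy:.3g}")
```

A large F means the group means are spread out relative to the noise inside each group, which is what leads to rejecting H₀.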

Chi-Square Test (for categorical data)

This test is used to examine the relationship between two categorical variables in a contingency table.

Hypothesis:

  • Null Hypothesis (H₀): The two categorical variables are independent.

  • Alternative Hypothesis (H₁): The two categorical variables are not independent.

Example: We want to test whether there is a significant relationship between gender (male/female) and preference for a product (yes/no).

    # Creating the data in long form (Gender vs Product Preference with counts)
    data_chi2 = pd.DataFrame({
        'Gender': ['Male', 'Male', 'Female', 'Female'],
        'Preference': ['Yes', 'No', 'Yes', 'No'],
        'Count': [20, 15, 30, 10]
    })

    # Reshape data into a 2x2 contingency table (rows: Gender, columns: Preference)
    contingency_table = data_chi2.pivot(index='Gender', columns='Preference', values='Count')

    # Perform Chi-Square Test
    # Note: for 2x2 tables, chi2_contingency applies Yates' continuity correction by default
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)

    # Output the result
    print(f"Chi-Square Statistic: {chi2_stat:.3f}")
    print(f"Chi-Square P-Value: {p_value:.3f}")

    # Set significance level
    alpha = 0.05

    # Hypothesis testing conclusion
    if p_value < alpha:
        print("We reject the null hypothesis (H₀). There is a significant relationship between gender and product preference.")
    else:
        print("We fail to reject the null hypothesis (H₀). There is no significant relationship between gender and product preference.")
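The test compares the observed counts with the counts that would be expected if gender and preference were independent. As an added illustration, the sketch below computes those expected counts by hand (row total × column total ÷ grand total) and checks them against `stats.chi2_contingency`. It passes `correction=False` so that the hand-computed statistic matches; the names `observed`, `expected_manual`, and `chi2_manual` are introduced here for the example.

```python
import numpy as np
from scipy import stats

# Observed counts from the example
# Rows: Female, Male; columns: No, Yes
observed = np.array([
    [10, 30],
    [15, 20],
])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected count in each cell under independence
expected_manual = np.outer(row_totals, col_totals) / grand_total

# Pearson chi-square statistic without continuity correction
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# SciPy's result with the Yates correction switched off, for a like-for-like comparison
chi2_scipy, p_scipy, dof, expected_scipy = stats.chi2_contingency(observed, correction=False)

print("Expected counts under independence:")
print(expected_manual)
print(f"Manual chi-square: {chi2_manual:.3f}")
print(f"SciPy chi-square:  {chi2_scipy:.3f} (dof = {dof}, p = {p_scipy:.3f})")
```

With the default `correction=True`, the statistic for a 2x2 table is somewhat smaller and the p-value somewhat larger than the uncorrected values computed here.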

Summary of Hypothesis Tests and Use Cases

| Test Type | Use Case | Python Function |
|---|---|---|
| Two-Sample t-test | Compare means of two independent groups (e.g., test scores of two sections). | stats.ttest_ind() |
| Paired t-test | Compare means of two related groups (e.g., before and after measurements from the same individuals). | stats.ttest_rel() |
| ANOVA | Compare means of three or more independent groups (e.g., test scores from multiple classes). | stats.f_oneway() |
| Chi-Square Test | Test the association or independence between categorical variables (e.g., gender vs product preference). | stats.chi2_contingency() |

Test Descriptions:

  1. Two-Sample t-test:

    • Used when comparing the means of two independent groups.

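    • Assumes the data in each group is approximately normally distributed and, with the default equal_var=True in stats.ttest_ind(), that the two groups have equal variances (equal_var=False runs Welch's t-test, which drops the equal-variance assumption).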

    • Example: Testing whether the mean scores of students from two different sections are significantly different.

  2. Paired t-test:

    • Used when comparing means from the same group at different times or under different conditions.

    • Example: Testing if there is a significant improvement in test scores before and after a training program.

  3. ANOVA (Analysis of Variance):

    • Used when comparing the means of three or more independent groups.

    • Example: Comparing test scores from students in three different classes to see if there is a significant difference between the classes.

  4. Chi-Square Test:

    • Used to examine the relationship between two categorical variables.

    • Example: Testing if gender is associated with product preference (Yes/No) in a sample population.
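The two-sample t-test description above lists normality and equal variances as assumptions. One possible way to check them before running the test is sketched below, using the Shapiro-Wilk test (`stats.shapiro`) for normality and Levene's test (`stats.levene`) for equal variances on the simulated section scores. This is an added sketch rather than part of the original notebook, and choosing `equal_var` from a preliminary test at the 0.05 level is a simplifying assumption made here, not a universal recommendation.

```python
import numpy as np
from scipy import stats

# Same simulated data as the two-sample example
np.random.seed(42)
section_a_scores = np.random.normal(75, 10, 30)
section_b_scores = np.random.normal(80, 12, 30)

alpha = 0.05

# Normality check per group: H0 is "the sample comes from a normal distribution"
for name, scores in [("A", section_a_scores), ("B", section_b_scores)]:
    w_stat, p_norm = stats.shapiro(scores)
    verdict = "no evidence against normality" if p_norm >= alpha else "normality is doubtful"
    print(f"Section {name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p_norm:.3f} ({verdict})")

# Equal-variance check: H0 is "both groups have the same variance"
levene_stat, p_levene = stats.levene(section_a_scores, section_b_scores)
print(f"Levene statistic = {levene_stat:.3f}, p = {p_levene:.3f}")

# Illustrative rule: fall back to Welch's t-test if equal variances are rejected
equal_var = bool(p_levene >= alpha)
t_stat, p_value = stats.ttest_ind(section_a_scores, section_b_scores, equal_var=equal_var)
print(f"t-test with equal_var={equal_var}: t = {t_stat:.3f}, p = {p_value:.3f}")
```

With small samples these checks have limited power, so a non-significant result does not prove that the assumption holds; a plot such as the boxplot from the first example remains a useful complement.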