CoCalc -- 1.3 Stats Basics.ipynb

GitHub Repository: suyashi29/python-su
Path: blob/master/Data Science Essentials for Data Analysts/1.3 Stats Basics.ipynb
³⁰⁷⁹ views

Kernel: Python 3 (ipykernel)

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. In other words, it is a mathematical discipline to collect, summarize data

# Mean,Mode,Median


1,2,3,4,5,-1,9,-8,1

Median

-8,-1,1,,2,3,4,5,9,10

A-60,B-12,C-65,D-85,E-75,F-null : Test Avearge Marks
12,60,65,75,85,90,68 : Median =65,mean=59.4

In [ ]:

(60+65+85+75)/4

In [ ]:

import statistics as m
 
# initializing list
data = [1,2,8,11,13,33,76,75]
 
print ("The average of list values is : ",end="")
print (m.mean(data))

Median

It is the middle value of the data set. It splits the data into two halves.

The median() function is used to calculate the median, i.e middle element of data. If the passed argument is empty, StatisticsError is raised.

In [ ]:

# Python code to demonstrate the
# working of median() on various
# range of data-sets

# importing the statistics module
from statistics import median

# Importing fractions module as fr
from fractions import Fraction as fr

# tuple of positive integer numbers
s1 = (12, 13, 14, 15, 17, 9)

# tuple of floating point values
s2 = (2.4, 5.1, 6.7, 8.9)

# tuple of fractional numbers
s3 = (fr(1, 2), fr(44, 12),fr(10, 3), fr(2, 3))

# tuple of a set of negative integers
s4 = (-5, -1, -12, -19, -3)

# tuple of set of positive
# and negative integers
s5 = [-11, -2, -3, -4, 4, 3, 2, 1]

# Printing the median of above datasets
print("Median of s1 is ", (median(s1)))
print("Median of s2 is ", (median(s2)))
print("Median of s3 is ",(median(s3)))
print("Median of s4 is ",(median(s4)))
print("Median of s5 is ", (median(s5)))

-11,-4,-3,-2,1,2,3,4
(-1/2)

Mode

It is the value that has the highest frequency in the given data set. The data set may have no mode if the frequency of all data points is the same. Also, we can have more than one mode if we encounter two or more data points having the same frequency.

The mode() function returns the number with the maximum number of occurrences. If the passed argument is empty, StatisticsError is raised.

In [ ]:


from statistics import mode

from fractions import Fraction as fr

# tuple of a set of floating point values
S2 = ("A", "A", "C", "D")

# tuple of a set of fractional numbers
S3 = (fr(1, 2), fr(1, 2), fr(10, 3), fr(2, 3))

# tuple of a set of negative integers
S4 = (-1, 7, 7, 2, 2, 8, -9)

# tuple of strings
S5 = ("Ashi", "Ashi", "Rear", "blog", "Red", "rear", "rear","rear")

# Printing out the mode of the above data-sets
print("Mode of data set 2 ",(mode(S2)))
print("Mode of data set 3", (mode(S3)))
print("Mode of data set 4 is % s" % (mode(S4)))
print("Mode of data set 5 is % s" % (mode(S5)))

mode(("E","B","C"))

In [ ]:

(1+2+3+4)/4

A - 200
B-  100

1. Range = 200-100 = 100



2. 1,2,3,4 
mean=(2.5-1)+(2-2.5)+

mean= 2
(1-2)+(2-2)+(3-2)

Measure of Variability

Till now, we have studied the measure of central tendency but this alone is not sufficient to describe the data. To overcome this we need the measure of variability. The measure of variability is known as the spread of data or how well our data is distributed. The most common variability measures are:

Range
Variance
Standard deviation

Range

The difference between the largest and smallest data point in our data set is known as the range. The range is directly proportional to the spread of data which means the bigger the range, the more the spread of data and vice versa.

Range = Largest data value – smallest data value

We can calculate the maximum and minimum values using the max() and min() methods respectively.

In [ ]:

# Sample Data
age = [23, 24, 34, 45, 56,76,42]
 
#Finding Max
Max_Age = max(age)
# Finding Min
Min_Age = min(age)
 
# Difference Of Max and Min
Range = Max_Age-Min_Age   
print("Max_Age = {}, Min_Age = {} and Range = {}".format(Max_Age, Min_Age, Range))

Variance

It is defined as an average squared deviation from the mean. It is calculated by finding the difference between every data point and the average which is also known as the mean, squaring them, adding all of them, and then dividing by the number of data points present in our data set.

The statistics module provides the variance() method that does all the maths behind the scene. If the passed argument is empty, StatisticsError is raised.

Example: Python code to calculate Variance

In [ ]:

# Python code to demonstrate variance()

# importing statistics module
from statistics import variance

# importing fractions as parameter values
from fractions import Fraction as fr

# tuple of a set of positive integers
# numbers are spread apart but not very much
sample1 = (1, 2, 5, 4, 8, 9, 12)

# tuple of a set of negative integers
sample2 = (-2, -4, -3, -1, -5, -6)

# tuple of a set of positive and negative numbers
# data-points are spread apart considerably
sample3 = (-9, -1, -0, 2, 1, 3, 4, 19)

# tuple of a set of fractional numbers
sample4 = (fr(1, 2), fr(2, 3), fr(3, 4),
		fr(5, 6), fr(7, 8))

# tuple of a set of floating point values
sample5 = (1.23, 1.45, 2.1, 2.2, 1.9)

# Print the variance of each samples
print("Variance of Sample1 is % s " % (variance(sample1)))
print("Variance of Sample2 is % s " % (variance(sample2)))
print("Variance of Sample3 is % s " % (variance(sample3)))
print("Variance of Sample4 is % s " % (variance(sample4)))
print("Variance of Sample5 is % s " % (variance(sample5)))

Standard Deviation

It is defined as the square root of the variance. It is calculated by finding the Mean, then subtracting each number from the Mean which is also known as the average, and squaring the result. Adding all the values and then dividing by the no of terms followed by the square root.

The stdev() method of the statistics module returns the standard deviation of the data. If the passed argument is empty, StatisticsError is raised.

Example: Python code to calculate Standard Deviation

In [ ]:

## 300, 200,350, 450, , 500

(300+200+350+450+500)/5

In [ ]:

# Python code to demonstrate stdev()
# function on various range of datasets

# importing the statistics module
from statistics import stdev

# importing fractions as parameter values
from fractions import Fraction as fr

# creating a varying range of sample sets
# numbers are spread apart but not very much
s1 = (11, 21, 5.5, 42, 81, 19, 12)

# tuple of a set of negative integers
s2 = (-12, -14, -13, -11, -15, -26)

# Print the standard deviation of
# following sample sets of observations
print("The Standard Deviation of Sample1 is % s"
      % (stdev(sample1)))
 
print("The Standard Deviation of Sample2 is % s"
      % (stdev(sample2)))

Python – Normal Distribution in Statistics

A probability distribution determines the probability of all the outcomes a random variable takes. The distribution can either be continuous or discrete distribution depending upon the values that a random variable takes. There are several types of probability distribution like Normal distribution, Uniform distribution, exponential distribution, etc. In this article, we will see about Normal distribution and we will also see how we can use Python to plot the Normal distribution.

What is Normal Distribution

The normal distribution is a continuous probability distribution function also known as Gaussian distribution which is symmetric about its mean and has a bell-shaped curve. It is one of the most used probability distributions. Two parameters characterize it

Mean(μ)- It represents the center of the distribution

Standard Deviation(σ) – It represents the spread in the curve

The formula for Normal distribution is

Properties Of Normal Distribution

Symmetric distribution – The normal distribution is symmetric about its mean point. It means the distribution is perfectly balanced toward its mean point with half of the data on either side.

Bell-Shaped curve – The graph of a normal distribution takes the form bell-shaped curve with most of the points accumulated at its mean position. The shape of this curve is determined by the mean and standard deviation of the distribution

Empirical Rule – The normal distribution curve follows the empirical rule where 68% of the data lies within 1 standard deviation from the mean of the graph, 95% of the data lies within 2 standard deviations from the mean and 97% of the data lies within 3 standard deviations from the mean.

Python code for plotting Normal Distribution

In [ ]:

import numpy as np
import matplotlib.pyplot as plt

# Mean of the distribution
Mean = 200

# satndard deviation of the distribution
Standard_deviation = 1

# size
size = 200000

# creating a normal distribution data
values = np.random.normal(Mean, Standard_deviation, size)

# plotting histograph
plt.hist(values, 100)
# plotting mean line
plt.axvline(values.mean(), color='y', linestyle='dashed', linewidth=2)
plt.show()

Inferential Statistics

is a branch of statistics that allows us to make predictions, generalizations, or inferences about a population based on a sample of data. It goes beyond merely describing data (descriptive statistics) to drawing conclusions about larger groups.

Key Concepts in Inferential Statistics

Population vs. Sample:

Population: The entire group of interest (e.g., all employees in a company).
Sample: A subset of the population used to make inferences (e.g., 100 employees chosen randomly).
Use sample data to make educated guesses about the population.
Quantify uncertainty in these guesses.

Hypothesis Testing:

Test assumptions (hypotheses) about a population. Example: "The average salary of employees is $60,000."

Confidence Intervals:

Provide a range of values within which the true population parameter likely lies. Example: "We are 95% confident the average salary is between $58,000 and $62,000."

Statistical Significance:

Measures whether the observed effect is unlikely due to chance. Example: If the p-value is below 0.05, the result is statistically significant.

Estimation:

Estimate population parameters (e.g., mean, proportion) from sample data.

Examples of Inferential Statistics

Election Polling:

Use a sample of voters to predict the outcome of an election.

Clinical Trials:

Test the effectiveness of a new drug on a small group and infer results for the entire population.

Market Research:

Survey 1,000 customers to estimate customer satisfaction across all customers.

- Null Statement:

Ho:There is no change in average rating of empl in ABC organization
   There is no difference in salaries of male and female in an organization
   There  is no increase in revenue fo r 2024 2mb 2023.

Ha: 2023:3 , 2024: 4

Example of Inferential Statistics

Scenario: A company wants to compare the average sales performance of two teams (Team A and Team B) to determine if there is a significant difference in their mean sales.

Data:

Team A and Team B sales data (monthly sales in dollars):
Team A: [250, 270, 260, 280, 300, 290, 310]
Team B: [275, 290, 280, 305, 315, 295, 320]

We can perform an independent two-sample t-test to compare their means.

In [3]:

import numpy as np
from scipy import stats

# Sales data
team_a = [250, 270, 260, 280, 300, 290, 310]
team_b = [275, 290, 280, 305, 315, 295, 320]

# Perform two-sample t-test
t_stat, p_value = stats.ttest_ind(team_a, team_b)

# Display results
alpha = 0.05 #95%
print(f"T-Statistic: {t_stat:.2f}")
print(f"P-Value: {p_value:.3f}")

if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant difference between the mean sales of the two teams.")
else:
    print("Fail to reject the null hypothesis: No significant difference between the mean sales of the two teams.")

Out[3]:

T-Statistic: -1.65
P-Value: 0.125
Fail to reject the null hypothesis: No significant difference between the mean sales of the two teams.

Output Explanation:

T-Statistic: Measures the size of the difference relative to the variation in your sample data.
P-Value: If this value is less than the significance level (α = 0.05), the null hypothesis is rejected.

Inferentail Tests

Test Name	Purpose	Example
One-Sample T-Test	Compare the mean of a single sample to a known population mean.	Testing if the average height of students is 170 cm when the population mean is 172 cm.
Independent Two-Sample T-Test	Compare the means of two independent groups to see if they are significantly different.	Comparing the average test scores of two different classes.
Paired T-Test	Compare means from the same group at different times (paired observations).	Measuring the weight of individuals before and after a diet program.
ANOVA (Analysis of Variance)	Compare means across three or more groups to see if at least one is different.	Comparing the effectiveness of three different teaching methods on student performance.
Chi-Square Test of Independence	Assess whether two categorical variables are independent.	Determining if there is an association between gender and voting preference.
Chi-Square Goodness of Fit Test	Determine if a sample matches a population with a specific distribution.	Checking if the distribution of colors in a bag of candies matches the expected distribution.
Regression Analysis	Examine the relationship between a dependent variable and one or more independent variables.	Predicting house prices based on features like size, location, and number of bedrooms.
Logistic Regression	Predict a binary outcome based on one or more predictor variables.	Determining whether a customer will buy a product (Yes/No) based on age, income, and browsing history.
Mann-Whitney U Test	Compare differences between two independent groups when the dependent variable is ordinal or not normally distributed.	Comparing customer satisfaction ratings between two different stores.
Wilcoxon Signed-Rank Test	Compare two related samples to assess whether their population mean ranks differ.	Comparing pre-test and post-test scores of the same group of students.
Kruskal-Wallis H Test	Compare three or more independent groups on an ordinal dependent variable.	Assessing the impact of different diets on weight loss across multiple groups.
Pearson Correlation	Measure the linear relationship between two continuous variables.	Analyzing the correlation between hours studied and exam scores.
Spearman Rank Correlation	Measure the monotonic relationship between two ranked variables.	Assessing the relationship between customer satisfaction ranks and product quality ranks.
Two-Way ANOVA	Examine the effect of two different categorical independent variables on one continuous dependent variable, including interaction effects.	Studying the impact of both teaching method and study time on student performance.
Factor Analysis	Identify underlying variables (factors) that explain the pattern of correlations within a set of observed variables.	Reducing a large number of survey items into key factors representing customer satisfaction.
Time Series Analysis	Analyze data points collected or recorded at specific time intervals to identify trends, seasonal patterns, and other temporal structures.	Forecasting monthly sales data to predict future sales trends.