suyashi29
GitHub Repository: suyashi29/python-su
Path: blob/master/Data Science using Python/Day 1 Stats Basics.ipynb
Kernel: Python 3 (ipykernel)

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. In other words, it is a mathematical discipline for collecting and summarizing data.


# Mean, Mode, Median
# Sample data: 1, 2, 3, 45, 23 (sorted: 1, 2, 3, 23, 45)

# Mean computed by hand
(1 + 2 + 3 + 45 + 23) / 5
import statistics as m

# initializing list
data = [1, 2, 8, 11, 13, 33, 76, 75]

print("The average of list values is : ", end="")
print(m.mean(data))

Median

It is the middle value of the data set. It splits the data into two halves.

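The median() function of the statistics module returns the median, i.e., the middle element of the sorted data. When the number of data points is even, it returns the average of the two middle values. If the passed argument is empty, StatisticsError is raised.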

# Python code to demonstrate the working of median()
# on various ranges of data sets

# importing the statistics module
from statistics import median

# importing Fraction from the fractions module as fr
from fractions import Fraction as fr

# tuple of positive integer numbers
s1 = (12, 13, 14, 15, 17, 9)

# tuple of floating point values
s2 = (2.4, 5.1, 6.7, 8.9)

# tuple of fractional numbers
s3 = (fr(1, 2), fr(44, 12), fr(10, 3), fr(2, 3))

# tuple of a set of negative integers
s4 = (-5, -1, -12, -19, -3)

# tuple of a set of positive and negative integers
s5 = (-11, -2, -3, -4, 4, 3, 2, 1)

# Printing the median of the above data sets
print("Median of s1 is %s" % (median(s1)))
print("Median of s2 is %s" % (median(s2)))
print("Median of s3 is %s" % (median(s3)))
print("Median of s4 is %s" % (median(s4)))
print("Median of s5 is %s" % (median(s5)))
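To make the "middle value" definition concrete, the median can also be computed by hand and compared with statistics.median(). This is only a minimal sketch: the helper name median_by_hand is hypothetical and not part of the statistics module.

```python
from statistics import median


def median_by_hand(values):
    """Return the median by sorting and picking the middle element(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median_by_hand() requires at least one data point")
    mid = n // 2
    if n % 2 == 1:
        # Odd number of points: the single middle element
        return ordered[mid]
    # Even number of points: average of the two middle elements
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = (-5, -1, -12, -19, -3)
even_sample = (12, 13, 14, 15, 17, 9)

print(median_by_hand(odd_sample), median(odd_sample))
print(median_by_hand(even_sample), median(even_sample))
```

For the even-sized sample, both approaches average the two middle values of the sorted data, 13 and 14.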

Mode

It is the value that has the highest frequency in the given data set. The data set has no meaningful mode if all data points occur equally often. We can also have more than one mode if two or more values share the same highest frequency.

The mode() function returns the value with the maximum number of occurrences. Since Python 3.8, when several values are tied for the highest frequency, mode() returns the first one encountered; earlier versions raised StatisticsError in that case. If the passed argument is empty, StatisticsError is raised.

# Importing the statistics module
from statistics import mode

# Importing Fraction from the fractions module as fr;
# it is used to build the fractional sample values
from fractions import Fraction as fr

# tuple of positive integer numbers
S1 = (2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7)

# tuple of a set of floating point values
S2 = (2.4, 1.3, 1.3, 1.3, 2.4, 4.6)

# tuple of a set of fractional numbers
S3 = (fr(1, 2), fr(1, 2), fr(10, 3), fr(2, 3))

# tuple of a set of negative integers
S4 = (-1, -2, -2, -2, -7, -7, -9)

# tuple of strings
S5 = ("Ashi", "Ashi", "rear", "blog", "Red", "rear", "rear", "rear")

# Printing out the mode of the above data sets
print("Mode of data set 1 is %s" % (mode(S1)))
print("Mode of data set 2 is %s" % (mode(S2)))
print("Mode of data set 3 is %s" % (mode(S3)))
print("Mode of data set 4 is %s" % (mode(S4)))
print("Mode of data set 5 is %s" % (mode(S5)))
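When a data set has more than one mode, a frequency count makes all of them visible. The sketch below assumes Python 3.8 or later, where statistics.multimode() is available; the scores tuple is illustrative data, not taken from the examples above.

```python
from collections import Counter
from statistics import mode, multimode

# Illustrative data: 3 and 5 both occur twice
scores = (2, 3, 3, 5, 5, 7)

# Count occurrences, then keep every value that reaches the highest count
counts = Counter(scores)
highest = max(counts.values())
modes_by_hand = [value for value, count in counts.items() if count == highest]

print(modes_by_hand)      # every value tied for the highest frequency
print(multimode(scores))  # the same list, from the statistics module
print(mode(scores))       # only the first mode encountered (Python 3.8+)
```

If every value occurs equally often, multimode() returns all of the distinct values, which is one way to detect that the data has no single meaningful mode.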

Measure of Variability

So far, we have studied measures of central tendency, but these alone are not sufficient to describe the data. We also need a measure of variability, which describes the spread of the data, that is, how far the data points lie from one another and from the center. The most common measures of variability are:

Range
Variance
Standard deviation

Range

The difference between the largest and smallest data points in a data set is known as the range. The bigger the range, the more spread out the data, and vice versa. Because it depends only on the two extreme values, the range is sensitive to outliers.

Range = Largest data value – smallest data value

We can find the maximum and minimum values using the built-in max() and min() functions respectively.

# Sample data
age = [23, 24, 34, 45, 56, 76, 42]

# Finding max
Max_Age = max(age)

# Finding min
Min_Age = min(age)

# Difference of max and min
Range = Max_Age - Min_Age

print("Max_Age = {}, Min_Age = {} and Range = {}".format(Max_Age, Min_Age, Range))

Variance

It is defined as the average squared deviation from the mean. It is calculated by finding the difference between every data point and the mean (the average), squaring each difference, adding the squares together, and then dividing by the number of data points in the data set. This gives the population variance; for the sample variance, the sum is divided by one less than the number of data points.

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

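The statistics module provides the variance() function, which does all the maths behind the scenes. Note that variance() computes the sample variance (dividing by n - 1); the population variance is available as pvariance(). If fewer than two data points are passed, variance() raises StatisticsError.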

Example: Python code to calculate Variance

# Python code to demonstrate variance()

# importing statistics module
from statistics import variance

# importing fractions as parameter values
from fractions import Fraction as fr

# tuple of a set of positive integers
# numbers are spread apart but not very much
sample1 = (1, 2, 5, 4, 8, 9, 12)

# tuple of a set of negative integers
sample2 = (-2, -4, -3, -1, -5, -6)

# tuple of a set of positive and negative numbers
# data-points are spread apart considerably
sample3 = (-9, -1, 0, 2, 1, 3, 4, 19)

# tuple of a set of fractional numbers
sample4 = (fr(1, 2), fr(2, 3), fr(3, 4), fr(5, 6), fr(7, 8))

# tuple of a set of floating point values
sample5 = (1.23, 1.45, 2.1, 2.2, 1.9)

# Print the variance of each sample
print("Variance of Sample1 is %s " % (variance(sample1)))
print("Variance of Sample2 is %s " % (variance(sample2)))
print("Variance of Sample3 is %s " % (variance(sample3)))
print("Variance of Sample4 is %s " % (variance(sample4)))
print("Variance of Sample5 is %s " % (variance(sample5)))
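The difference between dividing by the number of data points and dividing by one less can be checked directly. The following sketch computes both versions by hand for the first sample and compares them with pvariance() and variance(); the intermediate variable names are illustrative.

```python
from statistics import mean, pvariance, variance

sample = (1, 2, 5, 4, 8, 9, 12)

n = len(sample)
avg = mean(sample)

# Sum of squared deviations from the mean
squared_deviations = sum((x - avg) ** 2 for x in sample)

# Population variance divides by n; sample variance divides by n - 1
population_var = squared_deviations / n
sample_var = squared_deviations / (n - 1)

print("By hand   :", population_var, sample_var)
print("statistics:", pvariance(sample), variance(sample))
```

The sample variance is always the larger of the two, and the gap shrinks as the number of data points grows.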

Standard Deviation

It is defined as the square root of the variance. To calculate it, find the mean (the average), subtract the mean from each data point, and square each difference. Then add the squared differences, divide by the number of terms (or by one less than that number for a sample), and take the square root of the result.

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

The stdev() function of the statistics module returns the sample standard deviation of the data; pstdev() returns the population standard deviation. If fewer than two data points are passed, stdev() raises StatisticsError.

Example: Python code to calculate Standard Deviation

# Python code to demonstrate the stdev()
# function on various ranges of data sets

# importing the statistics module
from statistics import stdev

# creating a varying range of sample sets
# numbers are spread apart but not very much
s1 = (11, 21, 5.5, 42, 81, 19, 12)

# tuple of a set of negative integers
s2 = (-12, -14, -13, -11, -15, -26)

# Print the standard deviation of the
# following sample sets of observations
print("The Standard Deviation of Sample1 is %s" % (stdev(s1)))
print("The Standard Deviation of Sample2 is %s" % (stdev(s2)))
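Since the standard deviation is defined as the square root of the variance, the two functions can be checked against each other. A short sketch, using math.isclose() to allow for floating-point rounding:

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance, stdev, variance

s1 = (11, 21, 5.5, 42, 81, 19, 12)

# Sample versions: stdev() should equal the square root of variance()
sample_matches = isclose(stdev(s1), sqrt(variance(s1)))

# Population versions: pstdev() should equal the square root of pvariance()
population_matches = isclose(pstdev(s1), sqrt(pvariance(s1)))

print(sample_matches, population_matches)
```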

Python – Normal Distribution in Statistics

A probability distribution describes the probabilities of all the outcomes a random variable can take. The distribution can be either continuous or discrete, depending on the values the random variable takes. There are several types of probability distribution, such as the normal, uniform, and exponential distributions. In this section, we look at the normal distribution and at how to plot it with Python.

What is Normal Distribution

The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is symmetric about its mean and has a bell-shaped curve. It is one of the most widely used probability distributions. Two parameters characterize it:

Mean (μ) – It represents the center of the distribution

Standard Deviation (σ) – It represents the spread of the curve

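The probability density function of the normal distribution is

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$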
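The normal density can be evaluated directly from the mean μ and standard deviation σ. The sketch below implements the standard density formula in a hypothetical helper called normal_pdf and compares it with statistics.NormalDist, assuming Python 3.8 or later. The parameter values are illustrative.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std dev sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


mu, sigma = 250, 5.5
reference = NormalDist(mu, sigma)

# Evaluate the density one standard deviation below, at, and above the mean
for x in (mu - sigma, mu, mu + sigma):
    print(x, normal_pdf(x, mu, sigma), reference.pdf(x))
```

The density peaks at the mean and takes the same value at points equally far on either side of it.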

Properties Of Normal Distribution

Symmetric distribution – The normal distribution is symmetric about its mean. The distribution is perfectly balanced around the mean, with half of the data on either side.

Bell-Shaped curve – The graph of a normal distribution takes the form of a bell-shaped curve, with most of the points concentrated around the mean. The mean determines where the curve is centered, and the standard deviation determines how wide it is.

Empirical Rule – The normal distribution follows the empirical rule: about 68% of the data lies within 1 standard deviation of the mean, about 95% lies within 2 standard deviations, and about 99.7% lies within 3 standard deviations.

image-2.png
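The percentages in the empirical rule can be recovered from the cumulative distribution function of a standard normal distribution. This sketch assumes Python 3.8 or later, where statistics.NormalDist is available; the helper name share_within is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Probability of falling within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

The three values come out at roughly 0.68, 0.95, and 0.997.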

Python code for plotting Normal Distribution

import numpy as np
import matplotlib.pyplot as plt

# Mean of the distribution
Mean = 250

# standard deviation of the distribution
Standard_deviation = 5.5

# number of random values to draw
size = 200000

# creating normally distributed data
values = np.random.normal(Mean, Standard_deviation, size)

# plotting a histogram with 100 bins
plt.hist(values, 100)

# plotting the mean line
plt.axvline(values.mean(), color='y', linestyle='dashed', linewidth=2)
plt.show()