GitHub Repository: suyashi29/python-su
Path: blob/master/Advanced Data Analysis using Python/2 Numpy for Linear Regression .ipynb
Kernel: Python 3 (ipykernel)

What is Linear regression?

  • Linear regression is an approach for modeling the relationship between a dependent variable and one independent variable (simple linear regression) or several independent variables (multiple linear regression).

  • In simple linear regression, one variable is treated as the predictor (independent variable) and the other as the outcome (dependent variable).

Examples

  • Predicting Exam scores based on study hours

  • Predicting Profit for Quantity Sold

  • Predicting employee retention from years of service

Why Linear Regression?

  • Finding the parameters for which the model best fits the data

  • Forecasting an effect

  • Determining a trend

Assumptions of Linear Regression

  • Linear relationship. The dependent variable is assumed to be a linear function of the independent variables, plus an error term.

  • No autocorrelation (independence). The residuals (error terms) are independent of each other. For time series data, this means that consecutive error terms are uncorrelated.

  • No multicollinearity. The independent variables should not be strongly correlated with one another. When they are, the individual coefficient estimates become unstable and hard to interpret. This assumption only applies to multiple linear regression.

  • Homoscedasticity. The residuals have constant variance at every level of x. When the variance changes with x, the data is said to be heteroscedastic.

  • Normal distribution of error terms. The error terms are assumed to follow a normal distribution.
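Some of these assumptions can be inspected numerically once a line has been fitted. The following is a minimal sketch, not a formal test. It assumes the age/weight data used later in this notebook, fits the line with `np.polyfit`, and computes three rough indicators chosen here for illustration (`mean_residual`, `lag1_corr`, and `spread_ratio` are hypothetical names, not part of the original notebook).

```python
import numpy as np

# Age/weight observations taken from the plotting example in this notebook.
x = np.array([1, 2, 3, 4, 7, 8, 9, 10, 11, 12, 10, 9, 9, 10, 12, 11,
              13, 14, 15, 17, 18, 19, 20, 21, 22, 24, 25, 27], dtype=float)
y = np.array([4, 6, 12, 16, 18, 19, 20, 21, 21, 18, 20, 23, 24, 24, 24, 25,
              27, 28, 40, 42, 44, 46, 48, 50, 48, 51, 60, 62], dtype=float)

# Least-squares line; for degree 1, np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# With an intercept in the model, least-squares residuals average to zero.
mean_residual = residuals.mean()

# Independence: correlation between consecutive residuals (lag 1),
# taking the listed order as the order of observation (an assumption).
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

# Homoscedasticity: compare the residual spread for small and large x.
median_x = np.median(x)
low = residuals[x <= median_x]
high = residuals[x > median_x]
spread_ratio = high.std(ddof=1) / low.std(ddof=1)

print(f"mean residual: {mean_residual:.2e}")
print(f"lag-1 residual correlation: {lag1_corr:.3f}")
print(f"spread ratio (high x / low x): {spread_ratio:.3f}")
```

A lag-1 correlation far from zero, or a spread ratio far from 1, would suggest that the independence or constant-variance assumption deserves a closer look. Neither number replaces a proper statistical test, and normality of the residuals is not checked here.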

Weight (y) = b1 * Height (x) + b0

Rearranging for each coefficient: b0 = y - b1 * x and b1 = (y - b0) / x

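import numpy as np

a = [1, 23, 43]
a1 = np.array(a)
a1 + a1  # NumPy arrays add element-wise: array([ 2, 46, 86])

[1, 2, 3] + [4, 5, 6]  # Python lists concatenate: [1, 2, 3, 4, 5, 6]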
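The difference between arrays and lists matters for the regression code later in this notebook, which relies on element-wise array arithmetic. A small sketch with made-up values (the arrays `u` and `v` are illustrative only):

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

products = u * v         # element-wise product: array([ 4, 10, 18])
total = np.sum(u * v)    # 4 + 10 + 18 = 32
mean_u = np.mean(u)      # (1 + 2 + 3) / 3 = 2.0

print(products, total, mean_u)
```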

Mathematical Approach

Simple Linear Regression

  • Let's assume that the two variables are linearly related.

  • The goal is to find a linear function that predicts the response value (y) as accurately as possible from the feature, or independent variable, (x).

    x = [9, 10, 11, 12, 10, 9, 9, 10, 12, 11]
    y = [10, 11, 14, 13, 15, 11, 12, 11, 13, 15]

  • x is the feature vector, i.e. x = [x_1, x_2, ..., x_n],

  • y is the response vector, i.e. y = [y_1, y_2, ..., y_n],

  • for n observations (in the example above, n = 10).

  • The task is to find the line that best fits the scatter plot of these points, so that we can predict the response for any new feature value (i.e. a value of x not present in the dataset). This line is called the regression line.

$$h(x_i) = b_0 + b_1 x_i$$

Here,

  • $h(x_i)$ is the predicted response value for the i-th observation.

  • $b_0$ and $b_1$ are the regression coefficients: the y-intercept and the slope of the regression line, respectively.

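The least-squares estimates of the coefficients are

$$b_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

  • where $SS_{xy}$ is the sum of cross-deviations of y and x:

$$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}$$

  • and $SS_{xx}$ is the sum of squared deviations of x:

$$SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2$$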
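The coefficient formulas used by this notebook's code (slope = SS_xy / SS_xx, intercept = mean of y minus slope times mean of x) can be tried on the ten-point data set listed earlier. The sketch below computes both sums in their deviation form and in the shortcut form used by the code, to show that the two agree. The variable names ending in `_dev` are introduced here for illustration.

```python
import numpy as np

# The ten observations from the small example above.
x = np.array([9, 10, 11, 12, 10, 9, 9, 10, 12, 11], dtype=float)
y = np.array([10, 11, 14, 13, 15, 11, 12, 11, 13, 15], dtype=float)

n = x.size
m_x, m_y = x.mean(), y.mean()          # 10.3 and 12.5

# Deviation form: sums of (cross-)deviations about the means.
SS_xy_dev = np.sum((x - m_x) * (y - m_y))
SS_xx_dev = np.sum((x - m_x) ** 2)

# Shortcut form, algebraically identical to the deviation form.
SS_xy = np.sum(x * y) - n * m_x * m_y  # 1298 - 1287.5, about 10.5
SS_xx = np.sum(x * x) - n * m_x * m_x  # 1073 - 1060.9, about 12.1

b_1 = SS_xy / SS_xx                    # slope, about 0.868
b_0 = m_y - b_1 * m_x                  # intercept, about 3.562

print(f"SS_xy = {SS_xy:.3f}, SS_xx = {SS_xx:.3f}")
print(f"b_0 = {b_0:.4f}, b_1 = {b_1:.4f}")
```

The shortcut form avoids building the deviation vectors, which is why the function defined later in the notebook uses it.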

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline

x = [1,2,3,4,7,8,9,10, 11, 12, 10, 9, 9, 10, 12, 11,13,14,15,17,18,19,20,21,22,24,25,27]  # Age
y = [4,6,12,16,18,19,20,21,21,18,20, 23, 24, 24, 24, 25,27,28,40,42,44,46,48,50,48,51,60,62]  # weight

plt.scatter(x, y, edgecolors='r')
plt.xlabel('feature vector', color="y")
plt.ylabel('response vector', color="b")
plt.show()

[Figure: scatter plot of the response vector (weight) against the feature vector (age)]

Converting x and y into NumPy arrays

import numpy as np

x = np.array(x)
y = np.array(y)

Creating a function to estimate the regression coefficients

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)
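One way to gain confidence in a hand-written estimator is to compare it with a library routine. The sketch below restates the closed-form calculation so that it runs on its own, then compares the result with `np.polyfit` on the age/weight data used in this notebook. Using `np.polyfit` as the reference is a choice made here, not something the original notebook does.

```python
import numpy as np

def estimate_coef(x, y):
    """Closed-form simple linear regression, restated from the notebook."""
    n = np.size(x)
    m_x, m_y = np.mean(x), np.mean(y)
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return b_0, b_1

x = np.array([1, 2, 3, 4, 7, 8, 9, 10, 11, 12, 10, 9, 9, 10, 12, 11,
              13, 14, 15, 17, 18, 19, 20, 21, 22, 24, 25, 27], dtype=float)
y = np.array([4, 6, 12, 16, 18, 19, 20, 21, 21, 18, 20, 23, 24, 24, 24, 25,
              27, 28, 40, 42, 44, 46, 48, 50, 48, 51, 60, 62], dtype=float)

b_0, b_1 = estimate_coef(x, y)

# For a degree-1 fit, np.polyfit returns the slope first, then the intercept.
ref_slope, ref_intercept = np.polyfit(x, y, 1)

print(f"closed form: b_0 = {b_0:.6f}, b_1 = {b_1:.6f}")
print(f"np.polyfit:  b_0 = {ref_intercept:.6f}, b_1 = {ref_slope:.6f}")
print("agree:", np.allclose([b_0, b_1], [ref_intercept, ref_slope]))
```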

Plotting the regression line

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="b", marker="*", s=50)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color="coral")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations
    x = [1,2,3,4,7,8,9,10, 11, 12, 10, 9, 9, 10, 12, 11,13,14,15,17,18,19,20,21,22,24,25,27]  # Age
    y = [4,6,12,16,18,19,20,21,21,18,20, 23, 24, 24, 24, 25,27,28,40,42,44,46,48,50,48,51,60,62]  # weight
    x = np.array(x)
    y = np.array(y)

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Estimated coefficients:
b_0 = 1.3453817419580218
b_1 = 2.2130284055789957

[Figure: scatter plot of the observations with the fitted regression line]

Conclusion

Weight = 2.21(Age) + 1.34 (coefficients truncated to two decimal places)

## Let us do a prediction:
Age = 30
Weight = 2.21*Age + 1.34
Weight

67.64
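The prediction above uses coefficients cut to two decimal places. As a hedged sketch, the same prediction can be made with the full-precision coefficients. Here `predict_weight` is a hypothetical helper introduced for illustration, and the coefficients are recomputed with `np.polyfit` so that the block runs on its own. Note that an age of 30 lies beyond the largest observed age (27), so this is an extrapolation.

```python
import numpy as np

age = np.array([1, 2, 3, 4, 7, 8, 9, 10, 11, 12, 10, 9, 9, 10, 12, 11,
                13, 14, 15, 17, 18, 19, 20, 21, 22, 24, 25, 27], dtype=float)
weight = np.array([4, 6, 12, 16, 18, 19, 20, 21, 21, 18, 20, 23, 24, 24, 24, 25,
                   27, 28, 40, 42, 44, 46, 48, 50, 48, 51, 60, 62], dtype=float)

# Slope first, then intercept, for a degree-1 fit.
b_1, b_0 = np.polyfit(age, weight, 1)

def predict_weight(a, intercept=b_0, slope=b_1):
    """Hypothetical helper: evaluate the fitted line at age a."""
    return intercept + slope * a

pred_full = predict_weight(30)          # full-precision coefficients
pred_truncated = 2.21 * 30 + 1.34       # coefficients as used above

print(f"full-precision coefficients: {pred_full:.2f}")       # about 67.74
print(f"truncated coefficients:      {pred_truncated:.2f}")  # 67.64
```

The two results differ by roughly 0.1, which shows how truncating the coefficients carries through to the prediction.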