Path: blob/master/Advanced Data Analysis using Python/2 Numpy for Linear Regression .ipynb
What is Linear regression?
Linear regression is an approach for modeling the relationship between two (simple linear regression) or more variables (multiple linear regression).
In simple linear regression, one variable is considered the predictor or independent variable, while the other variable is viewed as the outcome or dependent variable.
Examples
Predicting Exam scores based on study hours
Predicting Profit for Quantity Sold
Predicting employee retention from years of service
Why Linear Regression?
To find the parameters so that the model best fits the data.
Forecasting an effect
Determining a trend
Assumptions of Linear Regression
Linear relationship. One of the most important assumptions is that a linear relationship exists between the dependent and the independent variables
No auto-correlation or independence. The residuals (error terms) are independent of each other. In other words, there is no correlation between the consecutive error terms of the time series data
No Multicollinearity. The independent variables shouldn’t be correlated. If multicollinearity exists between the independent variables, it is challenging to predict the outcome of the model
Homoscedasticity. Homoscedasticity means the residuals have constant variance at every level of x. The absence of this phenomenon is known as heteroscedasticity
Normal distribution of error terms. The last assumption that needs to be checked for linear regression is the error terms’ normal distribution
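Several of these assumptions can be checked from the residuals of a fitted model. A minimal sketch, using hypothetical residual values (not from the notebook's data) to illustrate two quick checks: residuals should average to roughly zero, and their spread should be similar across the data (homoscedasticity):

```python
import numpy as np

# Hypothetical residuals from a fitted model (illustrative values only)
residuals = np.array([0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.1, 0.4, -0.5])

# Residuals should average to (roughly) zero
print(round(residuals.mean(), 2))

# A rough homoscedasticity check: compare the variance of the
# first and second halves of the residuals
first, second = residuals[:5], residuals[5:]
print(round(first.var(), 2), round(second.var(), 2))
```

In practice these checks are done graphically (residual-vs-fitted plots, Q-Q plots), but even simple summary statistics like these can flag gross violations.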
Weight(Y) = b1 · Height(X) + b0
b1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
b0 = ȳ − b1 · x̄
Mathematical Approach
Simple Linear Regression
Let's assume that the two variables are linearly related.
Find a linear function that predicts the response value (y) as accurately as possible from the feature or independent variable (x).
x = [9, 10, 11, 12, 10, 9, 9, 10, 12, 11]
y = [10, 11, 14, 13, 15, 11, 12, 11, 13, 15]
x is the feature vector, i.e. x = [x_1, x_2, …, x_n],
y is the response vector, i.e. y = [y_1, y_2, …, y_n]
for n observations (in above example, n=10).
Now the task is to find the line that best fits the scatter plot above, so that we can predict the response for any new feature value (i.e. a value of x not present in the dataset). This line is called the regression line.
The regression line can be written as h(x_i) = b_0 + b_1 · x_i
Here, h(x_i) represents the predicted response value for the i-th observation, and b_0 and b_1 are the regression coefficients: the y-intercept and the slope of the regression line, respectively.
The slope is b_1 = SS_xy / SS_xx, where SS_xy = Σ(x_i − x̄)(y_i − ȳ) is the sum of cross-deviations of y and x, and SS_xx = Σ(x_i − x̄)² is the sum of squared deviations of x. The intercept is then b_0 = ȳ − b_1 · x̄.
Converting x and y into arrays using NumPy
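The notebook's code cells are not shown here; a minimal sketch of this step, converting the x and y lists above into NumPy arrays:

```python
import numpy as np

# The observation lists from the example above, as NumPy arrays
x = np.array([9, 10, 11, 12, 10, 9, 9, 10, 12, 11])
y = np.array([10, 11, 14, 13, 15, 11, 12, 11, 13, 15])

print(x.shape, y.shape)  # both are 1-D arrays of length n = 10
```

Arrays allow the element-wise products and sums in the coefficient formulas to be written without explicit loops.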
Creating a function to determine the regression coefficients
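A sketch of this step, implementing the SS_xy / SS_xx formulas above as a function (the name `estimate_coef` is an assumption, since the notebook's cell is not shown):

```python
import numpy as np

def estimate_coef(x, y):
    """Return (b_0, b_1) for the least-squares line y = b_0 + b_1 * x."""
    n = x.size
    SS_xy = np.sum(x * y) - n * x.mean() * y.mean()  # cross-deviations of x and y
    SS_xx = np.sum(x * x) - n * x.mean() ** 2        # squared deviations of x
    b_1 = SS_xy / SS_xx                # slope
    b_0 = y.mean() - b_1 * x.mean()    # intercept
    return b_0, b_1

x = np.array([9, 10, 11, 12, 10, 9, 9, 10, 12, 11])
y = np.array([10, 11, 14, 13, 15, 11, 12, 11, 13, 15])
b_0, b_1 = estimate_coef(x, y)
print(round(b_0, 3), round(b_1, 3))  # → 3.562 0.868
```

For this dataset, SS_xy = 10.5 and SS_xx = 12.1, giving a fitted line of roughly y = 3.562 + 0.868·x.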
Plotting the regression line
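A sketch of the plotting step with Matplotlib, assuming the coefficient formulas from the previous section; in a notebook you would call `plt.show()` instead of saving to a file:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([9, 10, 11, 12, 10, 9, 9, 10, 12, 11])
y = np.array([10, 11, 14, 13, 15, 11, 12, 11, 13, 15])

# Regression coefficients via the SS_xy / SS_xx formulas
n = x.size
b_1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x * x) - n * x.mean() ** 2)
b_0 = y.mean() - b_1 * x.mean()

plt.scatter(x, y, color="m", marker="o")   # observed points
plt.plot(x, b_0 + b_1 * x, color="g")      # fitted regression line
plt.xlabel("x")
plt.ylabel("y")
plt.savefig("regression_line.png")         # use plt.show() in a notebook
```

The line passes through (x̄, ȳ), which is a useful visual sanity check on the fit.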
Conclusion
Weight = 2.21 · Age + 1.34
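The fitted model can then be used for prediction. A minimal sketch using the coefficients from the conclusion above (the Age/Weight dataset that produced them is not shown in this excerpt):

```python
def predict_weight(age):
    # Fitted model from the conclusion: Weight = 2.21 * Age + 1.34
    # (coefficients taken from the notebook's result, dataset not shown)
    return 2.21 * age + 1.34

print(round(predict_weight(10), 2))  # → 23.44
```

This is the payoff of the regression: once b_0 and b_1 are estimated, predicting for a new x is a single linear evaluation.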