
What is Linear Regression?

  • Linear regression is an approach for modeling the relationship between two (simple linear regression) or more variables (multiple linear regression).

  • In simple linear regression, one variable is considered the predictor or independent variable, while the other variable is viewed as the outcome or dependent variable.

Formula: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$

Where:

| Symbol | Description |
| --- | --- |
| $y$ | Dependent variable (target) |
| $\beta_0$ | Intercept (bias term) |
| $\beta_1, \dots, \beta_n$ | Coefficients (slopes for each independent variable) |
| $x_1, \dots, x_n$ | Independent variables (features) |
| $\epsilon$ | Error term (residual) |
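To make the formula concrete, here is a minimal NumPy sketch (with made-up coefficients and feature values, not taken from this notebook) that evaluates $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ for a single observation, ignoring the error term:

import numpy as np

# Hypothetical coefficients: beta_0 = 4, beta_1 = 2, beta_2 = 0.5
beta_0 = 4.0
betas = np.array([2.0, 0.5])

# One toy observation with features x_1 = 3 and x_2 = 10
x = np.array([3.0, 10.0])

# Prediction: y_hat = beta_0 + beta_1*x_1 + beta_2*x_2
y_hat = beta_0 + np.dot(betas, x)
print(y_hat)   # 4 + 2*3 + 0.5*10 = 15.0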

Example Use Cases

| Use Case | Description |
| --- | --- |
| House Price Prediction | Predict house price based on size, location, number of bedrooms, etc. |
| Sales Forecasting | Estimate future sales using past sales data, advertising spend, and seasonality |
| Student Performance Prediction | Predict exam scores based on hours studied, attendance, and prior grades |
| Health Risk Assessment | Estimate risk score based on age, BMI, smoking habits, and family history |
| Energy Consumption Estimation | Predict electricity usage from temperature, time of day, and appliance use |

Why Linear Regression?

  • To estimate the parameters so that the model best fits the data

  • To forecast an effect

  • To determine a trend

Assumptions of Linear Regression

  • Linear relationship. One of the most important assumptions is that a linear relationship exists between the dependent and the independent variables.

  • No auto-correlation (independence). The residuals (error terms) are independent of each other; in other words, there is no correlation between consecutive error terms, as can occur in time-series data.

  • No multicollinearity. The independent variables should not be correlated with one another. If multicollinearity exists between the independent variables, it is challenging to interpret the outcome of the model.

  • Homoscedasticity. Homoscedasticity means the residuals have constant variance at every level of x. The absence of this property is known as heteroscedasticity.

  • Normal distribution of error terms. The last assumption to check is that the error terms are normally distributed (see the residual-check sketch just after this list).
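The following is a minimal sketch (illustrative data only, not from this notebook) of how the last two assumptions can be eyeballed: fit a line, then plot the residuals against x and as a histogram. Formal tests exist in packages such as statsmodels, but a visual check is often the first step.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data for illustration: y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3*x + 2 + rng.normal(0, 1.5, size=x.shape)

# Fit a straight line (degree-1 polynomial) and compute residuals
b_1, b_0 = np.polyfit(x, y, 1)
residuals = y - (b_0 + b_1*x)

# Residuals vs. x: a level band with no fan shape suggests homoscedasticity
plt.subplot(1, 2, 1)
plt.scatter(x, residuals)
plt.axhline(0, color='red')
plt.title('Residuals vs. x')

# Histogram: a roughly bell-shaped distribution suggests normal error terms
plt.subplot(1, 2, 2)
plt.hist(residuals, bins=10)
plt.title('Residual distribution')

plt.tight_layout()
plt.show()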

Weight ($y$) = $b_1 \cdot$ Height ($x$) + $b_0$

Rearranging this line equation gives $b_0 = y - b_1 x$ and $b_1 = \dfrac{y - b_0}{x}$.
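As a tiny worked example (made-up numbers, using the least-squares formulas derived later in this notebook), take three (height, weight) pairs: (60, 110), (65, 120), (70, 130). Then $\bar{x} = 65$ and $\bar{y} = 120$, so

$$SS_{xy} = \sum x_i y_i - n\bar{x}\bar{y} = 23500 - 23400 = 100, \qquad SS_{xx} = \sum x_i^2 - n\bar{x}^2 = 12725 - 12675 = 50$$

$$b_1 = \frac{SS_{xy}}{SS_{xx}} = 2, \qquad b_0 = \bar{y} - b_1\bar{x} = 120 - 2 \cdot 65 = -10$$

and indeed $2 \cdot 60 - 10 = 110$, $2 \cdot 65 - 10 = 120$, $2 \cdot 70 - 10 = 130$: the line fits these three points exactly.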

a = [3, 8, 7]
a + a            # list concatenation: [3, 8, 7, 3, 8, 7]
# a * a          # TypeError: lists do not support element-wise multiplication

import numpy as np
a1 = np.array(a)
a1 + a1          # element-wise addition: array([ 6, 16, 14])
a1 / a1          # element-wise division: array([1., 1., 1.])

Mathematical Approach

Simple Linear Regression

  • Let's assume that the two variables are linearly related.

  • Find a linear function that predicts the response value (y) as accurately as possible as a function of the feature or independent variable (x).

    x = [9, 10, 11, 12, 10, 9, 9, 10, 12, 11]
    y = [10, 11, 14, 13, 15, 11, 12, 11, 13, 15]

  • x is the feature vector, i.e., x = [x_1, x_2, ..., x_n],

  • y is the response vector, i.e., y = [y_1, y_2, ..., y_n],

  • for n observations (in the above example, n = 10).

  • Now, the task is to find the line that best fits the above scatter plot, so that we can predict the response for any new feature value (i.e., a value of x not present in the dataset). This line is called the regression line.

For simple linear regression, the regression line is

$$h(x_i) = b_0 + b_1 x_i$$

Here,

$h(x_i)$ represents the predicted response value for the $i$-th observation, and $b_0$ and $b_1$ are the regression coefficients, representing the y-intercept and the slope of the regression line respectively.

The regression coefficients are given by

$$b_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad b_0 = \bar{y} - b_1\bar{x}$$

where $SS_{xy}$ is the sum of cross-deviations of $y$ and $x$, and $SS_{xx}$ is the sum of squared deviations of $x$ about its mean:

$$SS_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\,\bar{x}\bar{y}, \qquad SS_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2$$

Predict Weight for a given Age

Weight = a*(Age) + c
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline

x = [1, 2, 3, 4, 7, 8, 9, 10, 11, 12, 10, 9, 9, 10, 12, 11, 13, 14, 15, 17, 18, 19, 20, 21, 22, 24, 25, 27]   # Age
y = [4, 6, 12, 16, 18, 19, 20, 21, 21, 18, 20, 23, 24, 24, 24, 25, 27, 28, 40, 42, 44, 46, 48, 50, 48, 51, 60, 62]   # Weight

plt.scatter(x, y, edgecolors='b')
plt.xlabel('feature vector', color='y')
plt.ylabel('response vector', color='b')
plt.show()
[Output: scatter plot of Age (feature) vs. Weight (response)]

Converting x and y into NumPy arrays

import numpy as np

x = np.array(x)
y = np.array(y)

Creating a function to estimate the regression coefficients

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

Plotting the regression line

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="y", marker="*", s=50)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="coral")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations
    x = [1, 2, 3, 4, 7, 8, 9, 10, 11, 12, 10, 9, 9, 10, 12, 11, 13, 14, 15, 17, 18, 19, 20, 21, 22, 24, 25, 27]   # Age
    y = [4, 6, 12, 16, 18, 19, 20, 21, 21, 18, 20, 23, 24, 24, 24, 25, 27, 28, 40, 42, 44, 46, 48, 50, 48, 51, 60, 62]   # Weight
    x = np.array(x)
    y = np.array(y)
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
Estimated coefficients:
b_0 = 1.3453817419580218
b_1 = 2.2130284055789957
[Output: scatter plot of the data with the fitted regression line]
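As a quick sanity check (a sketch, not part of the original notebook), NumPy's built-in least-squares fit should recover essentially the same coefficients:

# Cross-check estimate_coef against NumPy's built-in least-squares fit.
# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
b_1, b_0 = np.polyfit(x, y, 1)
print(b_0, b_1)   # expected to be close to b_0 ≈ 1.345 and b_1 ≈ 2.213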

Conclusion

Weight = 2.21*(Age) + 1.34

# Let us do a prediction for Age = 30
Age = 30
Weight = 2.21*Age + 1.34
Weight
67.64
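To avoid retyping the rounded coefficients, a small helper function (hypothetical, not in the original notebook) can wrap the fitted line:

def predict_weight(age, b_0=1.345, b_1=2.213):
    # Apply the fitted simple linear regression line: Weight = b_1*Age + b_0.
    # Default coefficients are the rounded estimates from above.
    return b_1 * age + b_0

predict_weight(30)   # ≈ 67.7, matching the hand computation above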

Define the linear equation:

  • $y = 5x + 4$

Add noise:

  • $y = 5x + 4 + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$

Generate data points: Create a set of x values and compute corresponding noisy y values.

a = np.arange(0, 100, 2)    # even integers from 0 up to (but not including) 100
a
x = np.linspace(1, 5, 3)    # 3 evenly spaced points from 1 to 5 inclusive
x
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Generate x values
x = np.linspace(-10, 10, 100)   # 100 points between -10 and 10

# Step 2: Define parameters for the line
slope = 5
intercept = 4

# Step 3: Generate Gaussian noise
noise = np.random.normal(loc=0, scale=5, size=x.shape)   # mean 0, std dev 5

# Step 4: Compute y = 5x + 4 + noise
y = slope * x + intercept + noise

# Optional: Plot the data
plt.scatter(x, y, label='Noisy Data', alpha=0.7)
plt.plot(x, slope * x + intercept, color='red', label='True Line')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.title('Linear Equation with Noise: y = 5x + 4 + noise')
plt.grid(True)
plt.show()
[Output: scatter of the noisy data with the true line y = 5x + 4 in red]
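Closing the loop (a sketch, not in the original notebook): fitting the noisy data with the estimate_coef function defined earlier should recover values close to the true slope 5 and intercept 4.

# Fit the noisy data and compare with the true parameters (slope = 5, intercept = 4)
b_0, b_1 = estimate_coef(x, y)
print("Recovered intercept b_0 =", b_0)   # expected near 4
print("Recovered slope b_1 =", b_1)       # expected near 5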