
What is Linear Regression?

  • Linear regression is an approach for modeling the relationship between two (simple linear regression) or more variables (multiple linear regression).

  • In simple linear regression, one variable is considered the predictor or independent variable, while the other variable is viewed as the outcome or dependent variable.

Formula: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$

Where:

| Symbol | Description |
| --- | --- |
| $y$ | Dependent variable (target) |
| $\beta_0$ | Intercept (bias term) |
| $\beta_1, \dots, \beta_n$ | Coefficients (slopes for each independent variable) |
| $x_1, \dots, x_n$ | Independent variables (features) |
| $\epsilon$ | Error term (residual) |
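To make the formula concrete, here is a minimal NumPy sketch (with made-up coefficients and feature values, not taken from this notebook) that evaluates $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ for a single observation, ignoring the error term:

import numpy as np

# Hypothetical coefficients: beta_0 = 4, beta_1 = 2, beta_2 = 0.5
beta_0 = 4.0
betas = np.array([2.0, 0.5])

# One toy observation with features x_1 = 3 and x_2 = 10
x = np.array([3.0, 10.0])

# Prediction: y_hat = beta_0 + beta_1*x_1 + beta_2*x_2
y_hat = beta_0 + np.dot(betas, x)
print(y_hat)   # 4 + 2*3 + 0.5*10 = 15.0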

Example Use Cases

| Use Case | Description |
| --- | --- |
| House Price Prediction | Predict house price based on size, location, number of bedrooms, etc. |
| Sales Forecasting | Estimate future sales using past sales data, advertising spend, and seasonality |
| Student Performance Prediction | Predict exam scores based on hours studied, attendance, and prior grades |
| Health Risk Assessment | Estimate risk score based on age, BMI, smoking habits, and family history |
| Energy Consumption Estimation | Predict electricity usage from temperature, time of day, and appliance use |

Why Linear Regression?

  • To estimate the parameters so that the model best fits the data

  • To forecast an effect

  • To determine a trend

Assumptions of Linear Regression

  • Linear relationship. One of the most important assumptions is that a linear relationship exists between the dependent and the independent variables.

  • No auto-correlation (independence). The residuals (error terms) are independent of each other; in other words, there is no correlation between consecutive error terms, as can occur in time-series data.

  • No multicollinearity. The independent variables should not be correlated with one another. If multicollinearity exists between the independent variables, it is challenging to interpret the outcome of the model.

  • Homoscedasticity. Homoscedasticity means the residuals have constant variance at every level of x. The absence of this property is known as heteroscedasticity.

  • Normal distribution of error terms. The last assumption to check is that the error terms are normally distributed (see the residual-check sketch just after this list).
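The following is a minimal sketch (illustrative data only, not from this notebook) of how the last two assumptions can be eyeballed: fit a line, then plot the residuals against x and as a histogram. Formal tests exist in packages such as statsmodels, but a visual check is often the first step.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data for illustration: y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3*x + 2 + rng.normal(0, 1.5, size=x.shape)

# Fit a straight line (degree-1 polynomial) and compute residuals
b_1, b_0 = np.polyfit(x, y, 1)
residuals = y - (b_0 + b_1*x)

# Residuals vs. x: a level band with no fan shape suggests homoscedasticity
plt.subplot(1, 2, 1)
plt.scatter(x, residuals)
plt.axhline(0, color='red')
plt.title('Residuals vs. x')

# Histogram: a roughly bell-shaped distribution suggests normal error terms
plt.subplot(1, 2, 2)
plt.hist(residuals, bins=10)
plt.title('Residual distribution')

plt.tight_layout()
plt.show()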

Weight ($y$) = $b_1 \cdot$ Height ($x$) + $b_0$

Rearranging this line equation gives $b_0 = y - b_1 x$ and $b_1 = \dfrac{y - b_0}{x}$.
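As a tiny worked example (made-up numbers, using the least-squares formulas derived later in this notebook), take three (height, weight) pairs: (60, 110), (65, 120), (70, 130). Then $\bar{x} = 65$ and $\bar{y} = 120$, so

$$SS_{xy} = \sum x_i y_i - n\bar{x}\bar{y} = 23500 - 23400 = 100, \qquad SS_{xx} = \sum x_i^2 - n\bar{x}^2 = 12725 - 12675 = 50$$

$$b_1 = \frac{SS_{xy}}{SS_{xx}} = 2, \qquad b_0 = \bar{y} - b_1\bar{x} = 120 - 2 \cdot 65 = -10$$

and indeed $2 \cdot 60 - 10 = 110$, $2 \cdot 65 - 10 = 120$, $2 \cdot 70 - 10 = 130$: the line fits these three points exactly.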

a = [3, 8, 7]
a + a            # list concatenation: [3, 8, 7, 3, 8, 7]
# a * a          # TypeError: lists do not support element-wise multiplication

import numpy as np
a1 = np.array(a)
a1 + a1          # element-wise addition: array([ 6, 16, 14])
a1 / a1          # element-wise division: array([1., 1., 1.])

Mathematical Approach

Simple Linear Regression

  • Let's assume that the two variables are linearly related.

  • Find a linear function that predicts the response value (y) as accurately as possible as a function of the feature or independent variable (x).

    x = [9, 10, 11, 12, 10, 9, 9, 10, 12, 11]
    y = [10, 11, 14, 13, 15, 11, 12, 11, 13, 15]

  • x is the feature vector, i.e., x = [x_1, x_2, ..., x_n],

  • y is the response vector, i.e., y = [y_1, y_2, ..., y_n],

  • for n observations (in the above example, n = 10).

  • Now, the task is to find the line that best fits the above scatter plot, so that we can predict the response for any new feature value (i.e., a value of x not present in the dataset). This line is called the regression line.

For simple linear regression, the regression line is

$$h(x_i) = b_0 + b_1 x_i$$

Here,

$h(x_i)$ represents the predicted response value for the $i$-th observation, and $b_0$ and $b_1$ are the regression coefficients, representing the y-intercept and the slope of the regression line respectively.

The regression coefficients are given by

$$b_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad b_0 = \bar{y} - b_1\bar{x}$$

where $SS_{xy}$ is the sum of cross-deviations of $y$ and $x$, and $SS_{xx}$ is the sum of squared deviations of $x$ about its mean:

$$SS_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\,\bar{x}\bar{y}, \qquad SS_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2$$

Predict Weight for a given Age

Weight = a*(Age) + c
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline

x = [1, 2, 3, 4, 7, 8, 9, 10, 11, 12, 10, 9, 9, 10, 12, 11, 13, 14, 15, 17, 18, 19, 20, 21, 22, 24, 25, 27]   # Age
y = [4, 6, 12, 16, 18, 19, 20, 21, 21, 18, 20, 23, 24, 24, 24, 25, 27, 28, 40, 42, 44, 46, 48, 50, 48, 51, 60, 62]   # Weight

plt.scatter(x, y, edgecolors='b')
plt.xlabel('feature vector', color='y')
plt.ylabel('response vector', color='b')
plt.show()
[Output: scatter plot of Age (feature) vs. Weight (response)]

Converting x and y into NumPy arrays

import numpy as np

x = np.array(x)
y = np.array(y)

Creating a function to estimate the regression coefficients

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

Plotting the regression line

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="y", marker="*", s=50)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="coral")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations
    x = [1, 2, 3, 4, 7, 8, 9, 10, 11, 12, 10, 9, 9, 10, 12, 11, 13, 14, 15, 17, 18, 19, 20, 21, 22, 24, 25, 27]   # Age
    y = [4, 6, 12, 16, 18, 19, 20, 21, 21, 18, 20, 23, 24, 24, 24, 25, 27, 28, 40, 42, 44, 46, 48, 50, 48, 51, 60, 62]   # Weight
    x = np.array(x)
    y = np.array(y)
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
Estimated coefficients:
b_0 = 1.3453817419580218
b_1 = 2.2130284055789957
[Output: scatter plot of the data with the fitted regression line]
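As a quick sanity check (a sketch, not part of the original notebook), NumPy's built-in least-squares fit should recover essentially the same coefficients:

# Cross-check estimate_coef against NumPy's built-in least-squares fit.
# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
b_1, b_0 = np.polyfit(x, y, 1)
print(b_0, b_1)   # expected to be close to b_0 ≈ 1.345 and b_1 ≈ 2.213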

Conclusion

Weight = 2.21*(Age) + 1.34

# Let us do a prediction for Age = 30
Age = 30
Weight = 2.21*Age + 1.34
Weight
67.64
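To avoid retyping the rounded coefficients, a small helper function (hypothetical, not in the original notebook) can wrap the fitted line:

def predict_weight(age, b_0=1.345, b_1=2.213):
    # Apply the fitted simple linear regression line: Weight = b_1*Age + b_0.
    # Default coefficients are the rounded estimates from above.
    return b_1 * age + b_0

predict_weight(30)   # ≈ 67.7, matching the hand computation above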

Define the linear equation:

  • $y = 5x + 4$

Add noise:

  • $y = 5x + 4 + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$

Generate data points: Create a set of x values and compute corresponding noisy y values.

a = np.arange(0, 100, 2)    # even integers from 0 up to (but not including) 100
a
x = np.linspace(1, 5, 3)    # 3 evenly spaced points from 1 to 5 inclusive
x
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Generate x values
x = np.linspace(-10, 10, 100)   # 100 points between -10 and 10

# Step 2: Define parameters for the line
slope = 5
intercept = 4

# Step 3: Generate Gaussian noise
noise = np.random.normal(loc=0, scale=5, size=x.shape)   # mean 0, std dev 5

# Step 4: Compute y = 5x + 4 + noise
y = slope * x + intercept + noise

# Optional: Plot the data
plt.scatter(x, y, label='Noisy Data', alpha=0.7)
plt.plot(x, slope * x + intercept, color='red', label='True Line')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.title('Linear Equation with Noise: y = 5x + 4 + noise')
plt.grid(True)
plt.show()
[Output: scatter of the noisy data with the true line y = 5x + 4 in red]
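Closing the loop (a sketch, not in the original notebook): fitting the noisy data with the estimate_coef function defined earlier should recover values close to the true slope 5 and intercept 4.

# Fit the noisy data and compare with the true parameters (slope = 5, intercept = 4)
b_0, b_1 = estimate_coef(x, y)
print("Recovered intercept b_0 =", b_0)   # expected near 4
print("Recovered slope b_1 =", b_1)       # expected near 5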