Path: blob/master/Advanced Data Analysis using Python/2 Numpy for Linear Regression .ipynb
What is Linear regression?
Linear regression is an approach for modeling the relationship between two (simple linear regression) or more variables (multiple linear regression).
In simple linear regression, one variable is considered the predictor or independent variable, while the other variable is viewed as the outcome or dependent variable.
Examples
Predicting Exam scores based on study hours
Predicting Profit for Quantity Sold
Predicting employee retention from years of service
Why Linear Regression?
To find the parameters so that the model best fits the data.
Forecasting an effect
Determining a trend
Assumptions of Linear Regression
Linear relationship. One of the most important assumptions is that a linear relationship exists between the dependent and the independent variables
No auto-correlation or independence. The residuals (error terms) are independent of each other. In other words, there is no correlation between the consecutive error terms of the time series data
No Multicollinearity. The independent variables shouldn’t be correlated. If multicollinearity exists between the independent variables, it is challenging to predict the outcome of the model
Homoscedasticity. Homoscedasticity means the residuals have constant variance at every level of x. The absence of this phenomenon is known as heteroscedasticity
Normal distribution of error terms. The last assumption that needs to be checked for linear regression is the error terms’ normal distribution
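Several of these assumptions can be checked from the residuals of a fitted model. A minimal sketch, using hypothetical residual values (not from the notebook's data) to illustrate two quick checks: residuals should average to roughly zero, and their spread should be similar across the data (homoscedasticity):

```python
import numpy as np

# Hypothetical residuals from a fitted model (illustrative values only)
residuals = np.array([0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.1, 0.4, -0.5])

# Residuals should average to (roughly) zero
print(round(residuals.mean(), 2))

# A rough homoscedasticity check: compare the variance of the
# first and second halves of the residuals
first, second = residuals[:5], residuals[5:]
print(round(first.var(), 2), round(second.var(), 2))
```

In practice these checks are done graphically (residual-vs-fitted plots, Q-Q plots), but even simple summary statistics like these can flag gross violations.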
Weight(Y) = b1 · Height(X) + b0
b1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
b0 = ȳ − b1 · x̄
Mathematical Approach
Simple Linear Regression
Let's assume that the two variables are linearly related.
Find a linear function that predicts the response value (y) as accurately as possible from the feature or independent variable (x).
x = [9, 10, 11, 12, 10, 9, 9, 10, 12, 11]
y = [10, 11, 14, 13, 15, 11, 12, 11, 13, 15]
x is the feature vector, i.e. x = [x_1, x_2, …, x_n],
y is the response vector, i.e. y = [y_1, y_2, …, y_n]
for n observations (in above example, n=10).
Now the task is to find the line that best fits the scatter plot above, so that we can predict the response for any new feature value (i.e. a value of x not present in the dataset). This line is called the regression line.
The regression line can be written as h(x_i) = b_0 + b_1 · x_i
Here, h(x_i) represents the predicted response value for the i-th observation, and b_0 and b_1 are the regression coefficients: the y-intercept and the slope of the regression line, respectively.
The slope is b_1 = SS_xy / SS_xx, where SS_xy = Σ(x_i − x̄)(y_i − ȳ) is the sum of cross-deviations of y and x, and SS_xx = Σ(x_i − x̄)² is the sum of squared deviations of x. The intercept is then b_0 = ȳ − b_1 · x̄.
Converting x and y into arrays using NumPy
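The notebook's code cells are not shown here; a minimal sketch of this step, converting the x and y lists above into NumPy arrays:

```python
import numpy as np

# The observation lists from the example above, as NumPy arrays
x = np.array([9, 10, 11, 12, 10, 9, 9, 10, 12, 11])
y = np.array([10, 11, 14, 13, 15, 11, 12, 11, 13, 15])

print(x.shape, y.shape)  # both are 1-D arrays of length n = 10
```

Arrays allow the element-wise products and sums in the coefficient formulas to be written without explicit loops.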
Creating a function to determine the regression coefficients
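A sketch of this step, implementing the SS_xy / SS_xx formulas above as a function (the name `estimate_coef` is an assumption, since the notebook's cell is not shown):

```python
import numpy as np

def estimate_coef(x, y):
    """Return (b_0, b_1) for the least-squares line y = b_0 + b_1 * x."""
    n = x.size
    SS_xy = np.sum(x * y) - n * x.mean() * y.mean()  # cross-deviations of x and y
    SS_xx = np.sum(x * x) - n * x.mean() ** 2        # squared deviations of x
    b_1 = SS_xy / SS_xx                # slope
    b_0 = y.mean() - b_1 * x.mean()    # intercept
    return b_0, b_1

x = np.array([9, 10, 11, 12, 10, 9, 9, 10, 12, 11])
y = np.array([10, 11, 14, 13, 15, 11, 12, 11, 13, 15])
b_0, b_1 = estimate_coef(x, y)
print(round(b_0, 3), round(b_1, 3))  # → 3.562 0.868
```

For this dataset, SS_xy = 10.5 and SS_xx = 12.1, giving a fitted line of roughly y = 3.562 + 0.868·x.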
Plotting the regression line
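A sketch of the plotting step with Matplotlib, assuming the coefficient formulas from the previous section; in a notebook you would call `plt.show()` instead of saving to a file:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([9, 10, 11, 12, 10, 9, 9, 10, 12, 11])
y = np.array([10, 11, 14, 13, 15, 11, 12, 11, 13, 15])

# Regression coefficients via the SS_xy / SS_xx formulas
n = x.size
b_1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x * x) - n * x.mean() ** 2)
b_0 = y.mean() - b_1 * x.mean()

plt.scatter(x, y, color="m", marker="o")   # observed points
plt.plot(x, b_0 + b_1 * x, color="g")      # fitted regression line
plt.xlabel("x")
plt.ylabel("y")
plt.savefig("regression_line.png")         # use plt.show() in a notebook
```

The line passes through (x̄, ȳ), which is a useful visual sanity check on the fit.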
Conclusion
Weight = 2.21 · Age + 1.34
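The fitted model can then be used for prediction. A minimal sketch using the coefficients from the conclusion above (the Age/Weight dataset that produced them is not shown in this excerpt):

```python
def predict_weight(age):
    # Fitted model from the conclusion: Weight = 2.21 * Age + 1.34
    # (coefficients taken from the notebook's result, dataset not shown)
    return 2.21 * age + 1.34

print(round(predict_weight(10), 2))  # → 23.44
```

This is the payoff of the regression: once b_0 and b_1 are estimated, predicting for a new x is a single linear evaluation.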