Day 2: Linear Regression using Python
Linear regression is a statistical approach for modelling the relationship between a dependent variable and a given set of independent variables.
Simple Linear Regression
Simple linear regression is an approach for predicting a response using a single feature.
Why Linear Regression?
To find the parameters so that the model best fits the data.
Forecasting an effect.
Determining a trend.
How do we determine the best fit line?
The line for which the error between the predicted values and the observed values is minimum is called the best-fit line or the regression line. These errors are also called residuals.
The residuals can be visualized by the vertical lines from the observed data value to the regression line.
Simple Linear Regression
In this regression task we will predict the percentage of marks that a student is expected to score based upon the number of hours they studied. This is a simple linear regression task as it involves just two variables.
Linear regression calculates an equation that minimizes the distance between the fitted line and all of the data points.
Describe the model (pandas and DataFrame conversion)
NumPy for mathematical operations (arrays)
Plots with Matplotlib
scikit-learn module to import the regression algorithm
Using the pandas and scikit-learn libraries, we predict a student's score on the basis of study hours.
%matplotlib inline: with this backend, the output of plotting commands is displayed inline within frontends like the Jupyter notebook, directly below the code cell that produced it.
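A minimal setup sketch; the file name student_scores.csv and its Hours and Scores columns are assumptions for illustration, not confirmed by the notebook:

```python
# Core libraries used throughout the notebook
import pandas as pd                # DataFrames
import numpy as np                 # numerical arrays
import matplotlib.pyplot as plt    # plotting

# Notebook magic: render plots inline below the code cell
%matplotlib inline

# Load the study-hours dataset (file name and column names are assumed)
df = pd.read_csv('student_scores.csv')   # expected columns: Hours, Scores
df.head()
```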
Visualizing fields
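For example, a scatter plot of hours against scores can show whether a linear relationship is plausible (a sketch using the DataFrame loaded above):

```python
# Scatter plot of study hours against percentage score
df.plot(x='Hours', y='Scores', style='o')
plt.title('Hours vs Percentage Score')
plt.xlabel('Hours Studied')
plt.ylabel('Percentage Score')
plt.show()
```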
Train and test sets (checking for validation: the split should preserve every property of the full data).
Common splits: train 80%, 75%, or 70%; test 20%, 25%, or 30%.
Preparing the Data
Now we have an idea about the statistical details of our data. The next step is to divide the data into "target" and "features". Features are the independent variables, while the target is the dependent variable whose values are to be predicted.
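A sketch of this step, assuming the Hours and Scores columns from above:

```python
# Features (independent variable) and target (dependent variable)
X = df[['Hours']].values   # 2-D array of shape (n_samples, 1), as scikit-learn expects
y = df['Scores'].values    # 1-D array of observed scores
```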
training set—a subset to train a model.
test set—a subset to test the trained model.
How to split:
Make sure that your test set meets the following two conditions:
Is large enough to yield statistically meaningful results.
Is representative of the data set as a whole.
Overfitting & Underfitting
Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data. Intuitively, overfitting occurs when the model or the algorithm fits the data too well. Specifically, overfitting occurs if the model or algorithm shows low bias but high variance.
Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough. Specifically, underfitting occurs if the model or algorithm shows low variance but high bias. Underfitting is often a result of an excessively simple model.
Variance is the variability of model prediction for a given data point, or a value that tells us the spread of our data. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.
Correlation Coefficient
The correlation coefficient is a measure of the association between two variables. It is used to find whether a relationship exists in the data and how strong it is. The formula returns a value between -1 and 1, where -1 indicates a perfect negative correlation, +1 a perfect positive correlation, and 0 no correlation.
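With pandas this is straightforward (a sketch, assuming the columns above):

```python
# Pearson correlation coefficient between the two variables
print(df['Hours'].corr(df['Scores']))   # single value in [-1, 1]
print(df.corr())                        # full correlation matrix
```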
Scikit-Learn's built-in train_test_split() method:
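A sketch of the split, using the X and y defined earlier:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```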
The above script splits 80% of the data into the training set and 20% into the test set. The test_size parameter is where we specify the proportion of the test set.
random_state is used to initialize the internal random number generator, which decides the splitting of data into train and test indices. If random_state is None or np.random, then a randomly initialized RandomState object is returned. If random_state is an integer, then it is used to seed a new RandomState object.
Training the Algorithm
We have split our data into training and testing sets, and now is finally the time to train our algorithm.
fit_intercept : boolean, optional, default True: If set to False, no intercept will be used in calculations.
normalize : boolean, optional, default False: This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
copy_X : boolean, optional, default True: If True, X will be copied; else, it may be overwritten.
n_jobs : int or None, optional (default=None): The number of jobs to use for the computation. This will only provide a speedup for n_targets > 1 and sufficiently large problems.
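A minimal training sketch using the defaults described above:

```python
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()     # fit_intercept=True by default
regressor.fit(X_train, y_train)    # learn intercept and slope from the training set

print(regressor.intercept_)   # fitted intercept
print(regressor.coef_)        # fitted slope (one coefficient per feature)
```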
Let's see how the linear regression model finds the best values for the intercept and slope, which result in the line that best fits the data.
Score = 12 + 9 × (study hours)
Insight
This means that for every additional hour studied, the predicted score increases by about 9-10 percentage points (the slope of the fitted line).
Making Predictions
Now that we have trained our algorithm, it's time to make some predictions. To do so, we will use our test data and see how accurately our algorithm predicts the percentage score. To make predictions on the test data, execute a script like the following (a sketch using the regressor fitted above):
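```python
# Predict percentage scores for the held-out test hours
y_pred = regressor.predict(X_test)
```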
To compare the actual output values for X_test with the predicted values, we can place them side by side in a DataFrame (a sketch follows):
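```python
# Observed vs. predicted scores, side by side
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison)
```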
Evaluating the Algorithm
The final step is to evaluate the performance of the algorithm. This step is particularly important for comparing how well different algorithms perform on a particular dataset. For regression algorithms, the following evaluation metrics are commonly used:
Mean Absolute Error (MAE) is the mean of the absolute values of the errors.
Mean Squared Error (MSE) is the mean of the squared errors.
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors.
The R² score, or coefficient of determination, explains how much of the total variance of the dependent variable can be reduced by using least-squares regression; it ranges from 0 to 1.
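All four metrics are available in sklearn.metrics; a sketch using the y_test and y_pred arrays from above:

```python
import numpy as np
from sklearn import metrics

print('MAE: ', metrics.mean_absolute_error(y_test, y_pred))
print('MSE: ', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R^2: ', metrics.r2_score(y_test, y_pred))
```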
Insight:
We reduced the prediction error by ~ 37% by using regression
Performance Improvement by Cross validation
In this approach, we reserve 50% of the dataset for validation and the remaining 50% for model training. However, a major disadvantage of this approach is that, since we train the model on only 50% of the dataset, there is a strong possibility of missing out on interesting information in the data, which leads to higher bias.
Model Correction
One commonly used method for doing this is known as k-fold cross-validation, which uses the following approach:
1. Randomly divide the dataset into k groups, or "folds", of roughly equal size.
2. Choose one of the folds to be the holdout set. Fit the model on the remaining k-1 folds and evaluate it on the holdout fold.
3. Repeat the process k times, holding out a different fold each time, and average the k evaluation scores.
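scikit-learn packages this procedure as cross_val_score; a minimal sketch using the full X and y (the choice of k=5 here is an assumption):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold CV: fit on 4 folds, score (R^2 by default for regressors) on the held-out fold
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)          # one score per fold
print(scores.mean())   # average performance across folds
```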
Insights
There are certain sections of the data for which the prediction performance is greater than 90%.