Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Model Evaluation and Refinement
Objectives
After completing this lab you will be able to:
Evaluate and refine prediction models
First, let's only use numeric data:
Libraries for plotting:
Functions for Plotting
Part 1: Training and Testing
An important step in testing your model is to split your data into training and testing data. We will place the target data price in a separate dataframe y_data:
Drop price data in dataframe x_data:
Now, we randomly split our data into training and testing data using the function train_test_split.
x_data: features or independent variables
y_data: dataset target
x_train, y_train: parts of available data as training set
x_test, y_test: parts of available data as testing set
test_size: percentage of the data for testing (here 10%)
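As a minimal sketch of the split described above: the lab's dataset is not reproduced here, so the dataframe below is synthetic (its values and the random_state are assumptions made only for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab's car data (all values are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "horsepower": rng.uniform(50, 250, size=200),
    "curb-weight": rng.uniform(1500, 4000, size=200),
})
df["price"] = (100 * df["horsepower"] + 2 * df["curb-weight"]
               + rng.normal(0, 2000, size=200))

y_data = df["price"]                # target
x_data = df.drop("price", axis=1)   # features only

# Hold out 10% of the rows for testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.10, random_state=1
)

print("number of test samples:", x_test.shape[0])
print("number of training samples:", x_train.shape[0])
```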
Question #1):
Use the function "train_test_split" to split up the dataset such that 40% of the data samples will be utilized for testing. Set the parameter "random_state" equal to zero. The output of the function should be the following: "x_train1" , "x_test1", "y_train1" and "y_test1".
Let's import LinearRegression from the module linear_model.
We create a Linear Regression object:
We fit the model using the feature "horsepower":
Let's calculate the R^2 on the test data:
We can see the R^2 is much smaller using the test data compared to the training data.
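The fit-and-score steps above might look like the following sketch. The data is synthetic (an assumption), so the gap between the training and test R^2 will not match the lab's numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic horsepower/price data (made up for illustration).
rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 250, size=(200, 1))
price = 100 * horsepower[:, 0] + rng.normal(0, 3000, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    horsepower, price, test_size=0.10, random_state=1
)

lre = LinearRegression()
lre.fit(x_train, y_train)              # train on the training split only

r2_train = lre.score(x_train, y_train)  # R^2 on data the model has seen
r2_test = lre.score(x_test, y_test)     # R^2 on held-out data
print("train R^2:", r2_train)
print("test R^2:", r2_test)
```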
Question #2):
Find the R^2 on the test data using 40% of the dataset for testing.
Sometimes you do not have sufficient testing data; as a result, you may want to perform cross-validation. Let's go over several methods that you can use for cross-validation.
Cross-Validation Score
To overcome the problem of low precision with different combinations of samples, we use cross-validation. In this method, the dataset is split into K equal groups, and each group is referred to as a fold (for example, four folds). Some of the folds are used as a training set, which we use to train the model, and the remaining folds are used as a test set, which we use to test the model.
For example, we can use three folds for training, then use one fold for testing. This is repeated until each partition is used for both training and testing. At the end, we use the average results as the estimate of out-of-sample error. The evaluation metric depends on the model, for example, the R-squared. The simplest way to apply cross-validation is to call the cross_val_score function, which performs multiple out-of-sample evaluations.
This method is imported from sklearn's model selection package. We then use the function cross_val_score.
The first input parameter is the type of model we are using to do the cross-validation. In this example, we initialize a linear regression model or object lre, which we pass to the cross_val_score function. The other parameters are x_data, the predictive variable data, and y_data, the target variable data.
We can manage the number of partitions with the cv parameter. Here, cv = 4, which means the data set is split into four equal partitions. The function returns an array of scores, one for each partition that was chosen as the testing set. We can average the results together to estimate the out-of-sample R-squared.
Let's import cross_val_score from the module model_selection.
We input the object, the feature ("horsepower"), and the target data (y_data). The parameter 'cv' determines the number of folds. In this case, it is 4.
The default scoring is R^2. Each element in the array has the average R^2 value for the fold:
We can calculate the average and standard deviation of our estimate:
We can use negative mean squared error as a score by setting the parameter 'scoring' to 'neg_mean_squared_error'.
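A short sketch of these cross-validation calls, assuming synthetic horsepower and price data in place of the lab's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (made up for illustration).
rng = np.random.default_rng(0)
x_data = rng.uniform(50, 250, size=(200, 1))          # "horsepower"
y_data = 100 * x_data[:, 0] + rng.normal(0, 3000, size=200)

lre = LinearRegression()

# Four folds; a regressor's default score is R^2, one value per fold.
Rcross = cross_val_score(lre, x_data, y_data, cv=4)
print("mean R^2:", Rcross.mean(), "std:", Rcross.std())

# Same folds, scored with negative mean squared error instead.
neg_mse = cross_val_score(
    lre, x_data, y_data, cv=4, scoring="neg_mean_squared_error"
)
mse = -neg_mse   # flip the sign to recover the usual (positive) MSE
print("MSE per fold:", mse)
```

scikit-learn negates the error so that, as with R^2, a larger score always means a better model.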
Question #3):
Calculate the average R^2 using two folds, then find the average R^2 for the second fold utilizing the "horsepower" feature:
We input the object, the feature "horsepower", and the target data y_data. The parameter 'cv' determines the number of folds. In this case, it is 4. We can produce an output:
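The text above does not name the function that produces this output. One scikit-learn function that takes the same inputs (the object, the feature, the target, and cv) is cross_val_predict, so the sketch below uses it as an assumption, again on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic data (made up for illustration).
rng = np.random.default_rng(0)
x_data = rng.uniform(50, 250, size=(200, 1))          # "horsepower"
y_data = 100 * x_data[:, 0] + rng.normal(0, 3000, size=200)

lre = LinearRegression()

# Each sample is predicted by a model trained on the folds
# that do not contain it, so every row gets one prediction.
yhat = cross_val_predict(lre, x_data, y_data, cv=4)
print(yhat[0:5])
```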
Part 2: Overfitting, Underfitting and Model Selection
It turns out that the test data, sometimes referred to as the "out of sample data", is a much better measure of how well your model performs in the real world. One reason for this is overfitting.
Let's go over some examples. It turns out these differences are more apparent in Multiple Linear Regression and Polynomial Regression so we will explore overfitting in that context.
Let's create Multiple Linear Regression objects and train the model using 'horsepower', 'curb-weight', 'engine-size' and 'highway-mpg' as features.
Prediction using training data:
Prediction using test data:
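A sketch of the multiple linear regression steps above. The four column names come from the text; the data itself, the 10% test size, and the random_state are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

features = ["horsepower", "curb-weight", "engine-size", "highway-mpg"]

# Synthetic stand-in for the lab's car data (all values are made up).
rng = np.random.default_rng(0)
n = 200
x_data = pd.DataFrame({
    "horsepower": rng.uniform(50, 250, n),
    "curb-weight": rng.uniform(1500, 4000, n),
    "engine-size": rng.uniform(60, 300, n),
    "highway-mpg": rng.uniform(15, 55, n),
})
y_data = (40 * x_data["horsepower"] + 4 * x_data["curb-weight"]
          + 50 * x_data["engine-size"] - 100 * x_data["highway-mpg"]
          + rng.normal(0, 2000, n))

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.10, random_state=1
)

lr = LinearRegression()
lr.fit(x_train[features], y_train)

yhat_train = lr.predict(x_train[features])  # prediction using training data
yhat_test = lr.predict(x_test[features])    # prediction using test data
print(yhat_train[0:5])
print(yhat_test[0:5])
```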
Let's perform some model evaluation using our training and testing data separately. First, we import the seaborn and matplotlib libraries for plotting.
Let's examine the distribution of the predicted values of the training data.
Figure 1: Plot of predicted values using the training data compared to the actual values of the training data.
So far, the model seems to be doing well in learning from the training dataset. But what happens when the model encounters new data from the testing dataset? When the model generates new values from the test data, we see the distribution of the predicted values is much different from the actual target values.
Figure 2: Plot of predicted value using the test data compared to the actual values of the test data.
Comparing Figure 1 and Figure 2, it is evident that the distribution of the predicted values in Figure 1 (training data) fits the actual values much better. In Figure 2, the difference is most apparent in the range of 5,000 to 15,000, where the shapes of the two distributions are extremely different. Let's see if polynomial regression also exhibits a drop in the prediction accuracy when analysing the test dataset.
Overfitting
Overfitting occurs when the model fits the noise, but not the underlying process. Therefore, when testing your model using the test set, your model does not perform as well since it is modelling noise, not the underlying process that generated the relationship. Let's create a degree 5 polynomial model.
Let's use 55 percent of the data for training and the rest for testing:
We will perform a degree 5 polynomial transformation on the feature 'horsepower'.
Now, let's create a Linear Regression model "poly" and train it.
We can see the output of our model using the method "predict." We assign the values to "yhat".
Let's take the first five predicted values and compare them to the actual targets.
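The degree 5 polynomial steps above can be sketched as follows. The data is synthetic and the random_state is an assumption. The transformer is fitted on the training split and then applied to the test split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic horsepower/price data (made up for illustration).
rng = np.random.default_rng(0)
x_data = rng.uniform(50, 250, size=(200, 1))
y_data = 100 * x_data[:, 0] + rng.normal(0, 3000, size=200)

# 55% of the data for training, the remaining 45% for testing.
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.45, random_state=0
)

# Degree 5 transformation of the single feature:
# columns are 1, x, x^2, x^3, x^4, x^5.
pr = PolynomialFeatures(degree=5)
x_train_pr = pr.fit_transform(x_train)
x_test_pr = pr.transform(x_test)

poly = LinearRegression()
poly.fit(x_train_pr, y_train)

yhat = poly.predict(x_test_pr)
print("Predicted values:", yhat[0:5])
print("True values:", y_test[0:5])
```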
We will use the function "PollyPlot" that we defined at the beginning of the lab to display the training data, testing data, and the predicted function.
Figure 3: A polynomial regression model where red dots represent training data, green dots represent test data, and the blue line represents the model prediction.
We see that the estimated function appears to track the data but around 200 horsepower, the function begins to diverge from the data points.
R^2 of the training data:
R^2 of the test data:
We see the R^2 for the training data is 0.5567 while the R^2 on the test data was -29.87. The lower the R^2, the worse the model. A negative R^2 is a sign of overfitting.
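To see how R^2 can become negative at all, here is a small illustration with made-up numbers. R^2 compares the model's squared error to that of always predicting the mean of the targets, so predictions that are worse than the mean give a value below zero.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 12.0, 14.0, 16.0])

# Baseline: always predict the mean of the targets.
y_mean_pred = np.full_like(y_true, y_true.mean())

# Predictions that are far from the targets (made-up values).
y_bad_pred = np.array([30.0, 0.0, 40.0, -5.0])

r2_mean = r2_score(y_true, y_mean_pred)  # the baseline scores exactly 0
r2_bad = r2_score(y_true, y_bad_pred)    # worse than the baseline: negative
print(r2_mean, r2_bad)
```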
Let's see how the R^2 changes on the test data for different order polynomials and then plot the results:
We see the R^2 gradually increases until an order three polynomial is used. Then, the R^2 dramatically decreases at an order four polynomial.
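A sketch of the loop over polynomial orders, with the plotting step left out. The data is synthetic (an assumption), so the rise to order three and the drop at order four described above will not necessarily appear.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic horsepower/price data (made up for illustration).
rng = np.random.default_rng(0)
x_data = rng.uniform(50, 250, size=(200, 1))
y_data = 100 * x_data[:, 0] + rng.normal(0, 3000, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.45, random_state=0
)

Rsqu_test = []
order = [1, 2, 3, 4]
lr = LinearRegression()

for n in order:
    pr = PolynomialFeatures(degree=n)
    x_train_pr = pr.fit_transform(x_train)
    x_test_pr = pr.transform(x_test)
    lr.fit(x_train_pr, y_train)
    # Store the test R^2 for this polynomial order.
    Rsqu_test.append(lr.score(x_test_pr, y_test))

for n, r2 in zip(order, Rsqu_test):
    print("order", n, "test R^2:", r2)
```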
The following function will be used in the next section. Please run the cell below.
The following interface allows you to experiment with different polynomial orders and different amounts of data.
Question #4a):
We can perform polynomial transformations with more than one feature. Create a "PolynomialFeatures" object "pr1" of degree two.
Question #4b):
Transform the training and testing samples for the features 'horsepower', 'curb-weight', 'engine-size' and 'highway-mpg'. Hint: use the method "fit_transform".
Question #4c):
How many dimensions does the new feature have? Hint: use the attribute "shape".
Question #4d):
Create a linear regression model "poly1". Train the object using the method "fit" using the polynomial features.
Question #4e):
Use the method "predict" to predict an output on the polynomial features, then use the function "DistributionPlot" to display the distribution of the predicted test output vs. the actual test data.
Question #4f):
Using the distribution plot above, describe (in words) the two regions where the predicted prices are less accurate than the actual prices.
Part 3: Ridge Regression
In this section, we will review Ridge Regression and see how the parameter alpha changes the model. Just a note, here our test data will be used as validation data.
Ridge regression is a regression technique employed in a multiple regression model when multicollinearity occurs. Multicollinearity is when there is a strong relationship among the independent variables. Ridge regression is very common with polynomial regression. Ridge regression can be used to regularize and reduce the standard errors to avoid over-fitting a regression model.
We start with an alpha value and use it as an argument in the constructor. We train the model, make a prediction using the validation data, then calculate the R-squared and store the value.
Repeat the process for a larger value of alpha.
We train the model again, make a prediction using the validation data, then calculate the R-squared and store the values of R-squared.
We repeat the process for a different alpha value, training the model, and making a prediction.
We select the value of alpha that maximizes the R-squared.
The general syntax is
from sklearn.linear_model import Ridge
RidgeModel = Ridge(alpha = 0.1)
RidgeModel.fit(X,y)
Yhat = RidgeModel.predict(X)
Let's perform a degree two polynomial transformation on our data.
Let's import Ridge from the module linear_model.
Let's create a Ridge regression object, setting the regularization parameter (alpha) to 0.1:
Like regular regression, you can fit the model using the method fit.
Similarly, you can obtain a prediction:
Let's compare the first five predicted samples to our test set:
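Put together, the Ridge steps above might look like this sketch. The feature names are taken from earlier in the lab; the data, the split, and the random_state are assumptions. scikit-learn may warn about an ill-conditioned matrix here, because the polynomial features are on very different scales.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

features = ["horsepower", "curb-weight", "engine-size", "highway-mpg"]

# Synthetic stand-in for the lab's car data (all values are made up).
rng = np.random.default_rng(0)
n = 200
x_data = pd.DataFrame({
    "horsepower": rng.uniform(50, 250, n),
    "curb-weight": rng.uniform(1500, 4000, n),
    "engine-size": rng.uniform(60, 300, n),
    "highway-mpg": rng.uniform(15, 55, n),
})
y_data = (40 * x_data["horsepower"] + 4 * x_data["curb-weight"]
          + 50 * x_data["engine-size"] - 100 * x_data["highway-mpg"]
          + rng.normal(0, 2000, n))

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.45, random_state=0
)

# Degree two polynomial transformation of the four features.
pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train[features])
x_test_pr = pr.transform(x_test[features])

RidgeModel = Ridge(alpha=0.1)       # alpha is the regularization strength
RidgeModel.fit(x_train_pr, y_train)
yhat = RidgeModel.predict(x_test_pr)

print("predicted:", yhat[0:5])
print("test set :", y_test[0:5].values)
```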
We select the value of alpha that minimizes the test error. To do so, we can use a for loop. We have also created a progress bar to see how many iterations we have completed so far.
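A sketch of that for loop, without the progress bar. The list of alpha values is hypothetical and the data is synthetic. Choosing the alpha with the highest validation R^2 is one way to pick the value with the lowest test error.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the lab's car data (all values are made up).
rng = np.random.default_rng(0)
n = 200
x_data = pd.DataFrame({
    "horsepower": rng.uniform(50, 250, n),
    "curb-weight": rng.uniform(1500, 4000, n),
    "engine-size": rng.uniform(60, 300, n),
    "highway-mpg": rng.uniform(15, 55, n),
})
y_data = (40 * x_data["horsepower"] + 4 * x_data["curb-weight"]
          + 50 * x_data["engine-size"] - 100 * x_data["highway-mpg"]
          + rng.normal(0, 2000, n))

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.45, random_state=0
)
pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train)
x_test_pr = pr.transform(x_test)

alphas = [0.001, 0.1, 1, 10, 100, 1000]   # hypothetical grid of values
Rsqu_train, Rsqu_test = [], []

for alpha in alphas:
    model = Ridge(alpha=alpha)
    model.fit(x_train_pr, y_train)
    Rsqu_train.append(model.score(x_train_pr, y_train))
    Rsqu_test.append(model.score(x_test_pr, y_test))

# Keep the alpha whose validation (test) R^2 is highest.
best_alpha = alphas[int(np.argmax(Rsqu_test))]
print("best alpha:", best_alpha)
```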
We can plot out the value of R^2 for different alphas:
Figure 4: The blue line represents the R^2 of the validation data, and the red line represents the R^2 of the training data. The x-axis represents the different values of Alpha.
Here the model is built and tested on the same data, so the training and test data are the same.
The red line in Figure 4 represents the R^2 of the training data. As alpha increases, the R^2 decreases. Therefore, as alpha increases, the model performs worse on the training data.
The blue line represents the R^2 on the validation data. As the value for alpha increases, the R^2 increases and converges at a point.
Question #5):
Perform Ridge regression. Calculate the R^2 using the polynomial features, use the training data to train the model and use the test data to test the model. The parameter alpha should be set to 10.
Part 4: Grid Search
The term alpha is a hyperparameter.
Grid Search allows us to scan through multiple free parameters with few lines of code. Grid Search takes the model or objects you would like to train and different values of the hyperparameters. It then calculates the mean square error or R-squared for various hyperparameter values, allowing you to choose the best values.
Let's import GridSearchCV from the module model_selection.
We create a dictionary of parameter values:
Create a Ridge regression object:
Create a ridge grid search object:
Fit the model:
The object finds the best parameter values on the validation data. We can obtain the estimator with the best parameters and assign it to the variable BestRR as follows:
We now test our model on the test data:
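The grid search steps above can be sketched as follows. The alpha values, the four-fold setting, and the synthetic data are assumptions; the grid search is fitted on the training split, and the held-out split is used only for the final score.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the lab's car data (all values are made up).
rng = np.random.default_rng(0)
n = 200
x_data = pd.DataFrame({
    "horsepower": rng.uniform(50, 250, n),
    "curb-weight": rng.uniform(1500, 4000, n),
    "engine-size": rng.uniform(60, 300, n),
    "highway-mpg": rng.uniform(15, 55, n),
})
y_data = (40 * x_data["horsepower"] + 4 * x_data["curb-weight"]
          + 50 * x_data["engine-size"] - 100 * x_data["highway-mpg"]
          + rng.normal(0, 2000, n))

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.45, random_state=0
)

# Dictionary of hyperparameter values to try (hypothetical grid).
alpha_values = [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000]
parameters1 = [{"alpha": alpha_values}]

RR = Ridge()
Grid1 = GridSearchCV(RR, parameters1, cv=4)  # 4-fold CV for each alpha
Grid1.fit(x_train, y_train)

BestRR = Grid1.best_estimator_   # Ridge refitted with the best alpha
print("best alpha:", BestRR.alpha)
print("test R^2:", BestRR.score(x_test, y_test))
```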