Path: blob/main/C2 - Advanced Learning Algorithms/week3/C2W3A1/C2_W3_Assignment.ipynb
Practice Lab: Advice for Applying Machine Learning
In this lab, you will explore techniques to evaluate and improve your machine learning models.
Outline
1 - Packages
First, let's run the cell below to import all the packages that you will need during this assignment.
numpy is the fundamental package for scientific computing with Python.
matplotlib is a popular library to plot graphs in Python.
scikit-learn is a basic library for data mining and machine learning.
tensorflow is a popular platform for machine learning.
2 - Evaluating a Learning Algorithm (Polynomial Regression)
Let's say you have created a machine learning model and you find it fits your training data very well. You're done? Not quite. The goal of creating the model was to be able to predict values for new examples.
How can you test your model's performance on new data before deploying it? The answer has two parts:
- Split your original data set into "training" and "test" sets.
  - Use the training data to fit the parameters of the model.
  - Use the test data to evaluate the model on new data.
- Develop an error function to evaluate your model.
2.1 Splitting your data set
The lectures advised reserving 20-40% of your data set for testing. Let's use the sklearn function train_test_split to perform the split. Double-check the shapes after running the following cell.
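A minimal sketch of the split is shown below. The data and variable names here are illustrative stand-ins, not the lab's exact generated data; the lab builds a quadratic data set with noise.

```python
# Minimal sketch: a stand-in quadratic-with-noise data set and a 2-way split.
# Variable names (x, y, x_train, ...) are illustrative, not the lab's exact ones.
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(0, 20, 1).reshape(-1, 1)               # feature column
y = x.squeeze() ** 2 + np.random.randn(20) * 10      # quadratic target with noise

# Reserve about a third of the data for the test set.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=1)
print("train:", x_train.shape, "test:", x_test.shape)
```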
2.1.1 Plot Train, Test sets
You can see below the data points that will be part of training (in red) are intermixed with those that the model is not trained on (test). This particular data set is a quadratic function with noise added. The "ideal" curve is shown for reference.
All tests passed.
2.3 Compare performance on training and test data
Let's build a high degree polynomial model to minimize training error. This will use the linear regression functions from sklearn. The code is in the imported utility file if you would like to see the details; a rough sketch also follows the list below. The steps are:
- create and fit the model ('fit' is another name for training, or running gradient descent)
- compute the error on the training data
- compute the error on the test data
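A sketch of these steps, assuming the x_train, x_test, y_train, y_test arrays from the earlier split; the division by 2 mirrors the lectures' 1/(2m) cost factor, and the lab's utility file may differ in detail.

```python
# Sketch of the three steps: map the input to high-degree polynomial features,
# fit a linear model, then compare training and test error (MSE / 2).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error

degree = 10                                           # deliberately high degree
poly = PolynomialFeatures(degree, include_bias=False)
scaler = StandardScaler()
X_train_p = scaler.fit_transform(poly.fit_transform(x_train))
X_test_p = scaler.transform(poly.transform(x_test))

model = LinearRegression().fit(X_train_p, y_train)    # create and fit

err_train = mean_squared_error(y_train, model.predict(X_train_p)) / 2
err_test = mean_squared_error(y_test, model.predict(X_test_p)) / 2
print(f"training err {err_train:0.2f}, test err {err_test:0.2f}")
```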
The computed error on the training set is substantially less than that of the test set.
The following plot shows why this is. The model fits the training data very well. To do so, it has created a complex function. The test data was not part of the training, and the model does a poor job of predicting on this data. This model would be described as 1) overfitting, 2) having high variance, and 3) generalizing poorly.
The test set error shows this model will not work well on new data. If you use the test error to guide improvements in the model, then the model will perform well on the test data... but the test data was meant to represent new data. You need yet another set of data to test new data performance.
The proposal made during lecture is to separate data into three groups. The distribution of training, cross-validation and test sets shown in the below table is a typical distribution, but can be varied depending on the amount of data available.
| data | % of total | Description |
|---|---|---|
| training | 60 | Data used to tune model parameters $w$ and $b$ in training or fitting |
| cross-validation | 20 | Data used to tune other model parameters like the degree of the polynomial, regularization, or the architecture of a neural network |
| test | 20 | Data used to test the model after tuning to gauge performance on new data |
Let's generate three data sets below. We'll once again use train_test_split from sklearn, but will call it twice to get three splits:
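A minimal sketch of the two calls, assuming x and y hold the full generated data set:

```python
# Sketch: two calls to train_test_split give a 60/20/20 train/CV/test split.
from sklearn.model_selection import train_test_split

# First call: hold out 40% of the data.
x_train, x_, y_train, y_ = train_test_split(x, y, test_size=0.40, random_state=1)

# Second call: split the held-out 40% in half -> 20% cross-validation, 20% test.
x_cv, x_test, y_cv, y_test = train_test_split(x_, y_, test_size=0.50, random_state=1)

print(f"train: {x_train.shape}, cv: {x_cv.shape}, test: {x_test.shape}")
```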
3 - Bias and Variance
Above, it was clear the degree of the polynomial model was too high. How can you choose a good value? It turns out, as shown in the diagram, the training and cross-validation performance can provide guidance. By trying a range of degree values, the training and cross-validation performance can be evaluated. As the degree becomes too large, the cross-validation performance will start to degrade relative to the training performance. Let's try this on our example.
3.2 Finding the optimal degree
In previous labs, you found that you could create a model capable of fitting complex curves by utilizing a polynomial (see the Course 1, Week 2 Feature Engineering and Polynomial Regression Lab). Further, you demonstrated that by increasing the degree of the polynomial, you could create overfitting (see the Course 1, Week 3 Over-Fitting Lab). Let's use that knowledge here to test our ability to tell the difference between over-fitting and under-fitting.
Let's train the model repeatedly, increasing the degree of the polynomial each iteration. Here, we're going to use the scikit-learn linear regression model for speed and simplicity.
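One way to sketch this sweep, assuming the x_train, x_cv, y_train, y_cv arrays from the split above; the lab's plotting utilities are not reproduced here.

```python
# Sketch of the degree sweep: fit one model per degree and record train/CV error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error

max_degree = 9
err_train = np.zeros(max_degree)
err_cv = np.zeros(max_degree)

for degree in range(1, max_degree + 1):
    poly = PolynomialFeatures(degree, include_bias=False)
    scaler = StandardScaler()
    X_tr = scaler.fit_transform(poly.fit_transform(x_train))
    X_cv = scaler.transform(poly.transform(x_cv))

    model = LinearRegression().fit(X_tr, y_train)
    err_train[degree - 1] = mean_squared_error(y_train, model.predict(X_tr)) / 2
    err_cv[degree - 1] = mean_squared_error(y_cv, model.predict(X_cv)) / 2

optimal_degree = np.argmin(err_cv) + 1   # degree with the lowest cross-validation error
print("optimal degree:", optimal_degree)
```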
Let's plot the result:
The plot above demonstrates that separating data into two groups, data the model is trained on and data the model has not been trained on, can be used to determine if the model is underfitting or overfitting. In our example, we created a variety of models varying from underfitting to overfitting by increasing the degree of the polynomial used.
On the left plot, the solid lines represent the predictions from these models. A polynomial model with degree 1 produces a straight line that intersects very few data points, while the maximum degree hews very closely to every data point.
On the right:
- The error on the training data (blue) decreases as the model complexity increases, as expected.
- The error on the cross-validation data decreases initially as the model starts to conform to the data, but then increases as the model starts to over-fit the training data (fails to generalize).
It's worth noting that the curves in these examples are not as smooth as one might draw for a lecture. It's clear the specific data points assigned to each group can change your results significantly. The general trend is what is important.
3.3 Tuning Regularization
In previous labs, you have utilized regularization to reduce overfitting. Similar to degree, one can use the same methodology to tune the regularization parameter lambda ($\lambda$).
Let's demonstrate this by starting with a high degree polynomial and varying the regularization parameter.
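One hedged way to sketch this sweep is with sklearn's Ridge regression, whose alpha parameter plays the role of lambda here; the lab's utility functions may implement the sweep differently.

```python
# Sketch of the lambda sweep at a fixed high degree, using Ridge(alpha=lambda).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error

degree = 10
lambdas = np.array([1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100])

poly = PolynomialFeatures(degree, include_bias=False)
scaler = StandardScaler()
X_tr = scaler.fit_transform(poly.fit_transform(x_train))
X_cv = scaler.transform(poly.transform(x_cv))

err_train = np.zeros(len(lambdas))
err_cv = np.zeros(len(lambdas))
for i, lam in enumerate(lambdas):
    model = Ridge(alpha=lam).fit(X_tr, y_train)
    err_train[i] = mean_squared_error(y_train, model.predict(X_tr)) / 2
    err_cv[i] = mean_squared_error(y_cv, model.predict(X_cv)) / 2

optimal_lambda = lambdas[np.argmin(err_cv)]
print("optimal lambda:", optimal_lambda)
```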
Above, the plots show that as regularization increases, the model moves from a high variance (overfitting) model to a high bias (underfitting) model. The vertical line in the right plot shows the optimal value of lambda. In this example, the polynomial degree was set to 10.
The above plots show that when a model has high variance and is overfitting, adding more examples improves performance. Note the curves on the left plot. The final curve, with the highest value of $m$ (number of examples), is a smooth curve that is in the center of the data. On the right, as the number of examples increases, the performance of the training set and cross-validation set converge to similar values. Note that the curves are not as smooth as one might see in a lecture. That is to be expected. The trend remains clear: more data improves generalization.
Note that adding more examples when the model has high bias (underfitting) does not improve performance.
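A learning-curve sweep like the one described above could be sketched as follows, again assuming the earlier splits, a reasonably large training set as in the lab's generated data, and illustrative degree and lambda values; plotting is omitted.

```python
# Sketch of a learning curve: train on increasing subsets of the training data
# and track how training and cross-validation error evolve.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error

degree, lam = 10, 0.01                                 # illustrative choices
sizes = np.linspace(20, len(x_train), 10, dtype=int)   # increasing values of m
err_train = np.zeros(len(sizes))
err_cv = np.zeros(len(sizes))

for i, m in enumerate(sizes):
    poly = PolynomialFeatures(degree, include_bias=False)
    scaler = StandardScaler()
    X_tr = scaler.fit_transform(poly.fit_transform(x_train[:m]))
    X_cv = scaler.transform(poly.transform(x_cv))

    model = Ridge(alpha=lam).fit(X_tr, y_train[:m])
    err_train[i] = mean_squared_error(y_train[:m], model.predict(X_tr)) / 2
    err_cv[i] = mean_squared_error(y_cv, model.predict(X_cv)) / 2
```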
Above, you can see the data on the left. There are six clusters identified by color. Both training points (dots) and cross-validation points (triangles) are shown. The interesting points are those that fall in ambiguous locations where either cluster might consider them members. What would you expect a neural network model to do? What would be an example of overfitting? Underfitting? On the right is an example of an 'ideal' model, or a model one might create knowing the source of the data. The lines represent 'equal distance' boundaries where the distance between center points is equal. It's worth noting that this model would "misclassify" roughly 8% of the total data set.
4.2 Evaluating categorical model by calculating classification error
The evaluation function for categorical models used here is simply the fraction of incorrect predictions:
$$ J_{cv} = \frac{1}{m}\sum_{i=0}^{m-1}
\begin{cases}
1, & \text{if } \hat{y}^{(i)} \neq y^{(i)} \\
0, & \text{otherwise}
\end{cases}$$
Exercise 2
Below, complete the routine to calculate classification error. Note, in this lab, target values are the index of the category and are not one-hot encoded.
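One possible implementation is sketched below; the function name is illustrative and not necessarily the graded routine's exact signature.

```python
import numpy as np

def classification_error(y, yhat):
    """Return the fraction of examples where the predicted category differs
    from the target (targets are category indices, not one-hot encoded)."""
    m = len(y)
    incorrect = np.sum(yhat != y)   # count of mispredictions
    return incorrect / m

# Example: one wrong prediction out of three -> 0.333
print(f"categorization error {classification_error(np.array([1, 2, 3]), np.array([1, 2, 0])):0.3f}")
```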
categorization error 0.333, expected:0.333
categorization error 0.250, expected:0.250
All tests passed.
All tests passed.
5 - Model Complexity
Below, you will build two models. A complex model and a simple model. You will evaluate the models to determine if they are likely to overfit or underfit.
5.1 Complex model
Exercise 3
Below, compose a three-layer model:
- Dense layer with 120 units, relu activation
- Dense layer with 40 units, relu activation
- Dense layer with 6 units and a linear activation (not softmax)

Compile using:
- loss with SparseCategoricalCrossentropy, remembering to use from_logits=True
- Adam optimizer with a learning rate of 0.01

A sketch of one possible implementation follows.
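This is a sketch using the Keras Sequential API; variable names are illustrative and it is not presented as the graded solution.

```python
# Sketch of the complex three-layer model described above.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential(
    [
        Dense(120, activation="relu"),
        Dense(40, activation="relu"),
        Dense(6, activation="linear"),   # logits out; softmax is applied inside the loss
    ],
    name="Complex",
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
)
# model.fit(X_train, y_train, epochs=1000)   # data and epoch count come from the lab's cells
```

Using a linear output layer together with from_logits=True lets the loss apply the softmax internally, which is the numerically stable formulation.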
Model: "Complex"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 120) 360
dense_1 (Dense) (None, 40) 4840
dense_2 (Dense) (None, 6) 246
=================================================================
Total params: 5,446
Trainable params: 5,446
Non-trainable params: 0
_________________________________________________________________
All tests passed!
This model has worked very hard to capture outliers of each category. As a result, it has miscategorized some of the cross-validation data. Let's calculate the classification error.
5.2 Simple model
Now, let's try a simple model
Exercise 4
Below, compose a two-layer model:
- Dense layer with 6 units, relu activation
- Dense layer with 6 units and a linear activation

Compile using:
- loss with SparseCategoricalCrossentropy, remembering to use from_logits=True
- Adam optimizer with a learning rate of 0.01

A sketch of one possible implementation follows.
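A corresponding sketch, differing from the complex model above only in its layers; the same caveats apply.

```python
# Sketch of the simple two-layer model described above.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model_s = Sequential(
    [
        Dense(6, activation="relu"),
        Dense(6, activation="linear"),   # linear output; softmax folded into the loss
    ],
    name="Simple",
)
model_s.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
)
```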
Model: "Simple"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_3 (Dense) (None, 6) 18
dense_4 (Dense) (None, 6) 42
=================================================================
Total params: 60
Trainable params: 60
Non-trainable params: 0
_________________________________________________________________
All tests passed!
This simple model does pretty well. Let's calculate the classification error.
Our simple model has a little higher classification error on training data but does better on cross-validation data than the more complex model.
6 - Regularization
As in the case of polynomial regression, one can apply regularization to moderate the impact of a more complex model. Let's try this below.
Exercise 5
Reconstruct your complex model, but this time include regularization. Below, compose a three-layer model:
- Dense layer with 120 units, relu activation, kernel_regularizer=tf.keras.regularizers.l2(0.1)
- Dense layer with 40 units, relu activation, kernel_regularizer=tf.keras.regularizers.l2(0.1)
- Dense layer with 6 units and a linear activation

Compile using:
- loss with SparseCategoricalCrossentropy, remembering to use from_logits=True
- Adam optimizer with a learning rate of 0.01

A sketch of one possible implementation follows.
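A sketch with the L2 kernel regularizers added; the same caveats as before apply.

```python
# Sketch of the regularized three-layer model described above.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

model_r = Sequential(
    [
        Dense(120, activation="relu", kernel_regularizer=l2(0.1)),
        Dense(40, activation="relu", kernel_regularizer=l2(0.1)),
        Dense(6, activation="linear"),
    ]
)
model_r.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
)
```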
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_5 (Dense) (None, 120) 360
dense_6 (Dense) (None, 40) 4840
dense_7 (Dense) (None, 6) 246
=================================================================
Total params: 5,446
Trainable params: 5,446
Non-trainable params: 0
_________________________________________________________________
All tests passed!
The results look very similar to the 'ideal' model. Let's check classification error.
The simple model does a bit better on the training set than the regularized model, but worse on the cross-validation set.
As regularization is increased, the performance of the model on the training and cross-validation data sets converge. For this data set and model, lambda > 0.01 seems to be a reasonable choice.
Our test set is small and seems to have a number of outliers so classification error is high. However, the performance of our optimized models is comparable to ideal performance.
Congratulations!
You have become familiar with important tools to apply when evaluating your machine learning models. Namely:
- Splitting data into trained and untrained sets allows you to differentiate between underfitting and overfitting.
- Creating three data sets (Training, Cross-Validation, and Test) allows you to:
  - train your parameters with the training set
  - tune model parameters such as complexity, regularization, and number of examples with the cross-validation set
  - evaluate your 'real world' performance using the test set.
- Comparing training vs. cross-validation performance provides insight into a model's propensity towards overfitting (high variance) or underfitting (high bias).