Path: blob/main/C3/W2/assignment/C3W2_Assignment.ipynb
Week 2: Diving deeper into the BBC News archive
Welcome! In this assignment you will be revisiting the BBC News Classification Dataset, which contains 2225 examples of news articles with their respective labels.
This time you will not only work with the tokenization process, but you will also create a classifier using specialized layers for text data such as Embedding and GlobalAveragePooling1D.
TIPS FOR SUCCESSFUL GRADING OF YOUR ASSIGNMENT:
All cells are frozen except for the ones where you need to submit your solution or where it is explicitly mentioned that you can interact with them.
You can add new cells to experiment, but these will be omitted by the grader, so don't rely on newly created cells to host your solution code; use the places provided for this.
You can add the comment # grade-up-to-here in any graded cell to signal the grader that it must only evaluate up to that point. This is helpful if you want to check if you are on the right track even if you are not done with the whole assignment. Be sure to remember to delete the comment afterwards!
Avoid using global variables unless you absolutely have to. The grader tests your code in an isolated environment without running all cells from the top. As a result, global variables may be unavailable when scoring your submission. Global variables that are meant to be used will be defined in UPPERCASE.
To submit your notebook, save it and then click on the blue submit button at the beginning of the page.
Let's get started!
For this assignment the data comes from a CSV file. You can find the file bbc-text.csv under the ./data folder. Run the next cell to take a peek at the structure of the data.
As you can see, each data point is composed of the category of the news article followed by a comma and then the actual text of the article. The comma here is used to delimit columns.
Defining useful global variables
Next you will define some global variables that will be used throughout the assignment. Feel free to reference them in the upcoming exercises:
VOCAB_SIZE: The maximum number of words to keep, based on word frequency. Defaults to 1000.
EMBEDDING_DIM: Dimension of the dense embedding, which will be used in the embedding layer of the model. Defaults to 16.
MAX_LENGTH: Maximum length of all sequences. Defaults to 120.
TRAINING_SPLIT: Proportion of data used for training. Defaults to 0.8.
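For reference, a cell defining these globals with the default values listed above would look like this:

```python
# Global variables used throughout the assignment (grading uses these same values)
VOCAB_SIZE = 1000       # maximum number of words to keep, based on word frequency
EMBEDDING_DIM = 16      # dimension of the dense embedding
MAX_LENGTH = 120        # maximum length of all sequences
TRAINING_SPLIT = 0.8    # proportion of data used for training
```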
A note about grading:
When you submit this assignment for grading, these same values for the globals will be used, so make sure all your code works well with them. After submitting and passing this assignment, you are encouraged to come back here and play with these parameters to see the impact they have on the classification process. Since this next cell is frozen, you will need to copy its contents into a new cell and run it to overwrite the values of these globals.
Loading and pre-processing the data
Go ahead and open the data by running the cell below. While there are many ways to do this, this implementation takes advantage of the NumPy function loadtxt to load the data. Since the file is saved in CSV format, you need to set the parameter delimiter=',', otherwise the function splits at whitespace by default. You also need to set dtype='str' to indicate that the expected content type is a string.
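As a rough sketch (assuming the file lives at ./data/bbc-text.csv and its first row is a header that should be skipped; the provided cell may differ in these details), the loading step looks something like this:

```python
import numpy as np

# Load the CSV as strings, splitting columns on commas and skipping the header row
data = np.loadtxt("./data/bbc-text.csv", delimiter=',', skiprows=1, dtype='str', comments=None)
print(data.shape)  # expected: (2225, 2)
```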
As expected, you get a NumPy array with shape (2225, 2), meaning you have 2225 rows and 2 columns. As seen in the output of the previous cell, the first column corresponds to the labels and the second one to the texts.
Expected Output:
Training - Validation Datasets
Exercise 1: train_val_datasets
Now you will code the train_val_datasets function, which, given the data array, should return the training and validation datasets consisting of (text, label) pairs. For this last part, you will be using the tf.data.Dataset.from_tensor_slices method.
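A minimal sketch of one possible approach, assuming data is the (2225, 2) array loaded earlier and TRAINING_SPLIT is the global defined above (everything other than the function name is illustrative):

```python
import tensorflow as tf

def train_val_datasets(data):
    # Number of examples that go into the training split
    train_size = int(len(data) * TRAINING_SPLIT)

    # Column 0 holds the labels, column 1 holds the texts
    labels = data[:, 0]
    texts = data[:, 1]

    # Build (text, label) datasets for each split
    train_dataset = tf.data.Dataset.from_tensor_slices((texts[:train_size], labels[:train_size]))
    validation_dataset = tf.data.Dataset.from_tensor_slices((texts[train_size:], labels[train_size:]))

    return train_dataset, validation_dataset
```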
Expected Output:
Vectorization - Sequences and padding
With your training and validation data, it is now time to perform the vectorization. First, however, there is an important intermediate step: defining a standardize function, which will be applied to every entry in your dataset to standardize it. In this case the function removes stopwords from the texts, which should improve the performance of your classifier by removing frequently used words that don't help determine the topic of a news article. The function also removes any punctuation and makes all words lowercase. It is already provided for you and can be found in the cell below:
Run the cell below to see this standardizing function in action. You can also try with your own sentences:
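The exact implementation is the one in the provided cell, but conceptually a standardization function of this kind looks something like the following sketch (the stopword list shown here is only a small illustrative subset, not the one used in the notebook):

```python
import tensorflow as tf

STOPWORDS = ["a", "an", "and", "in", "of", "the", "to"]  # illustrative subset only

def standardize_func(sentence):
    sentence = tf.strings.lower(sentence)                             # lowercase everything
    sentence = tf.strings.regex_replace(sentence, r"[^a-z0-9 ]", "")  # drop punctuation
    for word in STOPWORDS:
        sentence = tf.strings.regex_replace(sentence, rf"\b{word}\b", "")  # remove stopwords
    return sentence
```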
Exercise 2: fit_vectorizer
Next, complete the fit_vectorizer function below. This function should return a TextVectorization layer that has already been fitted on the training sentences. The vocabulary learned by the vectorizer should contain at most VOCAB_SIZE words, and the output sequences should be truncated to MAX_LENGTH.
Remember to use the custom function standardize_func to standardize each sentence in the vectorizer. You can do this by passing the function to the standardize parameter of TextVectorization. You are encouraged to take a look at the documentation to get a better understanding of how this works.
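A minimal sketch, assuming standardize_func and the VOCAB_SIZE and MAX_LENGTH globals are defined as above:

```python
import tensorflow as tf

def fit_vectorizer(train_sentences, standardize_func):
    vectorizer = tf.keras.layers.TextVectorization(
        standardize=standardize_func,        # apply the custom standardization to each sentence
        max_tokens=VOCAB_SIZE,               # cap the vocabulary size
        output_sequence_length=MAX_LENGTH,   # truncate the output sequences to MAX_LENGTH
    )
    vectorizer.adapt(train_sentences)        # learn the vocabulary from the training texts only
    return vectorizer
```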
Expected Output:
Exercise 3: fit_label_encoder
Remember that your categories are also text labels, so you need to encode the labels as well. For this, complete the fit_label_encoder function below.
A couple of things to note:
Use the tf.keras.layers.StringLookup layer to encode the labels, with the correct parameters so that you don't include any OOV tokens.
You should fit the encoder to all the labels to avoid the case of a particular label not being present in the validation set; since you are dealing with labels, there should never be an OOV label. For this, you can concatenate the two datasets using the concatenate method of tf.data.Dataset objects.
Use your function to create a trained instance of the encoder, and print the obtained vocabulary to check that there are no OOV tokens.
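A minimal sketch, assuming train_labels and validation_labels are tf.data.Dataset objects that yield the raw string labels:

```python
import tensorflow as tf

def fit_label_encoder(train_labels, validation_labels):
    # Fit on every label that appears in either split
    all_labels = train_labels.concatenate(validation_labels)

    # num_oov_indices=0 and mask_token=None keep OOV and mask entries out of the vocabulary
    label_encoder = tf.keras.layers.StringLookup(num_oov_indices=0, mask_token=None)
    label_encoder.adapt(all_labels)

    return label_encoder
```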
Expected Output:
Exercise 4: preprocess_dataset
Now that you have trained the vectorizer for the texts and the encoder for the labels, it's time to actually transform the dataset. For this, complete the preprocess_dataset function below and use it to set the dataset batch size to 32.
Hint: the map and batch methods of tf.data.Dataset are useful here.
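A minimal sketch of one possible approach, assuming vectorizer and label_encoder are the fitted layers from the previous exercises:

```python
def preprocess_dataset(dataset, vectorizer, label_encoder):
    # Vectorize each text and encode each label, then batch the result
    dataset = dataset.map(lambda text, label: (vectorizer(text), label_encoder(label)))
    return dataset.batch(32)  # batch size required by the assignment
```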
Expected Output:
Expected output:
Selecting the model for text classification
Exercise 5: create_model
Now that the data is ready to be fed into a Neural Network it is time for you to define the model that will classify each text as being part of a certain category.
For this, complete the create_model function below.
A couple of things to keep in mind:
The last layer should be a Dense layer with 5 units (since there are 5 categories) with a softmax activation.
You should also compile your model using an appropriate loss function and optimizer.
You can use any architecture you want, but keep in mind that this problem doesn't need many layers to be solved successfully. You don't need any layers besides Embedding, GlobalAveragePooling1D and Dense, but feel free to try out different architectures; a sketch of one possibility follows this list.
To pass this graded function, your model should reach at least 95% training accuracy and 90% validation accuracy in under 30 epochs.
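A minimal sketch of one architecture that can reach these thresholds, assuming the VOCAB_SIZE, EMBEDDING_DIM and MAX_LENGTH globals defined above (the intermediate Dense layer and its size are illustrative choices, not requirements):

```python
import tensorflow as tf

def create_model():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(MAX_LENGTH,)),
        tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(24, activation='relu'),
        tf.keras.layers.Dense(5, activation='softmax'),  # one unit per category
    ])

    model.compile(
        loss='sparse_categorical_crossentropy',  # labels are integer-encoded by StringLookup
        optimizer='adam',
        metrics=['accuracy'],
    )

    return model
```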
The next cell allows you to check the total and trainable parameter counts of your model and prompts a warning if they exceed those of a reference solution. This check serves the following three purposes, listed in order of priority:
Helps you prevent crashing the kernel during training.
Helps you avoid longer-than-necessary training times.
Provides a reasonable estimate of the size of your model. In general you will prefer smaller models, provided they still accomplish their goal successfully.
Notice that this check is only informative, and the reference may well be far below the model size actually needed to crash the kernel, so even if you exceed it you are probably fine. However, if the kernel crashes during training, or training takes a very long time and your model is larger than the reference, come back here and try to bring the number of parameters closer to the reference.
Expected output:
Once training has finished you can run the following cell to check the training and validation accuracy achieved at the end of each epoch.
Remember that to pass this assignment your model should achieve a training accuracy of at least 95% and a validation accuracy of at least 90%. If your model didn't achieve these thresholds, try training again with a different model architecture.
If your model passes the previously mentioned thresholds and you are happy with the results, be sure to save your notebook and submit it for grading. Also run the cell below to save the history of the model; this is needed for grading purposes.
Optional Exercise - Visualizing 3D Vectors
As you saw in the lecture, you can visualize the vectors associated with each word in the training set in a 3D space.
To do this, run the following cell, which will create the metadata.tsv and weights.tsv files. These are the files you will upload to TensorFlow's Embedding Projector.
By running the previous cell, these files are placed within your filesystem. To download them, right click on the file, which you will see on the left sidebar, and select the Download option.
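In case you are curious, a cell that produces these two files typically looks something like this sketch (assuming the Embedding layer is the first layer of model and vectorizer is the fitted TextVectorization layer; the provided cell may differ in its details):

```python
import io

# Extract the learned embedding matrix and the vocabulary it corresponds to
embedding_weights = model.layers[0].get_weights()[0]  # shape: (VOCAB_SIZE, EMBEDDING_DIM)
vocabulary = vectorizer.get_vocabulary()

with io.open('metadata.tsv', 'w', encoding='utf-8') as out_m, \
     io.open('weights.tsv', 'w', encoding='utf-8') as out_v:
    for word_num in range(1, len(vocabulary)):  # skip the padding token at index 0
        out_m.write(vocabulary[word_num] + "\n")
        out_v.write('\t'.join([str(x) for x in embedding_weights[word_num]]) + "\n")
```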
Congratulations on finishing this week's assignment!
You have successfully implemented a neural network capable of classifying text and also learned about embeddings and tokenization along the way!
Keep it up!