Sequence processing with convnets
This notebook contains the code samples found in Chapter 6, Section 4 of Deep Learning with Python. Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.
Implementing a 1D convnet
In Keras, you would use a 1D convnet via the `Conv1D` layer, which has a very similar interface to `Conv2D`. It takes as input 3D tensors with shape `(samples, time, features)` and also returns similarly shaped 3D tensors. The convolution window is a 1D window on the temporal axis (axis 1 in the input tensor).
Let's build a simple 2-layer 1D convnet and apply it to the IMDB sentiment classification task that you are already familiar with.
As a reminder, this is the code for obtaining and preprocessing the data:
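The data-loading cell itself is missing from this extract. A minimal sketch of the usual IMDB preprocessing, where the values of `max_features` and `max_len` are assumptions consistent with the earlier IMDB examples:

```python
# Sketch of the IMDB loading/preprocessing step; max_features and max_len
# are assumed values, matching the setup used in earlier sections.
from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 10000  # number of words to consider as features
max_len = 500         # cut reviews after this many words

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
```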
1D convnets are structured in the same way as their 2D counterparts that you used in Chapter 5: they consist of a stack of `Conv1D` and `MaxPooling1D` layers, eventually ending in either a global pooling layer or a `Flatten` layer that turns the 3D outputs into 2D outputs, allowing you to add one or more `Dense` layers to the model, for classification or regression.
One difference, though, is that we can afford to use larger convolution windows with 1D convnets. With a 2D convolution layer, a 3 × 3 convolution window contains 3 × 3 = 9 feature vectors; with a 1D convolution layer, a convolution window of size 3 contains only 3 feature vectors. We can thus easily afford 1D convolution windows of size 7 or 9.
This is our example 1D convnet for the IMDB dataset:
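The model cell is not reproduced here. A sketch of a two-`Conv1D`-layer model of this kind, where the filter counts, the window size of 7 and the training settings are our assumptions:

```python
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
# Embed each of the max_features words into a 128-dimensional vector
model.add(layers.Embedding(max_features, 128, input_length=max_len))
# Two Conv1D layers with a window of size 7, separated by max-pooling
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
# Collapse the temporal axis before the final binary classifier
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1, activation='sigmoid'))
model.summary()

model.compile(optimizer=RMSprop(lr=1e-4),
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)
```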
Here are our training and validation results: validation accuracy is somewhat lower than that of the LSTM we used two sections ago, but runtime is faster, both on CPU and GPU (although the exact speedup will vary greatly depending on your exact configuration). At that point, we could retrain this model for the right number of epochs (8) and run it on the test set. This is a convincing demonstration that a 1D convnet can offer a fast, cheap alternative to a recurrent network on a word-level sentiment classification task.
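The plotting cells are likewise missing from this extract. A small sketch for visualizing the curves stored in the `history` object returned by `fit`; the `plot_history` helper is ours, not a Keras function:

```python
import matplotlib.pyplot as plt

def plot_history(history, metric='acc'):
    """Plot training vs. validation curves for a given metric."""
    plt.figure()
    epochs = range(1, len(history.history[metric]) + 1)
    plt.plot(epochs, history.history[metric], 'bo', label='Training ' + metric)
    plt.plot(epochs, history.history['val_' + metric], 'b',
             label='Validation ' + metric)
    plt.title('Training and validation ' + metric)
    plt.xlabel('Epochs')
    plt.legend()
    plt.show()

plot_history(history, 'acc')   # accuracy curves
plot_history(history, 'loss')  # loss curves
```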
Combining CNNs and RNNs to process long sequences
Because 1D convnets process input patches independently, they are not sensitive to the order of the timesteps (beyond a local scale, the size of the convolution windows), unlike RNNs. Of course, in order to recognize longer-term patterns, one could stack many convolution and pooling layers, resulting in upper layers that would "see" long chunks of the original inputs -- but that's still a fairly weak way to induce order sensitivity. One way to demonstrate this weakness is to try 1D convnets on the temperature forecasting problem from the previous section, where order sensitivity is key to producing good predictions. Let's see:
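The corresponding code cell is not included in this extract. A sketch of such a pure-convnet model on the temperature data, assuming the `float_data` array, the `train_gen`/`val_gen` generators and `val_steps` defined in the previous section; layer sizes and training settings are our assumptions:

```python
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
# `None` timesteps: the model accepts sequences of any length
model.add(layers.Conv1D(32, 5, activation='relu',
                        input_shape=(None, float_data.shape[-1])))
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(32, 5, activation='relu'))
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(32, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1))  # regression: predict a single temperature value

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)
```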
Here are our training and validation Mean Absolute Errors:
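Again, the plot cell is not reproduced; reusing the hypothetical `plot_history` helper from the IMDB example above, and noting that the loss of this model is the MAE itself:

```python
# 'loss' and 'val_loss' are the training and validation MAE,
# since the model was compiled with loss='mae'
plot_history(history, 'loss')
```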
The validation MAE stays in the low 0.40s: we cannot even beat our common-sense baseline using the small convnet. Again, this is because our convnet looks for patterns anywhere in the input timeseries and has no knowledge of the temporal position of a pattern it sees (e.g. towards the beginning, towards the end, etc.). Since more recent datapoints should be interpreted differently from older datapoints in the case of this specific forecasting problem, the convnet fails to produce meaningful results here. This limitation of convnets was not an issue on IMDB, because patterns of keywords associated with a positive or a negative sentiment are informative independently of where they are found in the input sentences.
One strategy to combine the speed and lightness of convnets with the order-sensitivity of RNNs is to use a 1D convnet as a preprocessing step before an RNN. This is especially beneficial when dealing with sequences that are so long that they couldn't realistically be processed with RNNs, e.g. sequences with thousands of steps. The convnet will turn the long input sequence into much shorter (downsampled) sequences of higher-level features. This sequence of extracted features then becomes the input to the RNN part of the network.
This technique is not seen very often in research papers and practical applications, possibly because it is not very well known. It is very effective and ought to be more common. Let's try it out on the temperature forecasting dataset. Because this strategy allows us to manipulate much longer sequences, we could either look at data from further back (by increasing the `lookback` parameter of the data generator) or look at higher-resolution timeseries (by decreasing the `step` parameter of the generator). Here, we will choose (somewhat arbitrarily) to use a `step` half as large, resulting in timeseries twice as long, where the weather data is sampled at a rate of one point per 30 minutes.
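The cell that rebuilds the data generators at this higher resolution is missing here. A sketch, assuming the `generator` function, the `float_data` array and the split boundaries used in the previous section; all parameter values other than `step` are carried over as assumptions:

```python
step = 3         # one data point every 30 minutes (previously 6, i.e. hourly)
lookback = 1440  # look back 10 days, as before
delay = 144      # predict the temperature 24 hours in the future
batch_size = 128

train_gen = generator(float_data, lookback=lookback, delay=delay,
                      min_index=0, max_index=200000,
                      shuffle=True, step=step, batch_size=batch_size)
val_gen = generator(float_data, lookback=lookback, delay=delay,
                    min_index=200001, max_index=300000,
                    step=step, batch_size=batch_size)
test_gen = generator(float_data, lookback=lookback, delay=delay,
                     min_index=300001, max_index=None,
                     step=step, batch_size=batch_size)

# How many batches to draw from val_gen / test_gen to cover each split
val_steps = (300000 - 200001 - lookback) // batch_size
test_steps = (len(float_data) - 300001 - lookback) // batch_size
```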
This is our model, starting with two `Conv1D` layers followed by a `GRU` layer:
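The model cell is not reproduced above. A sketch, where the layer sizes and dropout values are our assumptions:

```python
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
# The convolutional base downsamples the long, high-resolution sequence...
model.add(layers.Conv1D(32, 5, activation='relu',
                        input_shape=(None, float_data.shape[-1])))
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(32, 5, activation='relu'))
# ...and the GRU then processes the shorter sequence of extracted features
model.add(layers.GRU(32, dropout=0.1, recurrent_dropout=0.5))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)
```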
Judging from the validation loss, this setup is not quite as good as the regularized GRU alone, but it's significantly faster. It looks at twice as much data, which in this case doesn't appear to be hugely helpful, but may be important for other datasets.
Wrapping up
Here's what you should take away from this section:
In the same way that 2D convnets perform well for processing visual patterns in 2D space, 1D convnets perform well for processing temporal patterns. They offer a faster alternative to RNNs on some problems, in particular NLP tasks.
Typically, 1D convnets are structured much like their 2D equivalents from the world of computer vision: they consist of stacks of `Conv1D` and `MaxPooling1D` layers, eventually ending in a global pooling operation or flattening operation.
Because RNNs are extremely expensive for processing very long sequences, while 1D convnets are cheap, it can be a good idea to use a 1D convnet as a preprocessing step before an RNN, shortening the sequence and extracting useful representations for the RNN to process.
One useful and important concept that we will not cover in these pages is that of 1D convolution with dilated kernels.