Keras implementation of a sequence to sequence model for time series prediction using an encoder-decoder architecture.
I created this post to share a flexible and reusable implementation of a sequence to sequence model using Keras.
I drew inspiration from two other posts:
"Sequence to Sequence (seq2seq) Recurrent Neural Network (RNN) for Time Series Prediction" by Guillaume Chevalier.
"A ten-minute introduction to sequence-to-sequence learning in Keras" by François Chollet.
I strongly recommend visiting Guillaume's repository for some great projects. François Chollet is the primary author and currently the maintainer of Keras. His post presents an implementation of a seq2seq model for machine translation.
Context
Time series prediction is a widespread problem. Applications range from price and weather forecasting to biological signal prediction.
This post describes how to implement a Recurrent Neural Network (RNN) encoder-decoder for time series prediction using Keras. I will focus on the practical aspects of the implementation, rather than the theory underlying neural networks, though I will try to share some of the reasoning behind the ideas I present. I assume a basic understanding of how RNNs work. If you need to catch up, a good place to start is the classic "Understanding LSTM Networks" by Christopher Olah.
What is an encoder-decoder and why is it useful for time series prediction?
The simplest RNN architecture for time series prediction is a "many to one" implementation.
A "many to one" recurrent neural net takes a sequence as input and returns one value. For a more detailed description of the differences between many to one, many to many and other RNN configurations, have a look at this Stack Exchange answer.
How can a "many to one" neural network be used for time series prediction? A "many to one" RNN can be seen as a function f that takes n steps of a time series as input and outputs a value. An RNN can, for instance, be trained to take the past 4 values of a time series and output a prediction of the next value. Let X be a time series and $X_t$ the value of that time series at time t, then
$$f(X_{t-3}, X_{t-2}, X_{t-1}, X_t) = X^{\text{predicted}}_{t+1}$$
The function f is composed of 4 RNN cells, one for each input time step.
If more than one prediction is needed (which is often the case), the predicted value can be fed back in as input and a new prediction made. Producing predictions for 3 steps in the future therefore takes 3 runs through the RNN model. The second run, for example, is:
$$f(X_{t-2}, X_{t-1}, X_t, X^{\text{predicted}}_{t+1}) = X^{\text{predicted}}_{t+2}$$
As you can see, the basis of the prediction model f is a single unit, the RNN cell, that takes as input $X_t$ and the state of the network (not shown here for clarity) and outputs a single value (which is discarded until all the input values have been fed to the cell). The function f described above is evaluated by running the cell of the network 4 times, each time with a new input and the state output from the previous step.
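The feedback loop described above can be sketched in a few lines of plain Python. This is only an illustration, not the notebook's code: `naive_model` is a hypothetical stand-in for a trained many to one RNN (it simply extrapolates linearly from the last two values), and `recursive_forecast` is a name introduced here.

```python
from typing import Callable, List, Sequence


def naive_model(window: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained many-to-one model f.

    It extrapolates linearly from the last two values of the window.
    """
    return window[-1] + (window[-1] - window[-2])


def recursive_forecast(
    f: Callable[[Sequence[float]], float],
    history: Sequence[float],
    window_size: int,
    num_steps: int,
) -> List[float]:
    """Predict num_steps values by feeding each prediction back as input."""
    window = list(history[-window_size:])
    predictions = []
    for _ in range(num_steps):
        next_value = f(window)
        predictions.append(next_value)
        # Slide the window: drop the oldest value, append the prediction.
        window = window[1:] + [next_value]
    return predictions


history = [1.0, 2.0, 3.0, 4.0]
print(recursive_forecast(naive_model, history, window_size=4, num_steps=3))
# [5.0, 6.0, 7.0]
```

Because every prediction after the first is computed partly from earlier predictions, any error made early on is carried into later steps.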
There are multiple reasons why this architecture might not be the best for time series prediction; compounding errors are one. However, in my opinion, there is a more important reason why it might not be the best method. In a time series prediction problem there are intuitively two distinct tasks. A human being predicting a time series would look at the known values of the past, and then use their understanding of what happened in the past to predict the future values. These two tasks require two distinct skillsets:
The ability to look at the past values and create an idea of the state of the system in the present.
The ability to use that understanding of the state of the system to predict how the system will evolve in the future.
By using a single RNN cell in our model, we are asking it to be capable of both memorising important events of the past and using these events to predict future values. This is the reasoning behind considering the encoder-decoder for time series prediction. Rather than having a single multi-tasking cell, the model uses two specialised cells: one for memorising important events of the past (the encoder) and one for converting those events into a prediction of the future (the decoder).
This idea of having two cells (an encoder and a decoder) is used in other machine learning tasks, the most prominent being perhaps machine translation. In machine translation, the idea behind having two separate tasks is even clearer. Let's say we're creating a system that translates French to English. First we need an element (the encoder) that is capable of understanding French. Its only task is to understand the input sentence and create a representation of what that sentence means. Then we need a second element (the decoder) that is capable of converting that representation of the meaning of the French sentence into an English sentence with the same meaning. Instead of having a super-intelligent cell that can both understand French and speak English, we can create two cells: the encoder understands French but cannot speak English, and the decoder knows how to speak English but cannot understand French. By working together, these specialised cells can outperform the super cell.
How to create an encoder-decoder for time series prediction in Keras?
Now that we have an explanation as to why an encoder-decoder might work, we are going to implement one.
We will be training our model on an artificially generated dataset. Our time series will be composed of the sum of 2 randomly generated sine waves (random amplitude, frequency, phase and offset). The idea to use such a dataset came from Guillaume Chevalier (linked at the beginning of the notebook), although I rewrote his functions to suit my needs. The dataset generators will be imported from utils.py at the root of this repository. This code is Python 3 compatible (some things won't work in Python 2).
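To give an idea of what such a dataset looks like, here is a rough, standard-library-only sketch of a generator for sums of random sine waves. It is not the code in utils.py: the function names `random_sine_sum` and `make_example`, the parameter ranges, and the input length of 15 are all assumptions made for illustration.

```python
import math
import random


def random_sine_sum(num_steps, num_signals=2, rng=None):
    """Return num_steps samples of a sum of randomly parameterised sine waves."""
    if rng is None:
        rng = random.Random()
    # One (amplitude, frequency, phase, offset) tuple per sine wave.
    # The ranges below are arbitrary choices for this sketch.
    params = [
        (
            rng.uniform(0.5, 2.0),          # amplitude
            rng.uniform(0.1, 1.0),          # angular frequency (radians per step)
            rng.uniform(0.0, 2 * math.pi),  # phase
            rng.uniform(-1.0, 1.0),         # offset
        )
        for _ in range(num_signals)
    ]
    return [
        sum(a * math.sin(w * t + p) + o for a, w, p, o in params)
        for t in range(num_steps)
    ]


def make_example(input_len=15, target_len=15, rng=None):
    """Split one generated series into an encoder input and a target."""
    series = random_sine_sum(input_len + target_len, rng=rng)
    return series[:input_len], series[input_len:]


encoder_input, target = make_example(rng=random.Random(0))
print(len(encoder_input), len(target))  # 15 15
```

The first part of each series is what the encoder sees; the second part is what the decoder is asked to predict.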
Import modules/packages
Hyperparameters and model configuration
This model uses a Gated Recurrent Unit (GRU). Other units (LSTM) would also work with a few modifications to the code.
Create model
Create encoder
The encoder is first created by instantiating a graph, which is a description of the operations applied to the tensors (that will later hold the data). This is common among many neural network frameworks.
Create decoder
The decoder is created similarly to the encoder.
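As a rough idea of what building the two graphs can look like, here is a minimal single-layer sketch. It is not the notebook's code: it assumes TensorFlow 2.x with its bundled Keras, and the variable names, the hidden size of 32, and the one-dimensional input and output are all assumptions made for illustration.

```python
from tensorflow import keras

num_input_features = 1   # assumed: a one-dimensional input signal
num_output_features = 1  # assumed: a one-dimensional output signal
hidden_units = 32        # arbitrary size chosen for this sketch

# Encoder: reads the input series and keeps only its final state.
encoder_inputs = keras.layers.Input(shape=(None, num_input_features))
encoder_gru = keras.layers.GRU(hidden_units, return_state=True)
_, encoder_state = encoder_gru(encoder_inputs)

# Decoder: starts from the encoder state and emits one value per time step.
decoder_inputs = keras.layers.Input(shape=(None, num_output_features))
decoder_gru = keras.layers.GRU(
    hidden_units, return_sequences=True, return_state=True
)
decoder_hidden, _ = decoder_gru(decoder_inputs, initial_state=encoder_state)
decoder_dense = keras.layers.Dense(num_output_features, activation="linear")
decoder_outputs = decoder_dense(decoder_hidden)
```

At this point nothing has been computed yet: `encoder_inputs`, `decoder_inputs` and `decoder_outputs` are symbolic tensors that only describe the operations to apply. Passing the two inputs and the output to `keras.models.Model` turns this description into a trainable model.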
Create model and compile
A notable detail here is the inputs to the model. The training model has two inputs: encoder_inputs and decoder_inputs. What encoder_inputs should be is clear: it should hold the input series. But what about the decoder inputs?
In machine translation applications (see "A ten-minute introduction to sequence-to-sequence learning in Keras"), a technique called teacher forcing is used. In teacher forcing, the input to the decoder during training is the target sequence shifted by one time step. This supposedly helps the decoder learn and is an effective method for machine translation. I tested teacher forcing for sequence prediction and the results were bad. I am not entirely sure why this is the case. My intuition is that, unlike in machine translation, feeding the decoder the correct sequence shifted by one makes the model "lazy", because it only has to look at the value input at the previous step and apply a small modification to it. In other words, the gradients of the truncated backpropagation beyond step n-1 will be very small and the model will develop a short memory. However, if the input to the decoder is 0, the model is forced to really memorise the values that are fed to the encoder, since it has nothing else to work with. In some sense, teacher forcing might artificially induce vanishing gradients.
I suggest you look into the random_sine function of the utils module of this repository. You will see that the decoder input is simply set to zero. You may be asking yourself why the decoder has an input at all if it is set to zero. The reason is simple: Keras RNNs must take an input value.
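The difference between the two choices of decoder input is easy to see on a toy target sequence. The helper names below are hypothetical, and the start value of 0.0 used to pad the shifted sequence is an assumption of this sketch.

```python
def teacher_forcing_inputs(target, start_value=0.0):
    """Decoder input under teacher forcing: the target shifted right by one step."""
    return [start_value] + list(target[:-1])


def zero_inputs(target):
    """Decoder input used in this post: all zeros, same length as the target."""
    return [0.0] * len(target)


target = [0.3, 0.5, 0.4, 0.1]
print(teacher_forcing_inputs(target))  # [0.0, 0.3, 0.5, 0.4]
print(zero_inputs(target))             # [0.0, 0.0, 0.0, 0.0]
```

With teacher forcing, the decoder input at each step is the true value from the step before, so the decoder can lean on it. With zeros, everything the decoder knows about the signal has to come from the encoder state.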
Fit model to data
I like using fit_generator in Keras (in recent Keras versions, fit accepts generators directly and fit_generator is deprecated). In this case it's not really necessary, since my training examples easily fit into memory. When the training data doesn't fit into memory, a generator is definitely the way to go. A standard Python generator is usually fine for fit_generator; however, Keras provides a nice class, keras.utils.Sequence, that you can inherit from to create your own generator. Inheriting from it guarantees that each element of the generator is used only once per epoch when multiprocessing is enabled (which isn't guaranteed with a standard generator). A simple example of using data generators in Keras is "A detailed example of how to use data generators with Keras" by Shervine Amidi.
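A keras.utils.Sequence subclass has to implement two methods: `__len__` (the number of batches per epoch) and `__getitem__` (batch number `index`). The sketch below follows that protocol but is deliberately written without Keras so that it runs anywhere. The class name `SineBatchSequence`, the single-sine data, and the nested-list format are assumptions; in a real project the class would inherit from keras.utils.Sequence and return NumPy arrays.

```python
import math
import random


class SineBatchSequence:
    """Indexable batch provider following the keras.utils.Sequence protocol."""

    def __init__(self, num_batches, batch_size, input_len, target_len, seed=0):
        self.num_batches = num_batches
        self.batch_size = batch_size
        self.input_len = input_len
        self.target_len = target_len
        self.seed = seed

    def __len__(self):
        # Number of batches per epoch.
        return self.num_batches

    def __getitem__(self, index):
        if not 0 <= index < self.num_batches:
            raise IndexError(index)
        # Seeding by index makes a given batch identical each time it is requested.
        rng = random.Random(self.seed + index)
        encoder_inputs, decoder_inputs, targets = [], [], []
        total_len = self.input_len + self.target_len
        for _ in range(self.batch_size):
            freq = rng.uniform(0.1, 1.0)
            phase = rng.uniform(0.0, 2 * math.pi)
            series = [math.sin(freq * t + phase) for t in range(total_len)]
            # Each time step is a list of length 1 (one feature per step).
            encoder_inputs.append([[x] for x in series[:self.input_len]])
            decoder_inputs.append([[0.0] for _ in range(self.target_len)])
            targets.append([[x] for x in series[self.input_len:]])
        return [encoder_inputs, decoder_inputs], targets


batches = SineBatchSequence(num_batches=3, batch_size=4, input_len=15, target_len=15)
(enc, dec), tgt = batches[0]
print(len(batches), len(enc), len(enc[0]), len(tgt[0]))  # 3 4 15 15
```

Because batches are addressed by index rather than pulled from a shared iterator, several workers can each be handed different indices without any batch being produced twice.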
It is now possible to use this model to make predictions:
A limitation of using this model to make predictions is that we can only predict a sequence of the same length as the target sequences used in training. This can be a problem if we want to predict fewer or more steps than that. In the next section I will show how to create "prediction" models that allow us to predict sequences of arbitrary length.
Create "prediction" models
When using the encoder-decoder to predict a sequence of arbitrary length, the encoder first encodes the entire input sequence. The state of the encoder is then fed to the decoder which then produces the output sequence sequentially. Although a new model is being created with the keras.models.Model class, the input and output tensors of the model are the same as those used during training, hence the weights of the layers applied to the tensors are preserved.
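The control flow of this two-stage prediction can be sketched independently of Keras. In the snippet below, `toy_encode` and `toy_decode_step` are hypothetical stand-ins for the encoder and decoder prediction models (they do simple linear extrapolation rather than anything learned), and feeding the decoder a constant 0.0 at each step is an assumption that mirrors the zero decoder input used in training.

```python
def predict_sequence(encode, decode_step, input_sequence, num_steps):
    """Encode the whole input once, then run the decoder one step at a time."""
    state = encode(input_sequence)
    outputs = []
    for _ in range(num_steps):
        # The decoder receives a zero input and the current state,
        # and returns a prediction together with the updated state.
        output, state = decode_step(0.0, state)
        outputs.append(output)
    return outputs


def toy_encode(sequence):
    """Stand-in encoder: summarise the input as (last value, last slope)."""
    return (sequence[-1], sequence[-1] - sequence[-2])


def toy_decode_step(decoder_input, state):
    """Stand-in decoder step: continue the line described by the state."""
    last_value, slope = state
    next_value = last_value + slope
    return next_value, (next_value, slope)


print(predict_sequence(toy_encode, toy_decode_step, [1.0, 2.0, 3.0], num_steps=5))
# [4.0, 5.0, 6.0, 7.0, 8.0]
```

Since the loop only ever passes the state from one step to the next, `num_steps` can be any length, which is what removes the fixed-length limitation of the training model.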
As you will see, creating the prediction models also gives the ability to inspect the state of the model at different points throughout the prediction process. We could study how the encoder creates a representation of the input data. For instance, how does the model represent the offset? Or the frequency? Does it decompose the signal into its constituent sine waves and represent them as different dimensions of the state vector? These are very interesting questions for another time.
Next steps & Discussion
There are many things that could be done to either extend or improve this model. Here are a few ideas.
There's no reason why the encoder and decoder should have the same complexity or the same number of layers. As well as doing a simple hyperparameter search, it could be interesting to implement a model with different encoder and decoder sizes. To do this, one would have to add a dense layer after retrieving the states of the encoder to transform them into the correct size.
Encapsulate the encoder-decoder by creating a class with a fit/predict interface. This is actually something I have done. It's extremely useful, as it allows seq2seq models to be instantiated as easily as one would instantiate a scikit-learn model.
Add the ability to add context vectors to the state output by the encoder. The encoder is able to produce an input vector for the decoder based on the time series. It is possible to add constant features to the model by duplicating them at each input timestep. However, adding the ability to extend the encoder output state with a constant vector that represents context might also be a good idea (for example, if you're predicting the evolution of housing prices, you might want to tell your model which geographical area you are in, since prices might not evolve in the same manner depending on location). This is not the attention mechanism often used in NLP, which also produces what is called a context vector (one that is updated at each step of the decoder). But since adding attention to NLP seq2seq applications has hugely improved the state of the art, it might also be worth looking into attention for sequence prediction.
As described above, study how the encoder creates a representation of the input sequence by looking at the state vector.
It appears that our model struggles on signals that have a low frequency. One explanation might be that the model must "see" at least a certain number of periods to determine the frequency of the signal. An interesting question to answer might be: how many periods of the constituent signals are required for the model to be accurate?
Although our model was only trained on output sequences of length 15, it appears to be able to predict beyond that limit. This is something we can exploit with the prediction models.
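Two of the ideas above, resizing the encoder state with a dense layer and extending it with a context vector, can be combined in one small sketch. This is plain Python with randomly initialised weights, not a trained Keras layer. The function names, the sizes, the example values, and the choice of a tanh activation (picked so that the result stays in the same (-1, 1) range as a GRU state) are all assumptions.

```python
import math
import random


def dense(vector, weights, biases):
    """Apply a dense layer with tanh activation: tanh(W @ vector + b)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(weights, biases)
    ]


def bridge_state(encoder_state, context, weights, biases):
    """Concatenate encoder state and context, then project to the decoder size."""
    return dense(list(encoder_state) + list(context), weights, biases)


rng = random.Random(0)
encoder_size, context_size, decoder_size = 4, 2, 3
# Randomly initialised parameters; in a real model these would be trained.
weights = [
    [rng.uniform(-1, 1) for _ in range(encoder_size + context_size)]
    for _ in range(decoder_size)
]
biases = [0.0] * decoder_size

encoder_state = [0.2, -0.5, 0.9, 0.0]  # made-up final encoder state
context = [1.0, 0.0]                   # made-up one-hot region indicator
decoder_state = bridge_state(encoder_state, context, weights, biases)
print(len(decoder_state))  # 3
```

The projected vector would then be used as the decoder's initial state in place of the raw encoder state.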
Thanks for reading 😃
I welcome questions or comments; you can find me on LinkedIn.
Author: Luke Tonin LinkedIn: https://fr.linkedin.com/in/luketonin Github: https://github.com/LukeTonin/