Seq2Seq is a machine learning architecture based on the encoder-decoder paradigm. It is widely used for tasks such as translation, Q&A and other cases where it is desirable to produce a sequence from another. The main idea is to have one model, for example an RNN, which can create a good representation of the input sequence. We will refer to this model as the ‘encoder’. Using this representation, another model, the ‘decoder’, produces the expected output sequence.
Introducing the SOS and EOS tokens
One issue to confront when choosing a seq2seq approach is that both input and output sequences can vary in length. Variation in the input sequence size can be handled by padding the data to a fixed input sequence length. In this case, the representation passed from the encoder to the decoder should not be the last output from the encoder, but the one corresponding to the actual end of the sequence.
Another possible variation is between the input and output sequence sizes. To handle that, we can introduce the SOS (Start Of Sequence) and EOS (End Of Sequence) tokens. Adding the EOS token at the end of each sequence provides a consistent signal, which helps the system learn when to stop generating the new sequence. We can then let the decoder produce as many tokens as it wants until it emits the EOS token to signal the end of the output sequence.
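As a rough illustration of the two ideas above, the following framework-free sketch appends an EOS token to each sentence and pads the batch while remembering the true lengths. The token spellings (`<sos>`, `<eos>`, `<pad>`) and the helper names are assumptions made for this example.

```python
SOS, EOS, PAD = "<sos>", "<eos>", "<pad>"  # hypothetical token spellings


def add_eos(tokens):
    """Append EOS so every sequence ends with the same signal."""
    return list(tokens) + [EOS]


def pad_batch(sequences):
    """Pad every sequence to the longest one and keep the true lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [PAD] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


sentences = [["my", "name", "is", "Brian"], ["hello"]]
padded, lengths = pad_batch([add_eos(s) for s in sentences])
```

The `lengths` list is what later lets the encoder pick the state at the real end of each sequence instead of the state reached after the padding.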
Let’s use the following figures to better understand how it works.
The ENCODER loops over the sequence. At each step it takes the encoded representation of a token (any representation you want, a word embedding for example) and passes it through the recurrent cell (an LSTM cell in the figure), which also receives the state representing the sentence so far. When EOS is reached, you collect the final state of your cell (cell_state, hidden_state).
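To make the loop concrete, here is a minimal sketch of an encoder built on a toy LSTM cell. It assumes scalar inputs and a scalar state so that the gate arithmetic stays readable; the weight names and values are placeholders, not trained parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, hidden_state, cell_state, w):
    """One LSTM step with a scalar input and scalar state (toy sizes)."""
    i = sigmoid(w["wi"] * x + w["ui"] * hidden_state + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * hidden_state + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * hidden_state + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * hidden_state + w["bg"])  # candidate value
    cell_state = f * cell_state + i * g
    hidden_state = o * math.tanh(cell_state)
    return hidden_state, cell_state


def encode(inputs, w):
    """Loop over the sequence and return the final (cell_state, hidden_state)."""
    hidden_state, cell_state = 0.0, 0.0
    for x in inputs:
        hidden_state, cell_state = lstm_step(x, hidden_state, cell_state, w)
    return cell_state, hidden_state


names = ("wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg")
weights = {name: 0.5 for name in names}  # placeholder values
cell_state, hidden_state = encode([0.1, -0.3, 0.8], weights)
```

In a real model the inputs would be embedding vectors and the weights would be matrices, but the flow of state from one step to the next is the same.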
The DECODER shares the same basic architecture, with one added layer (here a perceptron layer) to predict the new token.
But how can the decoder create a new sentence?
The encoder provides a representation of the whole sequence. One option is to initialize the decoder state with this representation, then just send an SOS token to start the generation of the new sequence.
To give the decoder as much information as possible about its progress, we can pass the decoded token as the new input when generating the next one. We repeat the process until the decoder emits the EOS signal. One potential problem with this approach is that if the decoded token is not the right one, the chance of predicting the next token correctly decreases significantly and the error accumulates. One surprising way to confront this problem during training, commonly known as teacher forcing, is to provide the expected output as input to the decoder instead of the token it actually predicted. During inference, this approach cannot be used, as the correct output sequence is unknown. Instead, we can consider a solution called beam search.
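The difference between the two decoding modes can be sketched without any framework. In the example below, `TOY_STEP` is a hypothetical stand-in for a trained decoder step: it maps the previous token to the next one and contains one deliberate mistake. A real decoder would also carry the recurrent state initialized from the encoder, which is left out here for brevity.

```python
SOS, EOS = "<sos>", "<eos>"

# Hypothetical decoder step with one deliberate mistake ("name" -> "was").
TOY_STEP = {
    SOS: "my", "my": "name", "name": "was",
    "was": EOS, "is": "Brian", "Brian": EOS,
}


def decode(expected=None, max_steps=10):
    """Feed the expected token when `expected` is given, else the prediction."""
    outputs, previous = [], SOS
    for t in range(max_steps):
        predicted = TOY_STEP[previous]
        outputs.append(predicted)
        if expected is not None:
            if t + 1 == len(expected):
                break
            previous = expected[t]  # training: feed the correct token
        else:
            if predicted == EOS:
                break
            previous = predicted    # inference: feed our own prediction
    return outputs


target = ["my", "name", "is", "Brian", EOS]
teacher_forced = decode(expected=target)  # ['my', 'name', 'was', 'Brian', '<eos>']
free_running = decode()                   # ['my', 'name', 'was', '<eos>']
```

With the expected tokens fed back, the single mistake stays isolated. When the decoder consumes its own predictions, the same mistake derails the rest of the sentence.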
Unlike greedy decoding, which keeps only the most probable word as its prediction at each step, beam search keeps the n most probable candidates. At the next step, each of these candidates is extended with its own predictions, the resulting sequences are pruned back to the n most probable ones, and so on.
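Here is a small, self-contained sketch of beam search over a toy model. The `TOY_PROBS` table is an assumption standing in for the decoder's softmax output, and scores are summed log-probabilities. A beam width of 1 reduces to greedy decoding.

```python
import math

EOS = "<eos>"

# Hypothetical next-token distributions keyed by the previous token.
TOY_PROBS = {
    "<sos>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, EOS: 0.1},
    "x": {EOS: 1.0}, "y": {EOS: 1.0}, "z": {EOS: 1.0},
}


def beam_search(beam_width, max_steps=5):
    beams = [(0.0, ["<sos>"])]  # (log-probability, tokens so far)
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == EOS:
                candidates.append((score, tokens))  # finished hypothesis
                continue
            for token, p in TOY_PROBS[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [token]))
        # Keep only the `beam_width` best hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == EOS for _, tokens in beams):
            break
    return beams[0][1][1:]  # best hypothesis, without the SOS token


greedy = beam_search(beam_width=1)
wider = beam_search(beam_width=2)
```

In this toy setup the greedy search commits to "a" because it is the most probable first token, while a beam of width 2 finds the sequence starting with "b", whose overall probability (0.36) is higher than that of any sequence starting with "a" (0.30).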
Example using TensorFlow in eager mode
Let’s try to understand these concepts better through an example. In the following we will see how to implement an encoder-decoder pair using TensorFlow in eager mode. Here is the associated documentation.
For this presentation of a Seq2Seq with TensorFlow in eager execution, I assume you have the following data:
- Input data X, a list of encoded sentences (using word embeddings, for example), padded to the same sequence length. The shape of the dataset is [num_samples, time_steps, embedding_dimension]
- Output data Y, the expected output for each sample
- SL, a list containing the length of each sequence
As we will ask the decoder to find the correct token at each timestep we must provide a mapping of the vocabulary (since the last layer has to predict a word within the given vocabulary). For this purpose we will create:
- w2i, a dictionary mapping words into indexes
- i2w, a dictionary mapping indexes into words
- i2e, a dictionary mapping indexes to encoded words, here embeddings representation
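A minimal sketch of how these three dictionaries could be built from a tokenized corpus follows. The special token spellings are assumptions, and random vectors stand in for real embeddings.

```python
import random

sentences = [["my", "name", "is", "Brian"], ["my", "name", "is", "Alice"]]
special_tokens = ["<pad>", "<sos>", "<eos>"]  # hypothetical spellings

# Special tokens first, then the corpus words in a deterministic order.
vocabulary = special_tokens + sorted({word for s in sentences for word in s})

w2i = {word: index for index, word in enumerate(vocabulary)}
i2w = {index: word for word, index in w2i.items()}

embedding_dimension = 4
rng = random.Random(0)
# Random vectors stand in for real pre-trained embeddings.
i2e = {
    index: [rng.uniform(-1, 1) for _ in range(embedding_dimension)]
    for index in i2w
}
```

Keeping `w2i` and `i2w` as exact inverses of each other makes it easy to turn the index predicted by the last layer back into a word.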
First, you will need to import TensorFlow and the eager module.
Eager mode is easy to work with and makes TensorFlow much more intuitive, in my opinion. In order to work in eager mode, all you have to do is add this line to your script
For the encoder we will need a recurrent cell. In my example I will use an LSTM cell.
The encoder is straightforward to understand: we just give the encoded words to the network and store the output and the cell state at each step.
For the decoder, we will only use the last tuple (cell_state, hidden_state) from the encoder, which should contain temporal information about the whole sentence.
To retrieve only the last encoded representation we must have information about each sequence length since we have padded the inputs.
We have to modify the forward function like so:
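This is not the TensorFlow listing itself, only a framework-free sketch of the idea, assuming a toy recurrent cell with a scalar state in place of the LSTM. The forward pass stores the state at every step and then, for each sample, returns the one at index `length - 1` rather than the one reached after the padding.

```python
import math


def encode_batch(batch, lengths):
    """Return, for each padded sample, the state at its true last step."""
    final_states = []
    for inputs, length in zip(batch, lengths):
        state, states = 0.0, []
        for x in inputs:
            state = math.tanh(0.5 * x + 0.5 * state)  # toy stand-in for the LSTM cell
            states.append(state)
        final_states.append(states[length - 1])  # ignore steps that are padding
    return final_states


X = [[0.2, 0.7, 0.0, 0.0], [0.1, 0.4, 0.9, 0.3]]  # first sample is padded with zeros
SL = [2, 4]
final_states = encode_batch(X, SL)
```

For the first sample, the returned state is the same as if the sentence had been encoded without any padding, which is the property we are after.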
People working with recurrent networks have observed that a layer often performs better when it sees the sentence in reverse order. A bidirectional RNN, which reads the sequence in both directions, can improve performance further.
Let’s allow our encoder to see the sentence in a reversed manner:
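One detail to watch with padded data is that only the real tokens should be reversed, so that the padding stays at the end. A minimal sketch, with a hypothetical helper name and pad token:

```python
def reverse_sequence(inputs, length):
    """Reverse only the first `length` steps and leave the padding at the end."""
    return inputs[:length][::-1] + inputs[length:]


padded = ["my", "name", "is", "Brian", "<pad>", "<pad>"]
reversed_padded = reverse_sequence(padded, length=4)
```

Reversing the whole padded list instead would put the padding in front of the sentence and shift the position of the true last step.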
To finish our encoder, let’s add a save and a load function:
Now, let’s attack the decoder’s implementation.
As we mentioned before, to improve training time and increase performance, the forward function will have 2 modes (training and inference mode).
The final prediction for each token in the sequence will be done by a dense layer with number of units equal to the vocabulary size.
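The prediction step can be sketched in plain Python as a linear layer with one output per vocabulary entry, followed by a softmax. All weights and the three-word vocabulary below are made-up values for illustration.

```python
import math


def dense(hidden, weights, biases):
    """Linear layer: one output (logit) per vocabulary entry."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(logits):
    """Turn logits into a probability distribution."""
    largest = max(logits)  # subtracted for numerical stability
    exps = [math.exp(l - largest) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


i2w = {0: "<eos>", 1: "my", 2: "name"}            # toy vocabulary of size 3
hidden = [0.5, -1.0]                              # decoder output at one time step
weights = [[0.1, 0.2], [1.0, -1.0], [-0.5, 0.3]]  # shape [vocabulary_size, hidden_size]
biases = [0.0, 0.0, 0.0]

probabilities = softmax(dense(hidden, weights, biases))
predicted_word = i2w[probabilities.index(max(probabilities))]
```

Taking the index of the largest probability corresponds to greedy decoding; a beam search would instead keep several of the highest-probability indexes.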
Now to train our encoder-decoder, we just have to initialize an optimizer from TensorFlow and call its minimize function. This function requires a callable which returns a float representing the current loss. This result is then used to train the network.
The get_loss function receives the final cell state and hidden state from the encoder. It calls the decoder to get the predictions and then computes the loss.
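The pattern of handing the optimizer a callable that returns the loss can be illustrated without TensorFlow. The `minimize` function below is a hypothetical stand-in that performs one gradient-descent step using finite differences; a real optimizer relies on automatic differentiation instead, and the quadratic `get_loss` is only a placeholder for the actual sequence loss.

```python
def minimize(loss_fn, params, learning_rate=0.1, eps=1e-6):
    """One gradient-descent step; gradients come from central finite differences."""
    grads = []
    for k in range(len(params)):
        original = params[k]
        params[k] = original + eps
        loss_plus = loss_fn()
        params[k] = original - eps
        loss_minus = loss_fn()
        params[k] = original
        grads.append((loss_plus - loss_minus) / (2 * eps))
    for k, grad in enumerate(grads):
        params[k] -= learning_rate * grad


params = [3.0, -2.0]  # stand-in for the trainable weights


def get_loss():
    """Callable returning the current loss as a float (placeholder loss)."""
    return sum(p * p for p in params)


initial_loss = get_loss()
for _ in range(50):
    minimize(get_loss, params)
final_loss = get_loss()
```

The important point is that the loss is recomputed from the current weights each time the callable is invoked, which is what lets the optimizer relate changes in the weights to changes in the loss.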
Let’s define our cost function. As we have padded the data, we must be careful to compute a value that reflects only what we want the network to learn, so the padded time steps should not contribute to it.
In this case, we use the cross entropy as the loss function:
When you use the cross entropy loss, do not forget that your output must be interpretable as a probability distribution (i.e. the values for the classes must sum to 1). If it is not, and your network is free to output whatever it wants, the easiest way for it to reduce the loss is a degenerate output that has nothing to do with the task. The trainable weights of your network would then move toward this output, and the network would learn nothing about your task. That is why we use a softmax after our dense layer with linear activation.
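As an illustration of a loss that ignores padding, here is a minimal sketch of a masked cross entropy in plain Python. The function name and the nested-list layout are assumptions for this example; the probabilities are assumed to come from a softmax, so each row sums to 1.

```python
import math


def masked_cross_entropy(probabilities, targets, lengths):
    """Mean cross entropy over real time steps only; padded steps are skipped.

    probabilities: [num_samples][time_steps][vocabulary_size], each row sums to 1
    targets:       [num_samples][time_steps] of vocabulary indexes
    lengths:       true length of each target sequence
    """
    total, count = 0.0, 0
    for sample_probs, sample_targets, length in zip(probabilities, targets, lengths):
        for t in range(length):
            total += -math.log(sample_probs[t][sample_targets[t]])
            count += 1
    return total / count


probabilities = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.5, 0.25, 0.25], [0.9, 0.05, 0.05]],  # second step of this sample is padding
]
targets = [[0, 1], [0, 2]]
lengths = [2, 1]
loss = masked_cross_entropy(probabilities, targets, lengths)
```

Here the padded step of the second sample would have added a large term (the target has probability 0.05), so leaving it out noticeably changes the value being minimized.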
The get_loss function is defined as follows:
That’s it, you have a Seq2Seq ;).
From experience we know that initializing the network correctly may improve the learning process, both in terms of performance and learning speed.
There are many ways to initialize our encoder-decoder. One example is to initialize each LSTM layer by temporarily adding a dense layer in order to transform the network into a classifier. Another example could be to ask your network to repeat the given input (i.e. if you give the network the sentence “my name is Brian”, it will return “my name is Brian”).
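For the second option, the training data is simply the input paired with itself. A minimal sketch, assuming hypothetical `<sos>` and `<eos>` spellings and a decoder that receives the target shifted by one step:

```python
def make_copy_pairs(sentences, sos="<sos>", eos="<eos>"):
    """Build (encoder_input, decoder_input, decoder_target) triples
    in which the target repeats the input."""
    pairs = []
    for tokens in sentences:
        encoder_input = tokens + [eos]
        decoder_input = [sos] + tokens   # what the decoder is fed at each step
        decoder_target = tokens + [eos]  # what the decoder should predict
        pairs.append((encoder_input, decoder_input, decoder_target))
    return pairs


pairs = make_copy_pairs([["my", "name", "is", "Brian"]])
```

Since no labeled output is required, any corpus of sentences can be used for this kind of pre-training.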
While beam search was not included in this example, it is not very hard to implement and other examples can easily be found online. For more information, I invite you to take a look at this great tutorial and this very informative blog post.