Why sequence models
Examples of sequence data
- speech recognition
- Music generation
- Sentiment classification
- DNA sequence analysis
- Machine translation
- Video activity recognition
- Name entity recognition
Notation
Motivating example
Name entity recognition
x | Harry | Potter | and | Hermione | Granger | invented | a | new | spell. |
---|---|---|---|---|---|---|---|---|---|
y | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
$$x^{<1>} \to x^{<9>} stand\ for\ the\ 1st\ to\ 9th\ element\ in\ x\ vector, T_x=9\ length=9$$
$$y^{<1>} \to y^{<9>} stand\ for\ the\ 1st\ to\ 9th\ element\ in\ y\ vector, T_y=9\ length =9$$
$x^{(i)<t>}$ refers to $(i)$th training example's $<t>$th element.
$T_x^{(i)}$ would be the input sequence length for training example i.
Representing words
To represent words in a sentence, firstly we need a Vocabulary/Dictionary which is a list of the words that you will use in your representations.
a | aaron | ... | and | ... | harry | ... | potter | ... | zulu |
---|---|---|---|---|---|---|---|---|---|
1 | 2 | ... | 367 | ... | 4075 | ... | 6830 | ... | 10000 |
Use one-hot encoding
$$x^{<1>}= \begin{bmatrix} 0\\ \vdots \\ 0\\ 1\leftarrow4075\\ 0\\ \vdots \\ 0 \end{bmatrix}; x^{<2>}=\begin{bmatrix} 0\\ \vdots \\ 0\\ 1\leftarrow6830\\ 0\\ \vdots \\ 0 \end{bmatrix}; x^{<3>}=\begin{bmatrix} 0\\ \vdots \\ 0\\ 1\leftarrow367\\ 0\\ \vdots \\ 0 \end{bmatrix}; x^{<7>}=\begin{bmatrix} 1\\ \vdots \\ 0\\ 0\\ 0\\ \vdots \\ 0 \end{bmatrix}$$ When a word can't be found in the Vocabulary, we shall add $<UNK>$ to represent this word.
Recurrent Neural Networks
Why not a standard network?
Some problems:
- Inputs, outputs can be different lengths in different examples.
- Doesn't share features learned across different positions of text.
Recurrent Neural Networks
$y^{<i>}$ not only depends on the current input$x^{<i>}$, but also depends on former inputs $x^{<i-1>} \dots x^{1}$.
Each layer passes an activation function $a^{<i>}$ to next layer $<i+1>$, at the beginning, we will need a $a^{<0>}$ which is usually a vector of zeros.
Uni-direction vs Bi-direction
There is a problem for this Uni-diretional RNN:
- The current level output can only consider current and former inputs.
But sometimes you have to look at following words, for example:
- He said, "Teddy Roosevelt was a great President."
- He said, "Teddy bears are on sale!"
In this case, the first three words are same, in order to figure out what "Teddy" exactely refers to, you should take the following words like "Roosevelt" or "bears" into considerations. So, there comes a RNN model that considers both former and following words, which is called Bidirectional RNN(BRNN).
Forward Propagation
At the ever beginning $$a^{<0>}=\vec{0}$$ For the first level $$a^{<1>}=g_1(w_{aa}a^{<0>}+w_{ax}x^{<1>}+b_a)\leftarrow \underline{tanh}\ or\ ReLU\ \hat y^{<1>}=g_2(w_{ya}a^{<1>}+b_y)\leftarrow\ sigmoid$$ More generally $$a^{<t>}=g_1(w_{aa}a^{<t-1>}+w_{ax}x^{<t>}+b_a)\leftarrow \underline{tanh}\ or\ ReLU\ \hat y^{<t>}=g_2(w_{ya}a^{<t>}+b_y)\leftarrow\ sigmoid$$
Simplified RNN notation
You can write the above formula like: $$a^{<t>}=g_1(w_{a}\binom{a^{<t-1>}}{x{<t>}}+b_a) \ [w_{aa}\ w_{ax}]=w_a$$
Backpropagation through time
Usually, a DL framework can automatically take care of backpropagation.
Defining a loss function to finish the backpropogation.
Different types of RNNs
When we have length of input equals to length of output, it's called: Many-to-many architecture: $T_x=T_y$.
By contrast, for a sentiment classification problem,$x=text\ ;\ y=0/1 \ or 1..5 $, rather than using every input to have an output, you can just let RNN read the whole sentence and get an output at the last step. This model is called Many-to-one architecture.
There is also one-to-one architecture, which is maybe less interesting.
In the end, you can have one-to-many architecture that can be used for music generation. You provide an integer as the gender of music or as a first key, you get a chapiter of music.
Another example is Machine translation, it's another version of many-to-many architecture, but there are two distinguishable parts: "encoder"(pure inputs) and "decoder"(pure outputs).
Languages model and sequence generation
Speech recognition
For example, when I say
The apple and pear salad were delicious.
even for those best speech recognition systems, it's hard to figure out
- The apple and pair salad
- The apple and pear salad
A speech recognition system could give a probability to each sentence. If the second one has a higher probability, it will be chosen.
language modelling with an RNN
Trainng set: large corpus of english text.
You need firstly tokenize the text.
Cats average 15 hours of sleep a day. EOS => $y^{<1>}\ y^{<2>}\ y^{<3>}\ ... y^{<9>}$$$<EOS>= end\ of\ sentence$$
The Egyptian Mau is a bread of cat. The special word Mau=>$<UNK>=Unknown$
RNN model
At each step of a RNN network, the output should be a probability of all words in a dictionary given last words.
To train this network, the loss function at step t could be $L(\hat{y}^{<t>},y^{<t>})=-\sum{y_i^{<t>}log\hat{y}_i^{<t>}}$. The overall loss is $L=\sum{L^{<t>}(\hat{y}^{<t>},y^{<t>})}$.
With thismodel, we can predict the next word given a set of words.
Sampling novel sequences
Sampling a sequence from a trained RNN
Word-level language model
Use words as inputs to train RNN models
Character-level language model
Use characters as inputs to train RNN models pros:
- Don't have to worry about UNK cons:
- you end up with much longer sequences, 10-20 words equal 100-200 characters, so it's harder to capture dependencies from former characters
- more computationally expensive to train
Vanishing gradients with RNNs
Just like deep neural networks, it can be quite difficult to let the later time steps affect the earlier computations. But for RNNs, it's possible to have both vanishing and exploding gradients.
A possible solution for exploding gradients is to apply gradient clipping, which means looking at your gradient vector, if it's bigger than some thresholds, resize these gradient vectors.
Gated Recurrent Unit(GRU)
GRU is a possible solution to vanishing gradients.
GRU(simplified)
For example, there is a sentence "The cat, which already ate... , was full ", you have to memorize it's "cat" or "cats" to decide "was" or "were". $$c=memory\ cell$$ $$c^{<t>}=a^{<t>}$$ Then there is a candidate for replacing the memory cell: $$\tilde{c}^{<t>}=tanh(w_c[c^{<t-1>},x^{<t>}]+b_c)$$ $$\Gamma_u=\sigma(w_u[c^{<t-1>},x^{<t>}]+b_u), u=update$$ The $\Gamma_u$ should be 1 at the word "cat", 0 at other words.
For a sigmoid function, the result is a number between 0 and 1, and most time it's either 0 or 1. So it's called Gate.
Then use the following function to update memory cells: $$c^{<t>}=\Gamma_u*\tilde{c}^{<t>}+(1-\Gamma_u)*c^{<t-1>}$$
Full GRU
In the full version GRU, we add a new gate $\Gamma_r$ in the candidate function, r stands for relevance. The main structure keeps unchanged.
Long Short term Memory(LSTM)
Another solution, maybe even more powerful than GRU in some cases, is LSTM. Instead of using two gates, an update, a forget gate and an output gate are used in LSTM. At each time step, we use those three gates to update memory cells.
LSTM came out earlier than GRU, but in different cases, they could have different performances.
Bidirectional RNN
In the previous models, there are only forwards connections, but in BRNN, there will be also backwards connections. A Bidirectional structure with LSTM blocks is a first thing to try for recent NLP problems.
Deep RNNs
In a deep RNN, we shall use a chained blocks as a layer. So to compute a block, both horizontal and vertical inputs are needed. Like $a^{[2]<3>}=g(w_a^{[2]}[a^{[2]<2>},a^{[1]<3>}]+b_a^{[3]})$
Usually there aren't many horizontal layers, but we add some vertical connections at the last layer outputs.