Sequence Models Week 1

Why sequence models

Examples of sequence data

  1. Speech recognition
  2. Music generation
  3. Sentiment classification
  4. DNA sequence analysis
  5. Machine translation
  6. Video activity recognition
  7. Named entity recognition

Notation

Motivating example

Named entity recognition

x Harry Potter and Hermione Granger invented a new spell.
y 1 1 0 1 1 0 0 0 0

$x^{<1>}, \dots, x^{<9>}$ denote the 1st through 9th elements of the input sequence $x$, and $T_x=9$ is the input length.

$y^{<1>}, \dots, y^{<9>}$ denote the 1st through 9th elements of the output sequence $y$, and $T_y=9$ is the output length.

$x^{(i)<t>}$ refers to the $t$-th element of the $i$-th training example.

$T_x^{(i)}$ is the input sequence length of training example $i$.

Representing words

To represent words in a sentence, we first need a vocabulary (dictionary), which is a list of the words that will be used in the representations.

Word:  a  aaron  ...  and  ...  harry  ...  potter  ...  zulu
Index: 1  2      ...  367  ...  4075   ...  6830    ...  10000

Use one-hot encoding

$$x^{<1>}= \begin{bmatrix} 0\\ \vdots \\ 0\\ 1\leftarrow 4075\\ 0\\ \vdots \\ 0 \end{bmatrix};\quad x^{<2>}=\begin{bmatrix} 0\\ \vdots \\ 0\\ 1\leftarrow 6830\\ 0\\ \vdots \\ 0 \end{bmatrix};\quad x^{<3>}=\begin{bmatrix} 0\\ \vdots \\ 0\\ 1\leftarrow 367\\ 0\\ \vdots \\ 0 \end{bmatrix};\quad x^{<7>}=\begin{bmatrix} 1\leftarrow 1\\ 0\\ \vdots \\ 0 \end{bmatrix}$$

Each $x^{<t>}$ is a 10,000-dimensional one-hot vector with a 1 at the word's index in the vocabulary. When a word is not in the vocabulary, we use a special token $\langle UNK \rangle$ to represent it.
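A minimal numpy sketch of this one-hot representation. The tiny `word_to_index` mapping just reuses the toy indices from the example above (a real dictionary would map all 10,000 words), and mapping unknown words to a dedicated $\langle UNK \rangle$ index is an assumption for illustration.

```python
import numpy as np

vocab_size = 10000
# Toy word -> index mapping reusing the indices above (1-based, as in the dictionary).
word_to_index = {"a": 1, "and": 367, "harry": 4075, "potter": 6830, "<UNK>": 10000}

def one_hot(word):
    """Return a (vocab_size, 1) one-hot column vector; unknown words map to <UNK>."""
    idx = word_to_index.get(word.lower(), word_to_index["<UNK>"])
    x = np.zeros((vocab_size, 1))
    x[idx - 1] = 1.0          # subtract 1 because the dictionary indices start at 1
    return x

x1 = one_hot("Harry")         # x^<1>: a 1 at position 4075
x7 = one_hot("a")             # x^<7>: a 1 at position 1
print(x1[4074, 0], x7[0, 0])  # 1.0 1.0
```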

Recurrent Neural Networks

Why not a standard network?

Some problems:

  • Inputs and outputs can have different lengths in different examples.
  • A standard network doesn't share features learned across different positions of the text.

Recurrent Neural Networks

The output $\hat y^{<t>}$ depends not only on the current input $x^{<t>}$, but also on the earlier inputs $x^{<1>}, \dots, x^{<t-1>}$.

Each time step passes an activation $a^{<t>}$ to the next time step $t+1$. At the beginning we need an initial activation $a^{<0>}$, which is usually a vector of zeros.

Uni-direction vs Bi-direction

There is a problem with this unidirectional RNN:

  • The output at the current step can only take the current and earlier inputs into account.

But sometimes you have to look at the following words as well, for example:

  • He said, "Teddy Roosevelt was a great President."
  • He said, "Teddy bears are on sale!"

In this case, the first three words are the same. To figure out what "Teddy" exactly refers to, you have to take the following words, like "Roosevelt" or "bears", into consideration. So there is an RNN variant that considers both earlier and later words, called the Bidirectional RNN (BRNN).

Forward Propagation

At the very beginning:

$$a^{<0>}=\vec{0}$$

For the first time step:

$$a^{<1>}=g_1(w_{aa}a^{<0>}+w_{ax}x^{<1>}+b_a)$$

$$\hat y^{<1>}=g_2(w_{ya}a^{<1>}+b_y)$$

where $g_1$ is usually $\tanh$ or ReLU and $g_2$ is, for example, a sigmoid. More generally:

$$a^{<t>}=g_1(w_{aa}a^{<t-1>}+w_{ax}x^{<t>}+b_a)$$

$$\hat y^{<t>}=g_2(w_{ya}a^{<t>}+b_y)$$

Simplified RNN notation

You can write the above formula more compactly as:

$$a^{<t>}=g_1\left(w_{a}\begin{bmatrix}a^{<t-1>}\\ x^{<t>}\end{bmatrix}+b_a\right),\qquad w_a=[w_{aa}\ \ w_{ax}]$$
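A minimal numpy sketch of one forward step using this stacked-weight notation. The sizes ($n_x=10000$ one-hot inputs, $n_a=100$ hidden units, a single sigmoid output as in the named-entity example) are illustrative assumptions.

```python
import numpy as np

def rnn_cell_forward(x_t, a_prev, Wa, ba, Wy, by):
    """One step: a<t> = tanh(Wa @ [a<t-1>; x<t>] + ba), y_hat<t> = sigmoid(Wy @ a<t> + by)."""
    concat = np.concatenate([a_prev, x_t], axis=0)        # stack a<t-1> on top of x<t>
    a_t = np.tanh(Wa @ concat + ba)                       # g1 = tanh
    y_hat_t = 1.0 / (1.0 + np.exp(-(Wy @ a_t + by)))      # g2 = sigmoid
    return a_t, y_hat_t

# Illustrative sizes (assumptions): n_x one-hot words, n_a hidden units, n_y = 1 binary label
n_x, n_a, n_y = 10000, 100, 1
rng = np.random.default_rng(0)
Wa = rng.standard_normal((n_a, n_a + n_x)) * 0.01         # w_a = [w_aa  w_ax]
ba = np.zeros((n_a, 1))
Wy = rng.standard_normal((n_y, n_a)) * 0.01               # w_ya
by = np.zeros((n_y, 1))

a_prev = np.zeros((n_a, 1))                               # a<0> = vector of zeros
x1 = np.zeros((n_x, 1)); x1[4074] = 1.0                   # one-hot x<1> ("harry", index 4075)
a1, y_hat1 = rnn_cell_forward(x1, a_prev, Wa, ba, Wy, by)
```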

Backpropagation through time

Usually, a DL framework can take care of backpropagation automatically.

We only need to define a loss function; the gradients then flow backwards through the time steps, which is why this is called backpropagation through time.

Different types of RNNs

When the length of the input equals the length of the output ($T_x=T_y$), it's called the many-to-many architecture.

By contrast, for a sentiment classification problem ($x$ = text, $y$ = 0/1 or 1..5), rather than producing an output at every step, you can just let the RNN read the whole sentence and produce a single output at the last step. This model is called the many-to-one architecture.

There is also a one-to-one architecture, which is maybe less interesting.

Finally, there is the one-to-many architecture, which can be used for music generation: you provide an integer as the genre of the music, or as the first note, and you get a piece of music as output.

Another example is machine translation. It is another version of the many-to-many architecture, but with two distinct parts: an "encoder" (which only reads inputs) and a "decoder" (which only produces outputs), so $T_x$ and $T_y$ can differ.

Language model and sequence generation

Speech recognition

For example, when I say

The apple and pear salad was delicious.

even the best speech recognition systems find it hard to decide between

  1. The apple and pair salad
  2. The apple and pear salad

A speech recognition system uses a language model to assign a probability to each candidate sentence; if the second one has a higher probability, it is chosen.
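In other words, a language model estimates the probability of an entire sentence, which can be factored into the probability of each word given all the previous ones (this is exactly what the RNN below models step by step):

$$P(y^{<1>},y^{<2>},\dots,y^{<T_y>})=P(y^{<1>})\,P(y^{<2>}\mid y^{<1>})\cdots P(y^{<T_y>}\mid y^{<1>},\dots,y^{<T_y-1>})$$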

Language modelling with an RNN

Training set: a large corpus of English text.

You first need to tokenize the text.

Cats average 15 hours of sleep a day. $\langle EOS \rangle$ $\Rightarrow$ $y^{<1>},\ y^{<2>},\ \dots,\ y^{<9>}$, where $\langle EOS \rangle$ is a special end-of-sentence token.

The Egyptian Mau is a breed of cat. The rare word "Mau" is replaced by the special token $\langle UNK \rangle$ (unknown word).

RNN model

At each step of the RNN, the output (a softmax over the vocabulary) gives the probability of every possible word given the preceding words.

To train this network, the loss function at step $t$ is $L^{<t>}(\hat{y}^{<t>},y^{<t>})=-\sum_i{y_i^{<t>}\log\hat{y}_i^{<t>}}$. The overall loss is $L=\sum_t{L^{<t>}(\hat{y}^{<t>},y^{<t>})}$.

With this model, we can predict the next word given the preceding words.
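A minimal numpy sketch of this loss for one training sentence, assuming `y_hat` holds one softmax vector per time step and `targets` holds the index of the true next word at each step (both names are illustrative):

```python
import numpy as np

def sequence_loss(y_hat, targets):
    """Overall loss L = sum over t of L<t> = -log y_hat<t>[correct word]."""
    # y_hat: list of (vocab_size, 1) softmax outputs, one per time step
    # targets: list of integer indices of the true word y<t> at each step
    return sum(-np.log(y_hat_t[idx, 0]) for y_hat_t, idx in zip(y_hat, targets))

# Toy check (assumption): a 5-word vocabulary and a 3-step sentence with uniform predictions
y_hat = [np.full((5, 1), 0.2) for _ in range(3)]
targets = [3, 1, 4]
print(sequence_loss(y_hat, targets))   # 3 * -log(0.2) ≈ 4.83
```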

Sampling novel sequences

Sampling a sequence from a trained RNN

Word-level language model

Use words as inputs to train RNN models
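To sample a novel sequence, you feed in $a^{<0>}=\vec 0$ and $x^{<1>}=\vec 0$, sample a word from the softmax output, feed that word back in as the next input, and stop when $\langle EOS \rangle$ is sampled (or a maximum length is reached). A hedged numpy sketch, assuming trained weights in the stacked notation from earlier; the function name and argument layout are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0))
    return e / e.sum(axis=0)

def sample_sentence(Wa, ba, Wy, by, index_to_word, max_len=20):
    """Sample words one at a time, feeding each sampled word back as the next input."""
    vocab_size = len(index_to_word)
    a_prev = np.zeros((ba.shape[0], 1))                      # a<0> = zeros
    x_t = np.zeros((vocab_size, 1))                          # x<1> = zero vector
    words = []
    for _ in range(max_len):
        a_prev = np.tanh(Wa @ np.concatenate([a_prev, x_t]) + ba)
        y_hat = softmax(Wy @ a_prev + by)                    # P(next word | words so far)
        idx = np.random.choice(vocab_size, p=y_hat.ravel())  # sample instead of taking the argmax
        if index_to_word[idx] == "<EOS>":
            break
        words.append(index_to_word[idx])
        x_t = np.zeros((vocab_size, 1)); x_t[idx] = 1.0      # feed the sampled word back in
    return " ".join(words)
```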

Character-level language model

Use characters as inputs to train RNN models.

Pros:

  • You don't have to worry about $\langle UNK \rangle$.

Cons:

  • You end up with much longer sequences (10-20 words correspond to 100-200 characters), so it's harder to capture dependencies on earlier parts of the sequence.
  • It is more computationally expensive to train.

Vanishing gradients with RNNs

Just like in very deep neural networks, it can be quite difficult for the errors at later time steps to affect the computations at much earlier time steps, so long-range dependencies (like "cat ... was" vs "cats ... were") are hard to capture. For RNNs, both vanishing and exploding gradients are possible.

A possible solution for exploding gradients is gradient clipping: look at your gradient vectors and, if they exceed some threshold, rescale them so they stay below it.
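A minimal sketch of gradient clipping in numpy, rescaling all gradients when their combined L2 norm exceeds a threshold (the threshold value and the dict layout are illustrative assumptions; element-wise clipping with `np.clip` is another common variant):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """If the overall gradient norm exceeds max_norm, rescale every gradient by the same factor."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = {name: g * scale for name, g in grads.items()}
    return grads

# Toy usage (assumption): gradients stored in a dict keyed by parameter name
rng = np.random.default_rng(0)
grads = {"dWa": rng.standard_normal((100, 10100)), "dba": rng.standard_normal((100, 1))}
grads = clip_gradients(grads, max_norm=5.0)
```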

Gated Recurrent Unit (GRU)

GRU is a possible solution to vanishing gradients.

GRU(simplified)

For example, in the sentence "The cat, which already ate ..., was full", the network has to remember whether the subject was "cat" or "cats" in order to choose "was" or "were".

$$c = \text{memory cell},\qquad c^{<t>}=a^{<t>}$$

There is a candidate for replacing the memory cell:

$$\tilde{c}^{<t>}=\tanh(w_c[c^{<t-1>},x^{<t>}]+b_c)$$

and an update gate ($u$ stands for update):

$$\Gamma_u=\sigma(w_u[c^{<t-1>},x^{<t>}]+b_u)$$

$\Gamma_u$ should be close to 1 at the word "cat" (store the new value) and close to 0 at the following words (keep the memory).

The sigmoid outputs a number between 0 and 1, and most of the time it is very close to either 0 or 1, which is why $\Gamma$ is called a gate.

The memory cell is then updated with an element-wise blend: $$c^{<t>}=\Gamma_u*\tilde{c}^{<t>}+(1-\Gamma_u)*c^{<t-1>}$$
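A minimal numpy sketch of one simplified-GRU step implementing the three equations above; the sizes and the random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step_simplified(x_t, c_prev, Wc, bc, Wu, bu):
    """Candidate c~<t>, update gate Gamma_u, then an element-wise blend with c<t-1>."""
    concat = np.concatenate([c_prev, x_t], axis=0)          # [c<t-1>, x<t>]
    c_tilde = np.tanh(Wc @ concat + bc)                     # candidate memory cell
    gamma_u = sigmoid(Wu @ concat + bu)                     # ~1 -> overwrite, ~0 -> keep memory
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t                                              # a<t> = c<t> in the simplified GRU

# Illustrative sizes (assumptions): n_x = 10000, n_c = 100
n_x, n_c = 10000, 100
rng = np.random.default_rng(0)
Wc = rng.standard_normal((n_c, n_c + n_x)) * 0.01; bc = np.zeros((n_c, 1))
Wu = rng.standard_normal((n_c, n_c + n_x)) * 0.01; bu = np.zeros((n_c, 1))
c_t = gru_step_simplified(np.zeros((n_x, 1)), np.zeros((n_c, 1)), Wc, bc, Wu, bu)
```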

Full GRU

In the full GRU, a new gate $\Gamma_r$ is added inside the candidate computation, where $r$ stands for relevance. The main structure stays unchanged.
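For reference, the full GRU equations in the same notation ($\Gamma_r$ controls how much of $c^{<t-1>}$ is used when computing the candidate):

$$\Gamma_r=\sigma(w_r[c^{<t-1>},x^{<t>}]+b_r)$$

$$\tilde{c}^{<t>}=\tanh(w_c[\Gamma_r * c^{<t-1>},x^{<t>}]+b_c)$$

$$\Gamma_u=\sigma(w_u[c^{<t-1>},x^{<t>}]+b_u)$$

$$c^{<t>}=\Gamma_u*\tilde{c}^{<t>}+(1-\Gamma_u)*c^{<t-1>}$$

$$a^{<t>}=c^{<t>}$$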

Long Short-Term Memory (LSTM)

Another solution, in some cases even more powerful than the GRU, is the LSTM. Instead of the GRU's two gates, the LSTM uses three: an update gate, a forget gate, and an output gate. At each time step, these three gates are used to update the memory cell $c^{<t>}$ and the activation $a^{<t>}$.
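Written in the same notation, the standard LSTM equations are (the gates and the candidate use $a^{<t-1>}$ rather than $c^{<t-1>}$, and $a^{<t>}$ is no longer equal to $c^{<t>}$):

$$\tilde{c}^{<t>}=\tanh(w_c[a^{<t-1>},x^{<t>}]+b_c)$$

$$\Gamma_u=\sigma(w_u[a^{<t-1>},x^{<t>}]+b_u)$$

$$\Gamma_f=\sigma(w_f[a^{<t-1>},x^{<t>}]+b_f)$$

$$\Gamma_o=\sigma(w_o[a^{<t-1>},x^{<t>}]+b_o)$$

$$c^{<t>}=\Gamma_u*\tilde{c}^{<t>}+\Gamma_f*c^{<t-1>}$$

$$a^{<t>}=\Gamma_o*\tanh(c^{<t>})$$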

The LSTM actually came out much earlier than the GRU; neither is universally better, and their relative performance depends on the problem.

Bidirectional RNN

In the previous models there are only forward connections, but a BRNN also has backward connections: a second set of activations computed from right to left, so the prediction at each step can use both earlier and later inputs. A bidirectional architecture with LSTM blocks is a reasonable first thing to try for many NLP problems, although it does require the entire input sequence before making predictions.
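The prediction at each time step then combines the forward and backward activations:

$$\hat y^{<t>}=g(w_y[\overrightarrow{a}^{<t>},\overleftarrow{a}^{<t>}]+b_y)$$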

Deep RNNs

In a deep RNN, we stack several recurrent layers, each of which is a chain of blocks through time. To compute a block, both a horizontal input (the same layer's activation at the previous time step) and a vertical input (the lower layer's activation at the current time step) are needed, e.g. $a^{[2]<3>}=g(w_a^{[2]}[a^{[2]<2>},a^{[1]<3>}]+b_a^{[2]})$.

Usually there aren't many horizontally connected (recurrent) layers, since they are expensive to train, but we can add some deeper, non-recurrent layers on top of the outputs at each time step.
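A minimal numpy sketch of this stacked forward pass, looping over time steps (outer) and layers (inner); the layer count and sizes are illustrative assumptions, and the output predictions $\hat y^{<t>}$ are omitted for brevity.

```python
import numpy as np

def deep_rnn_forward(x_seq, Wa, ba):
    """Stacked RNN: layer l at time t uses a[l]<t-1> (horizontal) and a[l-1]<t> (vertical)."""
    n_a = ba[0].shape[0]
    a = [np.zeros((n_a, 1)) for _ in Wa]                # a[l]<0> = zeros for every layer
    top_activations = []
    for x_t in x_seq:
        below = x_t                                     # vertical input to layer 1 is x<t>
        for l in range(len(Wa)):
            concat = np.concatenate([a[l], below], axis=0)
            a[l] = np.tanh(Wa[l] @ concat + ba[l])      # a[l]<t> = g(w_a[l][a[l]<t-1>, a[l-1]<t>] + b_a[l])
            below = a[l]                                # feeds the layer above
        top_activations.append(a[-1])
    return top_activations

# Illustrative sizes (assumptions): n_x = 50 inputs, n_a = 100 units, 3 layers, Tx = 4 steps
n_x, n_a, L, Tx = 50, 100, 3, 4
rng = np.random.default_rng(0)
Wa = [rng.standard_normal((n_a, n_a + (n_x if l == 0 else n_a))) * 0.01 for l in range(L)]
ba = [np.zeros((n_a, 1)) for _ in range(L)]
outputs = deep_rnn_forward([rng.standard_normal((n_x, 1)) for _ in range(Tx)], Wa, ba)
```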