Sequence Models week 3

Various sequence-to-sequence architectures

Basic Models

Machine translation

For machine translation, you need an encoder-decoder model. In this model, each block can be an RNN/LSTM/GRU.
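As a minimal sketch of this structure (plain numpy with vanilla tanh RNN cells and hypothetical weight names, not the course's actual implementation): the encoder folds the source sentence into a context vector, and the decoder generates output tokens from that context.

```python
import numpy as np

def rnn_cell(x, h, Wx, Wh, b):
    # one vanilla tanh RNN step (each block could equally be an LSTM/GRU)
    return np.tanh(Wx @ x + Wh @ h + b)

def encode(xs, Wx, Wh, b):
    # fold the whole source sentence into a final hidden state (the context)
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = rnn_cell(x, h, Wx, Wh, b)
    return h

def decode(context, Wx, Wh, b, Wy, vocab_size, steps):
    # greedily emit `steps` tokens, feeding each prediction back in
    h, y, out = context, np.zeros(vocab_size), []
    for _ in range(steps):
        h = rnn_cell(y, h, Wx, Wh, b)
        idx = int(np.argmax(Wy @ h))
        out.append(idx)
        y = np.eye(vocab_size)[idx]  # one-hot of the emitted word
    return out
```

Note the decoder's input weight matrix has a different shape from the encoder's, since the decoder consumes one-hot word vectors rather than source features.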

Image captioning

How do we turn an image into a caption? In the example, we use a pre-trained AlexNet as the encoder, followed by several RNN blocks as the decoder.

Picking the most likely sentence

A machine translation problem can be regarded as a conditional language model, where "conditional" means conditioned on the original sentence: the model estimates $P(y^{<1>},...,y^{<T_y>}|x)$.

### Why not a greedy search?

Greedy search: find the most likely word for position 1, then for position 2, and so on. It doesn't work well when we need to translate a whole sentence, because locally optimal word choices do not necessarily form the globally most likely sentence.

Beam Search

step 1

B=3 means that the 3 most likely first words are kept.

step 2

For each candidate word chosen in step 1, run the RNN blocks one more step to score the pairs $y^{<1>},y^{<2>}$, then cut these 30000 possibilities down to the 3 most likely.

step 3

Same process as step 2.
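The three steps above can be sketched as follows; `toy_model` is a made-up conditional language model (not from the course) chosen so that greedy search is suboptimal:

```python
import math

def beam_search(next_probs, vocab, B, length):
    # beams: list of (log-probability, partial sentence), best first
    beams = [(0.0, [])]
    for _ in range(length):
        candidates = []
        for logp, seq in beams:
            probs = next_probs(seq)  # P(next word | prefix)
            for w in vocab:
                candidates.append((logp + math.log(probs[w]), seq + [w]))
        # keep only the B highest-scoring partial sentences
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:B]
    return beams[0][1]

def toy_model(prefix):
    # toy next-word distribution over a two-word vocabulary
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    return {"a": 0.5, "b": 0.5} if prefix[-1] == "a" else {"a": 0.9, "b": 0.1}
```

Here greedy search picks "a" first (0.6) and ends with probability 0.6 × 0.5 = 0.30, while beam search with B=2 keeps "b" alive and finds "b a" with probability 0.4 × 0.9 = 0.36.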

Refinements to Beam Search

Length normalization

$P(y^{<1>}...y^{<T_y>}|x)$ can be thought of as a product of many conditional probabilities, which can become too small (numerical underflow), so in practice we maximize the sum of log probabilities instead.

The model therefore tends to output short sentences, because a product of fewer probabilities is likely to be bigger.

To avoid this bias, one approach is to divide the sum of log probabilities by the number of words in the sentence, $T_y$ (in practice $T_y^{\alpha}$ with $\alpha \approx 0.7$).
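A small sketch of this normalization on made-up numbers (the probabilities here are illustrative, not from the course):

```python
import math

def length_normalized_score(logprobs, alpha=0.7):
    # sum of per-word log-probabilities, divided by (sentence length)^alpha
    return sum(logprobs) / (len(logprobs) ** alpha)

short = [math.log(0.5)] * 2   # 2-word sentence, P = 0.25
long_ = [math.log(0.6)] * 5   # 5-word sentence, P ~= 0.078
```

The raw log-probability prefers the short sentence (−1.39 vs −2.55), but the length-normalized score prefers the longer sentence, whose per-word probabilities are actually higher.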

Beam search discussion

Beam width B is a trade-off between performance and speed: a larger B considers more candidates but costs more computation and memory.

Error analysis in beam search

When there is an error, we want to determine whether it comes from the RNN or from beam search, by comparing $P(y^*|x)$ for the human translation $y^*$ against $P(\hat{y}|x)$ for the beam-search output $\hat{y}$.
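The attribution rule can be written down directly (log-probabilities here would come from running the RNN on both sentences):

```python
def attribute_error(logp_human, logp_beam):
    # logp_human: log P(y* | x) of the human translation under the RNN
    # logp_beam:  log P(y-hat | x) of the beam-search output under the RNN
    if logp_human > logp_beam:
        # the model actually prefers y*, so beam search failed to find it
        return "beam search"
    # the model assigns more probability to the wrong translation
    return "RNN"
```

Tallying this verdict over many dev-set errors tells you whether to increase the beam width or to keep improving the model.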

Bleu Score

There may be several possible translations for a sentence. How do we evaluate one? A way is to count the number of candidate words that appear in the reference, clipping each word's count at the maximum number of times it appears in the reference (modified precision).

Bleu score on bigrams

Bleu score on unigrams

BP = brevity penalty, which penalizes translations that are shorter than the reference.
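A sketch of the clipped n-gram counting and the brevity penalty (this is the standard Bleu formulation, simplified to a single reference):

```python
import math
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_precision(candidate, reference, n):
    # clip each candidate n-gram count by its count in the reference
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))

def brevity_penalty(candidate, reference):
    c, r = len(candidate), len(reference)
    return 1.0 if c > r else math.exp(1 - r / c)
```

The classic degenerate candidate "the the the the the the the" against the reference "the cat is on the mat" scores a unigram modified precision of only 2/7, since "the" appears at most twice in the reference, and its bigram modified precision is 0.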

## Attention Model Intuition

An encoder-decoder model can lose performance when a sentence is too long.

In a bidirectional RNN, you can add an attention weight to each word, which measures how much attention you should pay to that word. These weights are then fed into another RNN (the decoder) to produce the output.

More precisely, $\alpha^{<t,t'>} = amount\ of\ attention\ that\ y^{<t>}\ should\ pay\ to\ a^{<t'>}$

Computing attention

Use a small NN to compute the scores $e^{<t,t'>}$, then pass them through a softmax to obtain the attention weights $\alpha^{<t,t'>}$.
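A minimal sketch of one attention step, assuming an additive scorer with hypothetical weight names `W` and `v` (the course's small NN may be parameterized differently):

```python
import numpy as np

def attention(s_prev, a, W, v):
    # e<t'> = v . tanh(W @ [s_prev; a<t'>])  -- the small NN scoring each
    # encoder activation a<t'> against the previous decoder state s_prev
    e = np.array([v @ np.tanh(W @ np.concatenate([s_prev, a_t])) for a_t in a])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()          # softmax: weights are positive and sum to 1
    context = alpha @ a           # attention-weighted sum of encoder activations
    return alpha, context
```

The softmax guarantees the weights form a distribution over the input positions, so the context vector is a convex combination of the encoder activations.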

Speech recognition - Audio Data

Speech recognition

We can also apply an attention model to speech recognition.

Another approach is Connectionist Temporal Classification (CTC), which collapses repeated characters and blank symbols to form the transcript.

Trigger Word Detection

Use 0/1 labels to distinguish trigger words from normal audio: output 1 for a short window of time steps right after a trigger word ends, and 0 everywhere else.
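The labeling scheme can be sketched as follows (the window width of 50 steps is an illustrative choice, not fixed by the course):

```python
import numpy as np

def label_trigger(y_len, trigger_end, width=50):
    # output 1 for `width` time steps right after the trigger word ends
    y = np.zeros(y_len, dtype=int)
    y[trigger_end + 1 : trigger_end + 1 + width] = 1
    return y
```

Stretching the positive label over a window, rather than a single step, gives the network many more positive targets and makes the 0/1 classes less imbalanced.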