Various sequence to sequence qrchitectures
Basic Models
Machine translation
For a machine tanslation, you need a encoder-decoder model. In this model, each block can represent a RNN/LSTM/GRU.
Image captioning
How to interpret an image into a caption? In the example, we use a pre-trainde Alex network as encoders. Then we also need sevaral RNN blocks as decoders.
Picking the most likely sentence
A machine translation problem can be regarded as a conditional language model, and "conditional" means given the probability of the original sentence.
###Why not a greedy search? greedy search: find the most likely word for the position 1, then for the position 2, etc.. It doesn't work well when we need to translate a whole sentence.
Beam Search
step 1
B=3 means that 3 most likely words are chosen.
step 2
For each candidate word chosen in step 1, run RNN blocks to find out the most likely pairs of $y^{<1>},y^{<2>}$, then cut down these 30000 possibilities into 3 most possible.
step 3
Same pocesse like step 2.
Refinements to Beam Search
Length normalization
$P(y^{<1>}...y^{<T_y>}|x)$ can be thought as a product of many conditional probabilities, which can be too small.
The result of model tends to be short sentences, because the product of probability of short sentence is likely to be bigger.
To avoid this disadvantage, one approach is dividing this product by number of words in the sentence.
Beam search discussion
Beam width B is a compromise of performance and speed.
Error analysis in beam search
When there is an error, we want to be clear this error comes from RNN or beam search.
Blue Score
There may be several possible translation for a sentence. How to choose one? A way is to count the max number of words that appear in the reference.
Blue score on bigrams
Blue score on unigrams
BP= brevity penalty
##Attention model Intuition Encoder-decoder model can lose performance when a sentence is too long.
In a bidirectional RNN, you can add an attention weight to each word, which measures how much attention you should pay attention to this word. Then these weights are sent to another RNN to evaluate attentions.
More clearly, $\alpha^{<t,t\prime>}= amount\ of\ attention\ that\ y^{<t>}\ should\ pay\ to\ a^{<t\prime>}$
Computing attention
Use a small NN to get proper e
Speech recognition - Audio Data
Speech recognition
We can also apply an attention model to speech recognition.
Another approach is Connectionist temporal classification.
Trigger Word Detection
Use 0/1 to distinguish trigger words or nornam words.