Introduction to Word Embeddings
Word Representation
One-hot representation
So far, we have used a vocabulary and a one-hot representation:
$$Man(5391)= \begin{bmatrix} 0\\ \vdots \\ 0 \\ 1\\0\\ \vdots\\ 0 \end{bmatrix}=O_{5391}$$
$$Woman(9853)= \begin{bmatrix} 0\\ \vdots \\ 0 \\ 1\\0\\ \vdots\\ 0 \end{bmatrix}=O_{9853}$$
It is hard to recognize the relationship between two words from these vectors, because the inner product of any two different one-hot vectors is 0.
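A minimal sketch of why one-hot vectors carry no similarity information. The vocabulary size of 10,000 and the helper name `one_hot` are assumptions for illustration, and the word indices are treated as 0-based positions here.

```python
import numpy as np

VOCAB_SIZE = 10_000  # assumed vocabulary size


def one_hot(index, size=VOCAB_SIZE):
    """Return a vector of zeros with a single 1 at position `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec


o_man = one_hot(5391)
o_woman = one_hot(9853)

# Different words never share their single non-zero position.
dot_different = float(o_man @ o_woman)
# A word compared with itself gives 1.
dot_same = float(o_man @ o_man)
```

Every pair of distinct words gets the same score, so "man" is no closer to "woman" than to any other word.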
Featurized representation: word embedding
Instead of one-hot encoding, you can represent a word by its values on a set of features such as Gender, Royal, Age, Food, ..., Size, Cost.
## Using word embeddings
Transfer learning with word embeddings can be a good approach to solving NLP problems.
Here are the 3 steps that you need to do when considering this approach.
Properties of word embeddings
One interesting property: given two vectors such as $e_{man}$ and $e_{woman}$, the difference $e_{man}-e_{woman}$ captures the main difference between the two words. To turn this into an algorithm, for example to find the word that relates to "king" the way "woman" relates to "man", we look for the word $w$ whose embedding $e_w$ is most similar to $e_{king}-e_{man}+e_{woman}$, so we need a way to calculate similarity. Two similarity calculation methods:
- **Cosine similarity**
- **Euclidean distance**
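The analogy search and the two measures above can be sketched as follows. The four-feature toy embeddings (gender, royal, age, food) are made-up values for illustration, and the function names are hypothetical.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 means same direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def squared_euclidean_distance(u, v):
    """Squared distance; smaller means more similar."""
    diff = u - v
    return float(diff @ diff)


# Toy embeddings with features [gender, royal, age, food]; values are made up.
embeddings = {
    "man": np.array([-1.0, 0.01, 0.03, 0.09]),
    "woman": np.array([1.0, 0.02, 0.02, 0.01]),
    "king": np.array([-0.95, 0.93, 0.70, 0.02]),
    "queen": np.array([0.97, 0.95, 0.69, 0.01]),
    "apple": np.array([0.00, -0.01, 0.03, 0.95]),
}


def complete_analogy(a, b, c, embeddings):
    """Solve 'a is to b as c is to ?' by maximizing similarity to e_c - e_a + e_b."""
    target = embeddings[c] - embeddings[a] + embeddings[b]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))


answer = complete_analogy("man", "woman", "king", embeddings)
```

Swapping `cosine_similarity` for the negative of `squared_euclidean_distance` in the `key` function gives the distance-based variant.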
Embedding Matrix
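You can put all the embedding vectors into an embedding matrix $E$. Multiplying $E$ by a one-hot vector then gives the embedding vector of the corresponding word.
In practice, use a specialized lookup function to fetch an embedding, because multiplying by a mostly-zero one-hot vector wastes computation.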
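A small sketch comparing the two routes to an embedding vector. It assumes $E$ stores one word per column, and the sizes (300 features, 10,000 words) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, VOCAB_SIZE = 300, 10_000  # assumed sizes

# Embedding matrix: one column per vocabulary word (random stand-in values).
E = rng.standard_normal((EMBED_DIM, VOCAB_SIZE))

word_index = 5391
one_hot = np.zeros(VOCAB_SIZE)
one_hot[word_index] = 1.0

# Route 1: full matrix-vector product, touching every entry of E.
e_by_matmul = E @ one_hot
# Route 2: direct column lookup, touching only one column.
e_by_lookup = E[:, word_index]

same = bool(np.allclose(e_by_matmul, e_by_lookup))
```

Both routes return the same vector; the lookup simply skips the multiplications by zero.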
Learning Word Embeddings: Word2vec & GloVe
Learning word embeddings
A learning model can look like this: word embedding => neural network with parameters $w^{[1]}$ and $b^{[1]}$ => softmax with parameters $w^{[2]}$ and $b^{[2]}$
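A rough forward pass for a pipeline of this shape. The layer sizes, the tanh activation, the four-word context, and the random weights are all assumptions for illustration; training is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMBED_DIM, CONTEXT_LEN, HIDDEN = 50, 8, 4, 16  # assumed sizes

E = rng.standard_normal((EMBED_DIM, VOCAB_SIZE)) * 0.1          # embedding matrix
W1 = rng.standard_normal((HIDDEN, CONTEXT_LEN * EMBED_DIM)) * 0.1  # w^[1]
b1 = np.zeros(HIDDEN)                                            # b^[1]
W2 = rng.standard_normal((VOCAB_SIZE, HIDDEN)) * 0.1             # w^[2]
b2 = np.zeros(VOCAB_SIZE)                                        # b^[2]


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()


def predict_next_word(context_indices):
    """Look up and stack the context embeddings, then apply hidden layer and softmax."""
    x = np.concatenate([E[:, i] for i in context_indices])
    h = np.tanh(W1 @ x + b1)  # assumed activation
    return softmax(W2 @ h + b2)


probs = predict_next_word([3, 17, 8, 42])
```

The output is a probability distribution over the whole vocabulary; $E$ is learned jointly with the other parameters.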
A common algorithm for learning the matrix E works as follows: we want to predict the target word "juice" using the context words "a glass of orange". The context can be the last few words, the next few words, or both. Pairs of context/target are import
Word2Vec Skip-gram model
Word2Vec is a simple way to learn word embeddings.
When applying skip-grams for context/target selection, we pick a context word and then randomly pick a target word within a window around it.
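A minimal sketch of skip-gram pair selection, assuming a whitespace-tokenized sentence. The function name `skip_gram_pairs` and its parameters are hypothetical.

```python
import random


def skip_gram_pairs(tokens, window=10, pairs_per_context=1, seed=0):
    """For each context word, draw random target words from its surrounding window."""
    rng = random.Random(seed)
    pairs = []
    for pos, context in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens) - 1, pos + window)
        neighbours = [i for i in range(lo, hi + 1) if i != pos]
        if not neighbours:
            continue  # single-word input: nothing to pair with
        for _ in range(pairs_per_context):
            target = tokens[rng.choice(neighbours)]
            pairs.append((context, target))
    return pairs


sentence = "I want a glass of orange juice to go along with my cereal".split()
pairs = skip_gram_pairs(sentence, window=2)
```

Each pair is one supervised example: the context word is the input and the sampled target word is the label.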
A problem with softmax classification: every time you need the probability of a target word, you compute a sum with as many terms as there are words in the vocabulary. With a large vocabulary, this can be very slow.
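To make the cost concrete, here is a sketch of the softmax probability of one target word given one context word. The output parameters `theta`, the sizes, and the random values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMBED_DIM = 10_000, 50  # assumed sizes

E = rng.standard_normal((EMBED_DIM, VOCAB_SIZE)) * 0.1      # one column per context word
theta = rng.standard_normal((VOCAB_SIZE, EMBED_DIM)) * 0.1  # one row per target word


def target_probability(context_index, target_index):
    """Return p(target | context) and the number of terms in the normalizing sum."""
    e_c = E[:, context_index]
    scores = theta @ e_c        # one dot product per vocabulary word
    scores -= scores.max()      # numerical stability
    exp_scores = np.exp(scores)
    prob = float(exp_scores[target_index] / exp_scores.sum())
    return prob, scores.size


prob, terms_in_sum = target_probability(context_index=1, target_index=2)
```

A single probability needs a score for every word in the vocabulary, which is what makes the plain softmax expensive.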
So instead of a plain softmax classifier, hierarchical softmax can be a better fit. It uses a tree to narrow down which part of the vocabulary a word is in, so the cost grows with the logarithm of the vocabulary size rather than with the vocabulary size itself.
When sampling pairs, the target can be chosen, for instance, within a window of ±10 words around the context word. Choosing the context word is trickier: you should avoid over-sampling frequent, uninformative words like "the", "and", "to", "a".
An even simpler solution to the softmax cost problem is Negative Sampling.
Negative Sampling
We create a supervised problem: given a context/target pair as x, the label y is 1 if the pair really occurs together and 0 otherwise. To form the dataset, use 5-20 negative pairs per correct pair for small datasets and 2-5 for larger datasets.
With 4 negatives, for example, each training step only uses 1 correct pair and 4 incorrect pairs.
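A sketch of how one correct pair expands into labeled training examples. The tiny vocabulary and the function name are hypothetical, and negatives are drawn uniformly here purely as a simplification.

```python
import random


def make_training_examples(context, target, vocabulary, k=4, seed=0):
    """Return one positive example (label 1) followed by k negative examples (label 0)."""
    rng = random.Random(seed)
    examples = [((context, target), 1)]
    # Simplification: uniform sampling of distinct words other than the true target.
    # A sampled word may by chance be a plausible neighbour; it is still labeled 0.
    candidates = [w for w in vocabulary if w != target]
    for word in rng.sample(candidates, k):
        examples.append(((context, word), 0))
    return examples


vocabulary = ["orange", "juice", "king", "book", "the", "of", "glass", "cereal"]
examples = make_training_examples("orange", "juice", vocabulary, k=4)
```

Each example then trains one binary classifier output, so a step updates only `k + 1` outputs instead of the whole vocabulary.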
To select negative samples, the authors proposed a way:
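A minimal sketch, assuming the heuristic meant here is the one from the word2vec paper: sample a word with probability proportional to its corpus frequency raised to the power 3/4, which sits between uniform sampling and sampling by raw frequency. The word counts below are made up.

```python
import numpy as np


def negative_sampling_distribution(word_counts, power=0.75):
    """Return words and probabilities proportional to count ** power."""
    words = list(word_counts)
    weights = np.array([word_counts[w] for w in words], dtype=float) ** power
    return words, weights / weights.sum()


word_counts = {"the": 1000, "of": 500, "orange": 20, "juice": 10}  # made-up counts
words, probs = negative_sampling_distribution(word_counts)

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=4, p=probs)  # draw 4 negative words
```

Compared with raw frequencies, the exponent lowers the share of very frequent words such as "the" and raises the share of rare ones.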
GloVe word vectors
We define pairs of context/target in the following way:
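A sketch under the assumption that the definition meant here is GloVe's co-occurrence count: $X_{ij}$ is the number of times word $i$ appears in the context of word $j$. The window size and the helper name are hypothetical.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered pair of words appears within `window` positions."""
    counts = Counter()
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for other in range(lo, hi):
            if other != pos:
                counts[(word, tokens[other])] += 1
    return counts


tokens = "a glass of orange juice and a glass of milk".split()
X = cooccurrence_counts(tokens, window=1)
```

With a symmetric window the counts are symmetric, so `X[(i, j)] == X[(j, i)]`.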
The model looks like:
You can't guarantee that an individual axis of the learned embedding is well aligned to an interpretable feature.
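One way to see why axes need not align with features, assuming the model scores a pair through an inner product $\theta_i^T e_j$ as GloVe does: any invertible matrix $A$ can mix the axes without changing that score. The symbols and random values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 4  # assumed size

theta_i = rng.standard_normal(EMBED_DIM)
e_j = rng.standard_normal(EMBED_DIM)
# A random square matrix is invertible with probability 1.
A = rng.standard_normal((EMBED_DIM, EMBED_DIM))

# Mix the axes of theta with A and compensate in e with the inverse transpose.
theta_mixed = A @ theta_i
e_mixed = np.linalg.inv(A).T @ e_j

original_score = float(theta_i @ e_j)
mixed_score = float(theta_mixed @ e_mixed)
```

Both parameter sets fit the data equally well, so training has no reason to prefer the one whose axes read as "gender" or "royal".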
# Applications using Word Embeddings
Sentiment Classification
A simple sentiment classification model averages the embedding vectors of all the words in the text and classifies from that average.
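A sketch of the averaging model, assuming a softmax output over 5 rating classes. The tiny vocabulary, the random weights, and the function names are illustrative assumptions, and no training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, NUM_CLASSES = 8, 5  # assumed sizes, e.g. ratings from 1 to 5 stars

vocab = {"the": 0, "dessert": 1, "is": 2, "excellent": 3}  # toy vocabulary
E = rng.standard_normal((EMBED_DIM, len(vocab)))           # one column per word
W = rng.standard_normal((NUM_CLASSES, EMBED_DIM)) * 0.1
b = np.zeros(NUM_CLASSES)


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()


def classify(sentence):
    """Average the word embeddings, then apply a softmax layer."""
    indices = [vocab[w] for w in sentence.lower().split()]
    average = E[:, indices].mean(axis=1)
    return softmax(W @ average + b)


probs = classify("The dessert is excellent")
reordered = classify("excellent is the dessert")
```

Because the average does not depend on word order, both sentences get the same prediction.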
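Using an RNN avoids a weakness of averaging, which ignores word order and so misjudges sentences where a negation reverses the meaning of otherwise positive words.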
Debiasing word embeddings
What is bias in word embeddings?