Sequence Models, Week 2

Introduction to Word Embeddings

Word Representation

One-hot representation

So far, we have used a vocabulary and a one-hot representation: $$Man(5391)= \begin{bmatrix} 0\\ \vdots \\ 0 \\ 1\\0\\ \vdots\\ 0 \end{bmatrix}=O_{5391}$$ $$Woman(9853)= \begin{bmatrix} 0\\ \vdots \\ 0 \\ 1\\0\\ \vdots\\ 0 \end{bmatrix}=O_{9853}$$ It's hard to capture the relationship between two words this way, because the inner product of any two different one-hot vectors is 0.

Featurized representation: word embedding

Instead of one-hot encoding, you can represent a word by a vector of feature values such as Gender, Royal, Age, Food, ..., Size, Cost.
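
As a quick illustration (the feature values below are made up for the sketch, not learned), one-hot vectors carry no similarity information, while featurized vectors do:

```python
import numpy as np

# One-hot vectors: the inner product of any two different words is always 0
vocab_size = 10000
o_man, o_woman = np.zeros(vocab_size), np.zeros(vocab_size)
o_man[5391], o_woman[9853] = 1, 1
print(o_man @ o_woman)                       # 0.0 -- no notion of similarity at all

# Hypothetical featurized embeddings over [Gender, Royal, Age, Food] (hand-picked values)
e = {
    "man":    np.array([-1.00, 0.01, 0.03, 0.09]),
    "woman":  np.array([ 1.00, 0.02, 0.02, 0.01]),
    "king":   np.array([-0.95, 0.93, 0.70, 0.02]),
    "apple":  np.array([ 0.00, 0.01, 0.03, 0.95]),
    "orange": np.array([ 0.01, 0.00, 0.02, 0.97]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(e["apple"], e["orange"]))       # close to 1: related words end up close together
print(cosine(e["apple"], e["king"]))         # close to 0: unrelated words stay far apart
```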

## Using word embeddings

Using transfer learning with word embeddings can be a good approach to solving NLP problems.

Here are the 3 steps to follow when considering this approach:

  1. Learn word embeddings from a large text corpus, or download a pre-trained embedding.
  2. Transfer the embedding to your new task, which typically has a much smaller labeled training set.
  3. Optionally, continue to fine-tune the word embeddings with the new data (usually only worthwhile when the new dataset is fairly large).

Properties of word embeddings

One interesting property is that if we take two vectors such as $e_{man}$ and $e_{woman}$, the difference $e_{man}-e_{woman}$ captures the main way in which the two words differ (here, gender). To turn this into an algorithm, if you want to find the word that corresponds to "king" in the same way "woman" corresponds to "man", you look for the word $w$ whose embedding $e_w$ is most similar to $e_{king}-e_{man}+e_{woman}$. Two common ways to calculate similarity (a short sketch follows the list below):

  1. **Cosine similarity** (most commonly used): $\mathrm{sim}(u,v)=\dfrac{u^{T}v}{\lVert u\rVert_2\,\lVert v\rVert_2}$
  2. **Euclidean distance**: $\lVert u-v\rVert^{2}$ (a measure of dissimilarity rather than similarity)
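
A minimal, self-contained sketch of the analogy computation, reusing the same kind of hypothetical hand-picked embeddings as above: pick the word whose embedding is most similar to $e_{king}-e_{man}+e_{woman}$.

```python
import numpy as np

def cosine_similarity(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(a, b, c, embeddings):
    """Return the word d (other than a, b, c) maximizing sim(e_d, e_b - e_a + e_c)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(candidates[w], target))

# Tiny hypothetical embedding table over 4 hand-picked features (not learned values)
emb = {
    "man":   np.array([-1.00, 0.01, 0.03, 0.09]),
    "woman": np.array([ 1.00, 0.02, 0.02, 0.01]),
    "king":  np.array([-0.95, 0.93, 0.70, 0.02]),
    "queen": np.array([ 0.97, 0.95, 0.69, 0.01]),
    "apple": np.array([ 0.00, 0.01, 0.03, 0.95]),
}
print(complete_analogy("man", "woman", "king", emb))   # expected: "queen"
```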

Embedding Matrix

You can stack the embedding vectors of all words into an embedding matrix $E$; multiplying $E$ by the one-hot vector of a word then extracts that word's embedding vector easily: $e_j = E\,O_j$.

But in practice, because most of that matrix-vector product is multiplication by zeros, you use a specialized lookup function (column indexing, or an embedding layer in a deep learning framework) to look up an embedding directly.
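
A minimal numpy sketch of both views, using the lecture's shapes (300-dimensional embeddings, 10,000-word vocabulary) and a random matrix purely for illustration:

```python
import numpy as np

vocab_size, embed_dim = 10000, 300
E = np.random.randn(embed_dim, vocab_size)   # embedding matrix (learned in practice, random here)

j = 6257                                     # index of some word in the vocabulary
o_j = np.zeros(vocab_size)
o_j[j] = 1                                   # one-hot vector O_j

e_via_matmul = E @ o_j                       # E . O_j: an O(embed_dim * vocab_size) multiplication
e_via_lookup = E[:, j]                       # column lookup: O(embed_dim), what frameworks actually do

assert np.allclose(e_via_matmul, e_via_lookup)
```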

Learning Word Embeddings: Word2vec & GloVe

Learning word embeddings

A learning model can look like: word embeddings => a hidden layer with parameters $W^{[1]}$ and $b^{[1]}$ => a softmax layer with parameters $W^{[2]}$ and $b^{[2]}$.

A common algorithm for learning the matrix $E$ is a neural language model: we want to predict the target word "juice" from the context words "a glass of orange". The context can vary from the last few words, to a few words on both the left and the right, to a single nearby word. Pairs of context/target words are important: they define the supervised prediction problem through which $E$ is learned.
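
A hedged Keras sketch of such a model (the framework choice, the 128-unit hidden layer, and the fixed 4-word context window are assumptions, not from the notes): the context word indices are embedded via $E$, concatenated, passed through the hidden layer $W^{[1]}, b^{[1]}$, and a softmax over the vocabulary $W^{[2]}, b^{[2]}$ predicts the target word.

```python
import tensorflow as tf

vocab_size, embed_dim, window = 10000, 300, 4    # sizes assumed to match the lecture's examples

model = tf.keras.Sequential([
    tf.keras.Input(shape=(window,), dtype="int32"),           # 4 context word indices
    tf.keras.layers.Embedding(vocab_size, embed_dim),         # the embedding matrix E (learned)
    tf.keras.layers.Flatten(),                                 # concatenate the 4 embeddings
    tf.keras.layers.Dense(128, activation="relu"),             # hidden layer: W[1], b[1]
    tf.keras.layers.Dense(vocab_size, activation="softmax"),   # softmax over targets: W[2], b[2]
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(context_word_indices, target_word_indices, ...)    # trained on context/target pairs
```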

Word2Vec Skip-gram model

Word2Vec is a simple way to learn word embeddings.

When applying skip-grams for context/target selection, we pick a context word and then randomly choose a target word within a window around it.

A problem with softmax classification: every time you need the probability of a target word, you have to compute a sum over the entire vocabulary in the denominator. When the vocabulary is large, this is very slow.
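
For reference, the skip-gram softmax models the probability of a target word $t$ given a context word $c$ (with the lecture's 10,000-word vocabulary) as

$$p(t \mid c)=\frac{e^{\theta_t^{T} e_c}}{\sum_{j=1}^{10000} e^{\theta_j^{T} e_c}}$$

where $e_c$ is the embedding of the context word and $\theta_t$ is the output parameter vector associated with target word $t$; the denominator is exactly the vocabulary-sized sum that makes each step expensive.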

So instead of using the full softmax, hierarchical softmax can be a better choice: a binary tree of classifiers narrows down which part of the vocabulary the word lies in, so the cost scales with the log of the vocabulary size rather than linearly.

For instance, the target can be chosen within a window of ±10 words around the context word. But choosing the context word is trickier: you should avoid sampling very frequent but uninformative words like "the", "and", "to", and "a" too heavily, or they will dominate the training pairs.

An even simpler solution to this problem is negative sampling.

Negative Sampling

We create a supervised learning problem: given a context/target pair as $x$, the label $y$ is 1 if the pair actually occurred together in the text and 0 otherwise. To form the dataset, each positive pair is matched with $k$ negative samples: $k$ = 5-20 for smaller datasets and $k$ = 2-5 for larger ones.

On each iteration we then only train the $k+1$ binary classifiers corresponding to the 1 correct pair and the $k$ (e.g. 4) sampled incorrect pairs, instead of updating a full softmax over the vocabulary. A small sketch follows.
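
A minimal sketch (the helper name is hypothetical) of how one observed pair becomes $k+1$ binary examples; each pair $(c,t)$ is then scored with the logistic model $P(y=1 \mid c,t)=\sigma(\theta_t^{T} e_c)$. Negatives are drawn uniformly here just to keep the sketch short; the paper's frequency-based distribution is sketched after the next paragraph.

```python
import numpy as np

def negative_sampling_examples(context_idx, target_idx, vocab_size, k=4, rng=None):
    """Turn one observed (context, target) pair into 1 positive + k negative binary examples."""
    rng = rng or np.random.default_rng()
    examples = [(context_idx, target_idx, 1)]           # the real pair gets label y = 1
    while len(examples) < k + 1:
        neg = int(rng.integers(vocab_size))             # uniform here for brevity; see next sketch
        if neg != target_idx:                           # don't accidentally reuse the true target
            examples.append((context_idx, neg, 0))      # a random "wrong" target gets label y = 0
    return examples

print(negative_sampling_examples(context_idx=6257, target_idx=4834, vocab_size=10000))
```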

To select the negative samples, the authors propose sampling word $w_i$ with probability proportional to its corpus frequency raised to the 3/4 power: $$P(w_i)=\frac{f(w_i)^{3/4}}{\sum_{j}f(w_j)^{3/4}}$$ which sits between the empirical word distribution and the uniform distribution.
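
A minimal sketch of that heuristic: raise the unigram counts to the 3/4 power and renormalize, which lands between the empirical word frequencies (too skewed toward "the", "of", ...) and the uniform distribution (too skewed toward rare words).

```python
import numpy as np

def negative_sampling_distribution(word_counts, power=0.75):
    """P(w_i) proportional to f(w_i)^(3/4), the heuristic from the word2vec paper."""
    counts = np.asarray(word_counts, dtype=float)
    weights = counts ** power
    return weights / weights.sum()

# Toy counts: a very frequent word (think "the"), a mid-frequency word, and a rare word
print(negative_sampling_distribution([1000, 100, 10]))   # frequent words damped, rare words boosted
```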

GloVe word vectors

We define the context/target pairs through a co-occurrence count: $X_{ij}$ = the number of times word $i$ appears in the context of word $j$ (e.g. within a ±10-word window).

The model minimizes $$\sum_{i}\sum_{j} f(X_{ij})\left(\theta_i^{T} e_j + b_i + b'_j - \log X_{ij}\right)^{2}$$ where the weighting term $f(X_{ij})$ is 0 when $X_{ij}=0$ (so the $\log 0$ terms drop out) and damps extremely frequent pairs. One caveat: you can't guarantee that any individual axis of the learned embedding is well aligned with an interpretable feature such as gender or royalty; the learned axes may be an arbitrary linear transformation of such features.
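
A numpy sketch of that objective; the specific weighting function below ($\min((x/x_{max})^{0.75}, 1)$ with $x_{max}=100$) follows the GloVe paper and is one valid choice, but the only property the notes rely on is $f(0)=0$.

```python
import numpy as np

def glove_loss(X, theta, e, b, b_prime, alpha=0.75, x_max=100.0):
    """Weighted least-squares GloVe objective.
    X: (V, V) co-occurrence counts, theta and e: (V, d) word vectors, b and b_prime: (V,) biases."""
    f = np.where(X > 0, np.minimum((X / x_max) ** alpha, 1.0), 0.0)   # weighting term, f(0) = 0
    with np.errstate(divide="ignore"):
        log_X = np.where(X > 0, np.log(X), 0.0)                       # log X_ij, masked where X_ij = 0
    diff = theta @ e.T + b[:, None] + b_prime[None, :] - log_X        # theta_i . e_j + b_i + b'_j - log X_ij
    return np.sum(f * diff ** 2)
```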

# Applications using Word Embeddings

Sentiment Classification

A simple sentiment classification model averages the embedding vectors of the words in the sentence and feeds the result into a softmax classifier over the sentiment labels.
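
A minimal numpy sketch, assuming a pre-trained embedding matrix `E` of shape (embedding dim, vocabulary size) and softmax parameters `W`, `b` (these names are hypothetical):

```python
import numpy as np

def average_embedding_classifier(word_indices, E, W, b):
    """Average the embeddings of the words in a review, then apply a softmax layer.
    E: (embed_dim, vocab_size), W: (num_classes, embed_dim), b: (num_classes,)."""
    e_avg = E[:, word_indices].mean(axis=1)      # average of the word embedding vectors
    logits = W @ e_avg + b
    exp = np.exp(logits - logits.max())          # numerically stable softmax over sentiment classes
    return exp / exp.sum()
```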

Using an RNN (a many-to-one architecture) instead takes word order into account, which avoids the failure mode of the averaging model on sentences where negation flips the meaning of repeated positive words (e.g. "completely lacking in good taste, good service, ...").
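
A hedged Keras sketch of the many-to-one RNN version (layer sizes and the 5-class output are assumptions): the LSTM reads the word embeddings in order, and only its final state feeds the softmax, so word order and negations can affect the prediction.

```python
import tensorflow as tf

vocab_size, embed_dim, num_classes = 10000, 300, 5   # e.g. 1-5 star ratings (assumed)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),              # variable-length word-index sequence
    tf.keras.layers.Embedding(vocab_size, embed_dim),          # ideally initialized from pre-trained E
    tf.keras.layers.LSTM(128),                                  # many-to-one: keep only the last state
    tf.keras.layers.Dense(num_classes, activation="softmax"),   # sentiment prediction
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```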

Debiasing word embeddings

What is bias in word embedding?
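
Here, bias means gender, ethnicity, or other social biases that the embeddings absorb from the training text (for example, completing "man : computer programmer" analogies with "woman : homemaker"), not the bias term $b$ of a model. The usual recipe is to identify a bias direction (e.g. $g = e_{he}-e_{she}$, possibly averaged over several such pairs), neutralize words that are not definitional for that attribute by removing their component along $g$, and equalize definitional pairs such as grandmother/grandfather. A minimal numpy sketch of the neutralize step:

```python
import numpy as np

def neutralize(e, g):
    """Remove the component of embedding e that lies along the bias direction g,
    leaving the word equidistant from e.g. 'he' and 'she' along that direction."""
    e_bias = (np.dot(e, g) / np.dot(g, g)) * g   # projection of e onto the bias direction
    return e - e_bias

# Usage (hypothetical vectors): e_doctor_debiased = neutralize(e_doctor, g)
```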