Word Vectors Explained (1)
Source: Internet · Editor: 程序博客网 · Time: 2024/06/07 02:13
In NLP we want to represent each word as a vector. There are several ways to do this.
1 One-hot Vector
Represent every word as a |V|-dimensional vector with a 1 at the word's index in the vocabulary and 0 everywhere else, where |V| is the size of the vocabulary.
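A minimal sketch of one-hot encoding in NumPy (the five-word vocabulary here is illustrative, not from the text):

```python
import numpy as np

# Illustrative vocabulary; any sorted word list works the same way.
vocab = ["cat", "jumped", "over", "puddle", "the"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the |V|-dimensional one-hot vector for a word."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

print(one_hot("over"))  # [0. 0. 1. 0. 0.]
```

Note that one-hot vectors carry no notion of similarity: the dot product of any two distinct word vectors is 0, which is what motivates the denser representations below.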
2 SVD Based Methods
2.1 Window based Co-occurrence Matrix
Representing a word by means of its neighbors.
In this method we count the number of times each word appears inside a window of a particular size around the word of interest.
For example, with window size 1 and the corpus “I like NLP. I like deep learning.”, the word “like” co-occurs twice with “I” and once each with “NLP” and “deep”.
The resulting matrix is too large, so we make it smaller with SVD:
- Generate the |V|×|V| co-occurrence matrix X.
- Apply SVD on X to get X = USV^T.
- Select the first k columns of U to get k-dimensional word vectors. The ratio (∑_{i=1}^{k} σ_i) / (∑_{i=1}^{|V|} σ_i) indicates the amount of variance captured by the first k dimensions.
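The steps above can be sketched in NumPy; the tiny corpus and window size 1 are assumptions for illustration:

```python
import numpy as np

# A tiny illustrative corpus (assumed for this sketch).
corpus = ["I like NLP .".split(), "I like deep learning .".split()]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Window-based co-occurrence counts with window size 1.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

# SVD: X = U S V^T; keep the first k columns of U as word vectors.
U, S, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k]
variance_captured = S[:k].sum() / S.sum()
```

Because the window is symmetric, X is symmetric; the singular values S tell us how much of the matrix each kept dimension explains.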
2.2 Shortcomings
SVD-based methods do not scale well to large matrices, and it is hard to incorporate new words or documents. The computational cost of SVD for an m×n matrix is O(mn²).
3 Iteration Based Methods - Word2Vec
3.1 Language Models (Unigrams, Bigrams, etc.)
We need a model that assigns a probability to a sequence of tokens.
For example
* The cat jumped over the puddle. —high probability
* Stock boil fish is toy. —low probability
Unigrams:
We can take the unigram language model approach and break apart this probability by assuming the word occurrences are completely independent:

P(w_1, w_2, …, w_n) = ∏_{i=1}^{n} P(w_i)

However, we know the next word is highly contingent upon the previous sequence of words, so this model performs poorly.
Bigrams:
We let the probability of the sequence depend on the pairwise probability of each word in the sequence and the word next to it:

P(w_1, w_2, …, w_n) = ∏_{i=2}^{n} P(w_i | w_{i-1})
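Both models can be estimated by simple counting; a minimal sketch on a toy corpus (the sentence here is illustrative):

```python
from collections import Counter

# Toy corpus (assumed); real language models use far more data.
tokens = "the cat jumped over the puddle".split()

# Unigram model: P(w) = count(w) / N, words assumed independent.
unigram = Counter(tokens)
N = len(tokens)
def p_unigram(w):
    return unigram[w] / N

# Bigram model: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
bigram = Counter(zip(tokens, tokens[1:]))
def p_bigram(prev, w):
    return bigram[(prev, w)] / unigram[prev]

# Probability of the whole sequence under each model:
p_uni = 1.0
for w in tokens:
    p_uni *= p_unigram(w)

p_bi = p_unigram(tokens[0])
for prev, w in zip(tokens, tokens[1:]):
    p_bi *= p_bigram(prev, w)
```

On this corpus the bigram model assigns the sentence a much higher probability than the unigram model, reflecting that it captures local word order.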
3.2 Continuous Bag of Words Model (CBOW)
Example Sequence:
“The cat jumped over the puddle.”
What is the Continuous Bag of Words Model?
We treat {“the”, “cat”, “over”, “puddle”} as the context, and the word “jumped” as the center word. The context should be able to predict the center word. We call this type of model a Continuous Bag of Words Model.
Known parameters:
If the index of the center word is c and the window size is m, the inputs of the model are the one-hot vectors of the context words, which we represent as x^(c−m), …, x^(c−1), x^(c+1), …, x^(c+m). The output is the one-hot vector of the center word, which we represent as y.
Parameters we need to learn:
Two matrices, V ∈ ℝ^{n×|V|} (the input word matrix) and U ∈ ℝ^{|V|×n} (the output word matrix), where n is the dimensionality of the embedding space. The i-th column of V is the input vector v_i of word w_i, and the i-th row of U is the output vector u_i of word w_i.
How does it work:
1. We get our embedded word vectors for the context: v_{c−m} = V x^(c−m), …, v_{c+m} = V x^(c+m).
2. Average these vectors: v̂ = (v_{c−m} + … + v_{c−1} + v_{c+1} + … + v_{c+m}) / 2m.
3. Generate a score vector: z = U v̂.
4. Turn the scores into probabilities: ŷ = softmax(z).
5. We desire our generated probabilities ŷ to match the true probabilities y, the one-hot vector of the actual center word.
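The five steps can be sketched as a NumPy forward pass; the vocabulary size, embedding dimension, and random initialization are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for this sketch: vocabulary of 5 words, 3-dim embeddings.
vocab_size, n = 5, 3
V = rng.normal(size=(n, vocab_size))   # input word matrix (columns = input vectors)
U = rng.normal(size=(vocab_size, n))   # output word matrix (rows = output vectors)

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

def cbow_forward(context_indices):
    """CBOW forward pass: embed context, average, score, softmax."""
    vs = V[:, context_indices]         # 1. look up context embeddings
    v_hat = vs.mean(axis=1)            # 2. average the context vectors
    z = U @ v_hat                      # 3. score against every output vector
    return softmax(z)                  # 4. probabilities over the vocabulary

y_hat = cbow_forward([0, 1, 3, 4])     # context = every word except index 2
```

Step 5 is then a matter of training: we want `y_hat` to put its mass on the true center word, which is what the loss in the next section enforces.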
How to learn U and V:
We learn them with stochastic gradient descent, so we need a loss function. We use cross-entropy to measure the distance between two distributions:

H(ŷ, y) = − ∑_{j=1}^{|V|} y_j log(ŷ_j)

Consider that y is a one-hot vector: only the entry at the center word's index c is 1, so the loss simplifies to H(ŷ, y) = − log(ŷ_c).

We formulate our optimization objective as:

minimize J = − log P(w_c | w_{c−m}, …, w_{c−1}, w_{c+1}, …, w_{c+m}) = − u_c^T v̂ + log ∑_{j=1}^{|V|} exp(u_j^T v̂)

We use stochastic gradient descent to update all relevant word vectors u_j and v_j.
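A minimal sketch of the loss and one SGD update for CBOW (sizes, learning rate, and random initialization are illustrative assumptions; context words are taken to be distinct):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, n, lr = 5, 3, 0.1
V = rng.normal(size=(n, vocab_size))   # input word matrix
U = rng.normal(size=(vocab_size, n))   # output word matrix

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sgd_step(context_indices, center):
    """Compute the cross-entropy loss and take one SGD step on U and V."""
    global U, V
    v_hat = V[:, context_indices].mean(axis=1)   # averaged context vector
    y_hat = softmax(U @ v_hat)                   # predicted distribution
    loss = -np.log(y_hat[center])                # H(y_hat, y) = -log(y_hat_c)
    dz = y_hat.copy()
    dz[center] -= 1.0                            # dJ/dz = y_hat - y
    dv_hat = U.T @ dz                            # gradient wrt v_hat
    U -= lr * np.outer(dz, v_hat)                # dJ/dU = dz v_hat^T
    # The averaging in v_hat spreads the gradient equally over context words.
    V[:, context_indices] -= lr * dv_hat[:, None] / len(context_indices)
    return loss

losses = [sgd_step([0, 1, 3, 4], center=2) for _ in range(50)]
```

Repeated steps drive the loss down, i.e. the model assigns increasing probability to the true center word given its context.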