CS224d Assignment 1 Solutions, Part (3/4)


I have split the Assignment 1 solutions into four parts, covering problems 1, 2, 3, and 4 respectively. This part contains the solution to problem 3.

3. word2vec (40 points + 5 bonus)

(a). (3 points) Assume you are given a predicted word vector $v_c$ corresponding to the center word $c$ for skip-gram, and word prediction is made with the softmax function found in word2vec models

$$\hat{y}_o = p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{W} \exp(u_w^T v_c)} \tag{4}$$

where $w$ denotes the $w$-th word and $u_w$ ($w = 1, \dots, W$) are the "output" word vectors for all words in the vocabulary. Assume cross entropy cost is applied to this prediction and word $o$ is the expected word (the $o$-th element of the one-hot label vector is one), derive the gradients with respect to $v_c$.
Hint: It will be helpful to use notation from question 2. For instance, letting $\hat{y}$ be the vector of softmax predictions for every word, $y$ the expected (one-hot) word vector, and the loss function

$$J_{\text{softmax-CE}}(o, v_c, U) = \mathrm{CE}(y, \hat{y}) \tag{5}$$

where $U = [u_1, u_2, \dots, u_W]$ is the matrix of all the output vectors. Make sure you state the orientation of your vectors and matrices.

Solution: Let the word vectors have dimension $n_{\text{dim}}$ and treat them as column vectors, so $v_c$ has dimension $n_{\text{dim}} \times 1$ and $U$ has dimension $n_{\text{dim}} \times W$. Write $\theta = U^T v_c$, so that $\hat{y} = \mathrm{softmax}(\theta)$. By the result of problem 2(b),

$$\frac{\partial J_{\text{softmax-CE}}}{\partial v_c} = \left(\frac{\partial \theta}{\partial v_c}\right)^{T} \frac{\partial J_{\text{softmax-CE}}}{\partial \theta} = U\,(\hat{y} - y)$$
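As a sanity check, here is a minimal NumPy sketch of this forward pass and gradient; the dimensions ($n_{\text{dim}} = 5$, $W = 10$) and the target index `o = 3` are made up for illustration and are not part of the assignment.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical toy dimensions: n_dim = 5, vocabulary size W = 10.
n_dim, W = 5, 10
rng = np.random.default_rng(0)
U = rng.standard_normal((n_dim, W))   # output vectors u_w as columns (n_dim x W)
v_c = rng.standard_normal(n_dim)      # center ("input") word vector
o = 3                                 # index of the expected word

y = np.zeros(W); y[o] = 1.0           # one-hot label vector
y_hat = softmax(U.T @ v_c)            # hat{y} = softmax(U^T v_c), shape (W,)

loss = -np.log(y_hat[o])              # cross-entropy against a one-hot label
grad_vc = U @ (y_hat - y)             # part (a): dJ/dv_c = U (hat{y} - y)
```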


(b). (3 points) As in the previous part, derive gradients for the "output" word vectors $u_w$ (including $u_o$).

Solution: As in (a), let $\theta = U^T v_c$. Then

$$\frac{\partial J_{\text{softmax-CE}}}{\partial U_{ij}} = \sum_k \frac{\partial J_{\text{softmax-CE}}}{\partial \theta_k}\,\frac{\partial \theta_k}{\partial U_{ij}} = \sum_k (\hat{y} - y)\big|_k\,\frac{\partial \theta_k}{\partial U_{ij}}$$

where $\theta_k = u_k^T v_c$, so

$$\frac{\partial \theta_k}{\partial U_{ij}} = \begin{cases} v_i & j = k \\ 0 & j \neq k \end{cases}$$

with $v_i$ denoting the $i$-th element of $v_c$. Therefore

$$\sum_k (\hat{y} - y)\big|_k\,\frac{\partial \theta_k}{\partial U_{ij}} = (\hat{y} - y)\big|_j\,v_i = \left. v_c\,(\hat{y} - y)^T \right|_{i,j}$$

so that

$$\frac{\partial J_{\text{softmax-CE}}}{\partial U} = v_c\,(\hat{y} - y)^T$$
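Packaging (a) and (b) into one helper and spot-checking one entry of the $U$ gradient with a forward difference; the function name and shapes below are my own, not the assignment's starter code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_ce_grads(o, v_c, U):
    """Softmax-CE loss plus gradients w.r.t. v_c and U.

    v_c has shape (n_dim,), U has shape (n_dim, W) with output vectors as columns.
    """
    y_hat = softmax(U.T @ v_c)
    y = np.zeros(U.shape[1]); y[o] = 1.0
    loss = -np.log(y_hat[o])
    grad_vc = U @ (y_hat - y)             # part (a)
    grad_U = np.outer(v_c, y_hat - y)     # part (b): v_c (hat{y} - y)^T
    return loss, grad_vc, grad_U

# Spot-check dJ/dU_{i,j} with a forward difference on a random toy example.
rng = np.random.default_rng(0)
U = rng.standard_normal((5, 10)); v_c = rng.standard_normal(5); o = 3
loss, _, grad_U = softmax_ce_grads(o, v_c, U)
i, j, eps = 2, 7, 1e-6
U[i, j] += eps
loss_eps, _, _ = softmax_ce_grads(o, v_c, U)
assert abs((loss_eps - loss) / eps - grad_U[i, j]) < 1e-4
```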


(c). (6 points) Repeat part (a) and (b) assuming we are using the negative sampling loss for the predicted vector $v_c$, and the expected output word is $o$. Assume that $K$ negative samples (words) are drawn, and they are $1, \dots, K$, respectively for simplicity of notation ($o \notin \{1, \dots, K\}$). Again, for a given word $o$, denote its output vector as $u_o$. The negative sampling loss function in this case is

$$J_{\text{neg-sample}}(o, v_c, U) = -\log\big(\sigma(u_o^T v_c)\big) - \sum_{k=1}^{K} \log\big(\sigma(-u_k^T v_c)\big) \tag{6}$$

where $\sigma(\cdot)$ is the sigmoid function.
After you’ve done this, describe with one sentence why this cost function is much more efficient to compute than the softmax-CE loss (you could provide a speed-up ratio, i.e. the runtime of the softmax-CE loss divided by the runtime of the negative sampling loss).
Note: the cost function here is the negative of what Mikolov et al had in their original paper, because we are doing a minimization instead of maximization in our code.

Solution: Let $S$ be the set of the $K$ sampled indices.

$$\frac{\partial J_{\text{neg-sample}}}{\partial v_c} = -\frac{\partial \log\sigma(u_o^T v_c)}{\partial v_c} - \sum_{i \in S} \frac{\partial \log\sigma(-u_i^T v_c)}{\partial v_c} = \big[\sigma(u_o^T v_c) - 1\big]\,u_o - \sum_{i \in S} \big[\sigma(-u_i^T v_c) - 1\big]\,u_i$$

$$\frac{\partial J_{\text{neg-sample}}}{\partial u_w} = -\frac{\partial \log\sigma(u_o^T v_c)}{\partial u_w} - \sum_{i \in S} \frac{\partial \log\sigma(-u_i^T v_c)}{\partial u_w} = \begin{cases} \big[\sigma(u_o^T v_c) - 1\big]\,v_c & w = o \\ \big[1 - \sigma(-u_w^T v_c)\big]\,v_c & w \in S,\ w \neq o \\ 0 & w \notin S \cup \{o\} \end{cases}$$

Equation (6) is much faster to compute than (5) because $\dfrac{\text{runtime of softmax-CE}}{\text{runtime of negative sampling loss}} = \dfrac{O(W)}{O(K)}$: softmax-CE touches all $W$ output vectors, whereas negative sampling only touches the $K + 1$ sampled ones. (I am not sure this statement is entirely precise; corrections are welcome.)
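A NumPy sketch of the negative-sampling cost and both gradients derived above; the sampled indices are arbitrary toy values (including a repeat, since sampling with replacement can repeat a word), and this is not the assignment's starter code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical setup: n_dim = 5, vocabulary of 10 words, K = 4 negative samples.
n_dim, W, K = 5, 10, 4
rng = np.random.default_rng(0)
U = rng.standard_normal((n_dim, W))       # output vectors u_w as columns
v_c = rng.standard_normal(n_dim)          # center word vector
o = 3                                     # expected (positive) word, not among the negatives
neg = [1, 7, 7, 9]                        # K sampled negative indices (repeats allowed)

pos_score = sigmoid(U[:, o] @ v_c)        # sigma(u_o^T v_c)
neg_scores = sigmoid(-U[:, neg].T @ v_c)  # sigma(-u_k^T v_c) for each negative sample

loss = -np.log(pos_score) - np.log(neg_scores).sum()

# Gradients from part (c); only K + 1 columns of U are touched, hence the speed-up.
grad_vc = (pos_score - 1.0) * U[:, o] + U[:, neg] @ (1.0 - neg_scores)
grad_U = np.zeros_like(U)
grad_U[:, o] += (pos_score - 1.0) * v_c
for k, s in zip(neg, neg_scores):         # accumulate so repeated samples add up
    grad_U[:, k] += (1.0 - s) * v_c
```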


(d). (8 points) Derive gradients for all of the word vectors for skip-gram and CBOW given the previous parts and given a set of context words $[\text{word}_{c-m}, \dots, \text{word}_{c-1}, \text{word}_c, \text{word}_{c+1}, \dots, \text{word}_{c+m}]$, where $m$ is the context size. Denote the "input" and "output" word vectors for $\text{word}_k$ as $v_k$ and $u_k$ respectively.
Hint: feel free to use $F(o, v_c)$ (where $o$ is the expected word) as a placeholder for the $J_{\text{softmax-CE}}(o, v_c, \dots)$ or $J_{\text{neg-sample}}(o, v_c, \dots)$ cost functions in this part; you'll see that this is a useful abstraction for the coding part. That is, your solution may contain terms of the form $\partial F(o, v_c) / \partial \cdots$.
Recall that for skip-gram, the cost for a context centered around $c$ is

$$J_{\text{skip-gram}}(\text{word}_{c-m \dots c+m}) = \sum_{-m \le j \le m,\ j \neq 0} F(w_{c+j}, v_c) \tag{7}$$

where $w_{c+j}$ refers to the word at the $j$-th index from the center.
CBOW is slightly different. Instead of using $v_c$ as the predicted vector, we use $\hat{v}$ defined below. For (a simpler variant of) CBOW, we sum up the input word vectors in the context

$$\hat{v} = \sum_{-m \le j \le m,\ j \neq 0} v_{c+j} \tag{8}$$

then the CBOW cost is

$$J_{\text{CBOW}}(\text{word}_{c-m \dots c+m}) = F(w_c, \hat{v}) \tag{9}$$

Note: To be consistent with the $\hat{v}$ notation, such as for the code portion, for skip-gram $\hat{v} = v_c$.

Solution: Let $v_k$ and $u_k$ be the input and output vectors of word $k$, respectively.

Answer for skip-gram:

$$\frac{\partial J_{\text{skip-gram}}(\text{word}_{c-m \dots c+m})}{\partial v_k} = \sum_{-m \le j \le m,\ j \neq 0} \frac{\partial F(w_{c+j}, v_c)}{\partial v_k}$$

$$\frac{\partial J_{\text{skip-gram}}(\text{word}_{c-m \dots c+m})}{\partial u_k} = \sum_{-m \le j \le m,\ j \neq 0} \frac{\partial F(w_{c+j}, v_c)}{\partial u_k}$$

where $w_{c+j}$ is the one-hot vector of the word at the $j$-th position from the center. (Note that $\partial F(w_{c+j}, v_c) / \partial v_k = 0$ whenever $k \neq c$, so only $v_c$ actually receives a gradient.)

Answer for CBOW:

$$\frac{\partial J_{\text{CBOW}}(\text{word}_{c-m \dots c+m})}{\partial v_k} = \frac{\partial F(w_c, \hat{v})}{\partial v_k} = \frac{\partial F(w_c, \hat{v})}{\partial \hat{v}}\,\frac{\partial \hat{v}}{\partial v_k} = \begin{cases} \dfrac{\partial F(w_c, \hat{v})}{\partial \hat{v}} & k \in \{c-m, \dots, c-1, c+1, \dots, c+m\} \\ 0 & \text{otherwise} \end{cases}$$

$$\frac{\partial J_{\text{CBOW}}(\text{word}_{c-m \dots c+m})}{\partial u_k} = \frac{\partial F(w_c, \hat{v})}{\partial u_k}$$

P.S.: This answer feels quite simple; why is it worth 8 points?
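In any case, here is a toy NumPy sketch of how these sums are accumulated in code, with the softmax-CE cost from parts (a)/(b) plugged in as the $F$ placeholder; all indices, dimensions, and names are made up for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def F(target, v_hat, U):
    """Softmax-CE cost and gradients from (a)/(b), used as the F(o, v) placeholder."""
    y_hat = softmax(U.T @ v_hat)
    y = np.zeros(U.shape[1]); y[target] = 1.0
    return -np.log(y_hat[target]), U @ (y_hat - y), np.outer(v_hat, y_hat - y)

# Hypothetical toy setup: 5-dim vectors, 10-word vocabulary, context size m = 2.
n_dim, W = 5, 10
rng = np.random.default_rng(0)
V = rng.standard_normal((n_dim, W))    # "input" vectors v_k as columns
U = rng.standard_normal((n_dim, W))    # "output" vectors u_k as columns
c, context = 4, [2, 8, 1, 6]           # center word index and its 2m context indices

# Skip-gram: sum F over the 2m context words, always predicting from v_c.
sg_loss, sg_grad_V, sg_grad_U = 0.0, np.zeros_like(V), np.zeros_like(U)
for t in context:
    l, gv, gU = F(t, V[:, c], U)
    sg_loss += l
    sg_grad_V[:, c] += gv              # only the center column of V receives a gradient
    sg_grad_U += gU

# CBOW: a single call to F with v_hat equal to the sum of the context input vectors.
v_hat = V[:, context].sum(axis=1)
cbow_loss, gv, cbow_grad_U = F(c, v_hat, U)
cbow_grad_V = np.zeros_like(V)
for t in context:
    cbow_grad_V[:, t] += gv            # d v_hat / d v_k = I for every context word k
```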


(e)(f)(g)(h). See the code; omitted here.
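The coding parts boil down to implementing the formulas above and validating them numerically. A generic centered-difference gradient checker, sketched here as my own helper rather than the assignment's gradcheck function, can be used for that; the example at the bottom checks the part (a) gradient on a random toy case.

```python
import numpy as np

def grad_check(f, x, analytic_grad, eps=1e-5, tol=1e-4):
    """Compare an analytic gradient with centered differences of a scalar cost f(x)."""
    it = np.nditer(x, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + eps
        f_plus = f(x)
        x[idx] = old - eps
        f_minus = f(x)
        x[idx] = old                          # restore the parameter
        numeric = (f_plus - f_minus) / (2 * eps)
        assert abs(numeric - analytic_grad[idx]) < tol, f"mismatch at {idx}"
        it.iternext()

# Example: check dJ_softmax-CE/dv_c for a random toy case (shapes are arbitrary).
rng = np.random.default_rng(0)
U = rng.standard_normal((5, 10)); v_c = rng.standard_normal(5); o = 3

def cost(v):
    scores = U.T @ v                          # J = -s_o + log(sum_w exp(s_w)), stabilized
    return -scores[o] + np.log(np.exp(scores - scores.max()).sum()) + scores.max()

y = np.zeros(10); y[o] = 1.0
y_hat = np.exp(U.T @ v_c - (U.T @ v_c).max()); y_hat /= y_hat.sum()
grad_check(cost, v_c, U @ (y_hat - y))
print("gradient check passed")
```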


Attached below is the plot I got after running q3_run.py; there is a discussion on reddit about how to tell whether this plot looks reasonable:

[Figure: resulting word vectors from q3_run.py]