CS224d Assignment 1 Solutions, Part (2/4)


I have split the solutions to Assignment 1 into four parts, covering Problems 1 through 4. This part contains the solution to Problem 2.

2. Neural Network Basics (30 points)

(a). (3 points) Derive the gradients of the sigmoid function and show that it can be rewritten as a function of the function value (i.e. in some expression where only σ(x), but not x, is present). Assume that the input x is a scalar for this question. Recall, the sigmoid function is

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{2}$$

Solution:

$$\sigma'(x) = -\left(1 + e^{-x}\right)^{-2}\cdot\left(-e^{-x}\right) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} = \frac{1}{1 + e^{-x}}\left(1 - \frac{1}{1 + e^{-x}}\right) = \sigma(x)\left[1 - \sigma(x)\right]$$
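
A minimal NumPy sketch of this result (the function and variable names are illustrative, not necessarily those of the assignment's starter code); note that the gradient is computed purely from the function value $\sigma(x)$, as the derivation shows:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(s):
    # takes the function value s = sigma(x), not x itself,
    # and returns sigma'(x) = s * (1 - s)
    return s * (1.0 - s)

# quick check against a central finite difference
x, eps = 0.5, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
assert abs(sigmoid_grad(sigmoid(x)) - numeric) < 1e-8
```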


(b). (3 points) Derive the gradient with regard to the inputs of a softmax function when cross entropy loss is used for evaluation, i.e. find the gradients with respect to the softmax input vector $\theta$, when the prediction is made by $\hat{y} = \mathrm{softmax}(\theta)$. Remember the cross entropy function is

$$CE(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i) \tag{3}$$

where $y$ is the one-hot label vector, and $\hat{y}$ is the predicted probability vector for all classes. (Hint: you might want to consider the fact that many elements of $y$ are zeros, and assume that only the $k$-th dimension of $y$ is one.)

Solution: Following the hint, assume the $k$-th entry of $y$ is 1 and all other entries are 0, i.e. $y_k = 1$. Then:

$$CE(y, \hat{y}) = -y_k \log(\hat{y}_k) = -\log(\hat{y}_k)$$

For the $i$-th element $\theta_i$ of $\theta$, we have:

$$\frac{\partial CE(y, \hat{y})}{\partial \theta_i} = -\frac{\partial}{\partial \theta_i}\log\frac{e^{\theta_k}}{\sum_j e^{\theta_j}} = -\frac{\partial}{\partial \theta_i}\left(\theta_k - \log\sum_j e^{\theta_j}\right) = \frac{\partial}{\partial \theta_i}\log\sum_j e^{\theta_j} - \frac{\partial \theta_k}{\partial \theta_i} = \begin{cases}\hat{y}_i & i \neq k \\ \hat{y}_i - 1 & i = k\end{cases}$$

Therefore

$$\frac{\partial CE(y, \hat{y})}{\partial \theta} = \hat{y} - y$$
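
The result $\partial CE / \partial \theta = \hat{y} - y$ can be checked numerically with a small NumPy sketch (the values and names below are purely illustrative):

```python
import numpy as np

def softmax(theta):
    # shift by the max for numerical stability; the result is unchanged
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

theta = np.array([1.0, 2.0, -0.5, 0.3])   # softmax inputs
y = np.array([0.0, 1.0, 0.0, 0.0])        # one-hot label, y_k = 1

y_hat = softmax(theta)
loss = -np.sum(y * np.log(y_hat))          # reduces to -log(y_hat[k])
grad = y_hat - y                           # dCE/dtheta, as derived above

# central finite-difference check of each component of the gradient
eps = 1e-6
for i in range(len(theta)):
    tp, tm = theta.copy(), theta.copy()
    tp[i] += eps
    tm[i] -= eps
    num = (-np.sum(y * np.log(softmax(tp)))
           + np.sum(y * np.log(softmax(tm)))) / (2 * eps)
    assert abs(grad[i] - num) < 1e-5
```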


(c). (6 points) Derive the gradients with respect to the inputs $x$ to a one-hidden-layer neural network (that is, find $\frac{\partial J}{\partial x}$ where $J$ is the cost function for the neural network). The neural network employs a sigmoid activation function for the hidden layer and softmax for the output layer. Assume the one-hot label vector is $y$, and cross entropy cost is used. (Feel free to use $\sigma'(x)$ as shorthand for the sigmoid gradient, and feel free to define any variables whenever you see fit.)
[Figure: one-layer perceptron]
Recall that the forward propagation is as follows

$$h = \mathrm{sigmoid}(xW_1 + b_1), \qquad \hat{y} = \mathrm{softmax}(hW_2 + b_2)$$

Note that here we're assuming that the input vector (thus the hidden variables and output probabilities) is a row vector to be consistent with the programming assignment. When we apply the sigmoid function to a vector, we are applying it to each of the elements of that vector. $W_i$ and $b_i$ $(i = 1, 2)$ are the weights and biases, respectively, of the two layers.

Solution: Assume the $k$-th entry of $y$ is 1 and all other entries are 0, i.e. $y_k = 1$. Then:

$$J = -y_k \log(\hat{y}_k) = -\log(\hat{y}_k)$$

Let $\theta_2 = hW_2 + b_2$, so that $\hat{y} = \mathrm{softmax}(\theta_2)$. Denote the $i$-th element of $\theta_2$ by $\theta^{(2)}_i$ and the entry in row $i$, column $j$ of $W_2$ by $W^{(2)}_{ij}$. Then:

$$\frac{\partial J}{\partial h_i} = \sum_j \frac{\partial J}{\partial \theta^{(2)}_j}\,\frac{\partial \theta^{(2)}_j}{\partial h_i} = \sum_j (\hat{y}_j - y_j)\, W^{(2)}_{ij} = \left.(\hat{y} - y)\, W_2^{T}\right|_i$$

Here $\partial \theta^{(2)}_j / \partial h_i = W^{(2)}_{ij}$: indeed, using the Einstein summation convention, $\theta^{(2)}_j = h_i W^{(2)}_{ij} + b^{(2)}_j$, from which $\partial \theta^{(2)}_j / \partial h_i = W^{(2)}_{ij}$ follows. The factor $\partial J / \partial \theta_2 = \hat{y} - y$ is the result of part (b).

Let $\theta_1 = xW_1 + b_1$, so that $h = \sigma(\theta_1)$. Denote the $i$-th element of $\theta_1$ by $\theta^{(1)}_i$ and the entry in row $i$, column $j$ of $W_1$ by $W^{(1)}_{ij}$. Then:

$$\frac{\partial J}{\partial \theta^{(1)}_i} = \sum_j \frac{\partial J}{\partial h_j}\,\frac{\partial h_j}{\partial \theta^{(1)}_i} = \frac{\partial J}{\partial h_i}\,\frac{\partial h_i}{\partial \theta^{(1)}_i} = \left.(\hat{y} - y)\, W_2^{T}\right|_i \cdot \left.\sigma'(\theta_1)\right|_i$$

Consequently:

$$\frac{\partial J}{\partial x_i} = \sum_j \frac{\partial J}{\partial \theta^{(1)}_j}\,\frac{\partial \theta^{(1)}_j}{\partial x_i} = \sum_j \frac{\partial J}{\partial \theta^{(1)}_j}\, W^{(1)}_{ij} = \left.\left(\left((\hat{y} - y)\, W_2^{T}\right) \circ \sigma'(\theta_1)\right) W_1^{T}\right|_i$$

where $\circ$ denotes the element-wise (Hadamard) product. (A small gripe: only 6 points for such a tedious derivation feels stingy.)
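
A minimal NumPy sketch of the forward and backward passes derived above (the dimensions, random initialization, and variable names are chosen for illustration and are not taken from the assignment's starter code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# illustrative sizes: Dx input dims, H hidden units, Dy classes
rng = np.random.RandomState(0)
Dx, H, Dy = 4, 5, 3
x = rng.randn(1, Dx)                      # row vector, as in the assignment
W1, b1 = rng.randn(Dx, H), rng.randn(1, H)
W2, b2 = rng.randn(H, Dy), rng.randn(1, Dy)
y = np.array([[0.0, 1.0, 0.0]])           # one-hot label

# forward propagation
theta1 = x.dot(W1) + b1
h = sigmoid(theta1)
theta2 = h.dot(W2) + b2
y_hat = softmax(theta2)
J = -np.sum(y * np.log(y_hat))

# backward propagation, mirroring the derivation above
delta2 = y_hat - y                        # dJ/dtheta2
dh = delta2.dot(W2.T)                     # dJ/dh
delta1 = dh * h * (1.0 - h)               # dJ/dtheta1 = dJ/dh * sigma'(theta1), element-wise
dx = delta1.dot(W1.T)                     # dJ/dx

# central finite-difference check of dJ/dx
eps = 1e-6
for i in range(Dx):
    xp, xm = x.copy(), x.copy()
    xp[0, i] += eps
    xm[0, i] -= eps
    Jp = -np.sum(y * np.log(softmax(sigmoid(xp.dot(W1) + b1).dot(W2) + b2)))
    Jm = -np.sum(y * np.log(softmax(sigmoid(xm.dot(W1) + b1).dot(W2) + b2)))
    assert abs(dx[0, i] - (Jp - Jm) / (2 * eps)) < 1e-5
```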


(d). (2 points) How many parameters are there in this neural network, assuming the input is $D_x$-dimensional, the output is $D_y$-dimensional, and there are $H$ hidden units?
Solution: $W_1$ has shape $D_x \times H$, $b_1$ has shape $1 \times H$, $W_2$ has shape $H \times D_y$, and $b_2$ has shape $1 \times D_y$. So in total there are $D_x H + H + H D_y + D_y = (D_x + 1)H + (H + 1)D_y$ parameters.
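
A quick sanity check of the count, using assumed illustrative dimensions (not from the assignment):

```python
import numpy as np

Dx, H, Dy = 10, 5, 4                            # illustrative dimensions
W1, b1 = np.zeros((Dx, H)), np.zeros((1, H))
W2, b2 = np.zeros((H, Dy)), np.zeros((1, Dy))

n_params = sum(p.size for p in (W1, b1, W2, b2))
assert n_params == Dx * H + H + H * Dy + Dy     # = (Dx + 1) * H + (H + 1) * Dy
```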


(e)(f)(g). See the code; omitted here.

