Solutions to The Elements of Statistical Learning (Ch. 8 Exercises)


Preface

If you find any errata or have a good idea, please contact me via tongust@163.com.

Ex. 8.1

First of all, we need the non-negativity of the Kullback-Leibler divergence; here I give a brief derivation.
To prove it, we use Jensen's inequality:

$\int p(x)\,f(x)\,dx \ge f\!\left(\int p(x)\,x\,dx\right)$ ......(1)

The constraint formula (1) must satisfy is that $f(x)$ be a convex function on a convex set.
In this case, we take the convex function $f(x) = -\ln(x)$ (note that $\ln(x)$ itself is concave), which, after multiplying through by $-1$, gives the following property:
$\int p(x)\ln(x)\,dx \le \ln\!\left(\int p(x)\,x\,dx\right)$ ......(2)

Substituting $x = \frac{q(x)}{p(x)}$ into (2):
$\int p(x)\ln\!\left(\frac{q(x)}{p(x)}\right)dx \le \ln\!\left(\int p(x)\,\frac{q(x)}{p(x)}\,dx\right)$ ......(3)

$\int p(x)\ln\!\left(\frac{q(x)}{p(x)}\right)dx \le \ln\!\left(\int q(x)\,dx\right)$ ......(4)

Since $q(x)$ is a probability density, $\int q(x)\,dx = 1$.
Therefore, since $\ln\!\left(\int q(x)\,dx\right) = \ln 1 = 0$, negating (4) yields the non-negativity of the KL divergence:
$D_{KL}(p\,\|\,q) = -\int p(x)\ln\!\left(\frac{q(x)}{p(x)}\right)dx = \int p(x)\ln\!\left(\frac{p(x)}{q(x)}\right)dx \ge 0$ ......(5)

We have thus shown that (8.61) is maximized as a function of $r(y)$ when $r(y) = q(y)$. Now take $q(y) = \Pr(Z^m \mid Z, \theta)$ and $r(y) = \Pr(Z^m \mid Z, \theta')$, and recall that $R(\theta', \theta) = E\!\left[\log \Pr(Z^m \mid Z, \theta') \mid Z, \theta\right]$. Then (5) gives
$R(\theta, \theta) - R(\theta', \theta) = D_{KL}\!\left(\Pr(Z^m \mid Z, \theta)\,\|\,\Pr(Z^m \mid Z, \theta')\right) \ge 0,$
hence $R(\theta', \theta) \le R(\theta, \theta)$.
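As a quick sanity check of (5), the following sketch draws random pairs of discrete distributions and confirms that the divergence is never negative (the 10-point support and the Dirichlet sampling are arbitrary choices of mine, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """Discrete D_KL(p || q) = sum_x p(x) * ln(p(x) / q(x))."""
    return np.sum(p * np.log(p / q))

# Random pairs of distributions on 10 points: (5) says D_KL >= 0,
# with equality exactly when p == q.
for _ in range(1000):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    assert kl(p, q) >= 0.0

print(kl(p, q))  # strictly positive for p != q
print(kl(p, p))  # 0.0: equality holds at q = p
```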

Ex. 8.2

Off topic

Since this exercise is based on a single paper [1], I would say that our lovely ESL authors simply overestimated poor readers like me, taking us for excellent mathematicians. ;)

Proof

I will use the notation of [1] instead of ESL's, which I find a bit confusing (denote $Z^m$ by $y$).
We want to prove: for a fixed value $\theta$, there is a unique distribution, $P_\theta$, given by $P_\theta(y) = P(y \mid z, \theta)$, which maximizes the log-likelihood (8.48).
Following the hint, we use a Lagrange multiplier to enforce normalization and write the Lagrangian:

$L(P(y), \lambda) = \sum_{i=1}^{n} P(y_i)\ln P_\theta(z, y_i) - \sum_{i=1}^{n} P(y_i)\ln P(y_i) + \lambda\!\left(1 - \sum_{i=1}^{n} P(y_i)\right)$ ......(1)

To find the stationary points, we set the derivative of $L(P(y), \lambda)$ with respect to each $P(y_i)$ $(i = 1, 2, \dots, n)$ to zero:
$\frac{\partial L}{\partial P(y_i)} = \ln P_\theta(z, y_i) - \ln P(y_i) - 1 - \lambda = 0$ ......(2)

Rearranging:
$\lambda = \ln P(z, y_i \mid \theta) - \ln P(y_i) - 1$ ......(3)

$P(y_i) = \exp(-1 - \lambda)\,P(z, y_i \mid \theta)$ ......(4)

for $i = 1, 2, \dots, n$.

From (4), it follows that $P(y)$ must be proportional to $P_\theta(z, y) = P(y, z \mid \theta)$. We also require $\sum_y P(y) = 1$.
Summing (4) over the $y_i$, we see:
$1 = \sum_y P(y) = \exp(-1 - \lambda)\sum_y P(y, z \mid \theta)$ ......(5)

$\sum_y P(y, z \mid \theta) = P(z \mid \theta) = \frac{1}{\exp(-1 - \lambda)}$ ......(6)

$\exp(-1 - \lambda) = \frac{1}{P(z \mid \theta)}$ ......(7)

Substituting (7) into (4):
$P(y_i) = \frac{P(z, y_i \mid \theta)}{P(z \mid \theta)} = P(y_i \mid z, \theta)$ ......(8)

which is exactly the posterior $P_\theta$, completing the proof.
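The following sketch checks (8) numerically: for a tiny discrete model it evaluates the objective in (1) (without the multiplier term) on random distributions over $y$ and confirms that none beats the posterior. The three-point support and the joint probabilities are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint P(z, y_i | theta) over three values of y, with z and theta
# fixed; the numbers are arbitrary and need not sum to one over y.
joint = np.array([0.10, 0.25, 0.05])
posterior = joint / joint.sum()          # P(y_i | z, theta), as in (8)

def F(P):
    """Objective (1) without the multiplier term:
    sum_i P(y_i) ln P_theta(z, y_i) - sum_i P(y_i) ln P(y_i)."""
    return np.sum(P * np.log(joint)) - np.sum(P * np.log(P))

# No random distribution over y exceeds the posterior's objective value.
best = F(posterior)
for _ in range(1000):
    P = rng.dirichlet(np.ones(3))
    assert F(P) <= best + 1e-12

print(best, np.log(joint.sum()))  # both equal ln P(z | theta)
```

At the optimum the objective equals $\ln P(z \mid \theta)$, which is Neal and Hinton's observation that the free-energy objective touches the log-likelihood when $P(y)$ is the posterior.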

Ex. 8.3

Ex. 8.4

Ex. 8.5

Ex. 8.6

Ex. 8.7

Proof that f(x) is non-decreasing under update (8.63)

From (8.62), $g(x^{s+1}, x^s) \le f(x^{s+1})$ and $g(x^s, x^s) = f(x^s)$; and since update (8.63) chooses $x^{s+1}$ to maximize $g(\cdot, x^s)$, we also have $g(x^s, x^s) \le g(x^{s+1}, x^s)$. Chaining these:

$f(x^{s+1}) \ge g(x^{s+1}, x^s) \ge g(x^s, x^s) = f(x^s)$ ......(1)
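Here is a minimal MM illustration of (1), on an example of my own (not from ESL): maximizing $f(x) = \cos(x)$ with the quadratic minorizer obtained from Taylor's theorem and $|\cos''| \le 1$.

```python
import numpy as np

# Maximize f(x) = cos(x).  By Taylor's theorem with |cos''| <= 1,
#   cos(x) >= cos(y) - sin(y) * (x - y) - 0.5 * (x - y)**2 = g(x, y)
# for all x, with equality at x = y, so g minorizes f as in (8.62).
# Maximizing g(x, y) over x (update 8.63) gives x_{s+1} = x_s - sin(x_s).
f = np.cos

x = 2.0                      # arbitrary starting point
values = [f(x)]
for _ in range(20):
    x = x - np.sin(x)        # x_{s+1} = argmax_x g(x, x_s)
    values.append(f(x))

# f(x_s) is non-decreasing, exactly as (1) predicts.
assert all(a <= b + 1e-12 for a, b in zip(values, values[1:]))
print(x, values[-1])         # x -> 0, f(x) -> 1, a maximum of cos
```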

Proof that the EM algorithm (Sec. 8.5.2) is an example of an MM algorithm

This exercise asks us to show the following:

$Q(\theta', \theta) + \log \Pr(Z \mid \theta) - Q(\theta, \theta) \le \ell(\theta'; Z)$ ......(2)

i.e. that the left-hand side, call it $g(\theta', \theta)$, minorizes the observed-data log-likelihood $\ell(\theta'; Z) = \log \Pr(Z \mid \theta')$.

On one hand, applying (8.46) at $\theta' = \theta$, we can write:
$\log \Pr(Z \mid \theta) = Q(\theta, \theta) - R(\theta, \theta)$ ......(3)

Hence, the left-hand side (l.h.s.) of (2) simplifies to:
$Q(\theta', \theta) + Q(\theta, \theta) - R(\theta, \theta) - Q(\theta, \theta) = Q(\theta', \theta) - R(\theta, \theta)$ ......(4)

On the other hand, also from (8.46), the r.h.s. of (2) can be written as:
$\ell(\theta'; Z) = Q(\theta', \theta) - R(\theta', \theta)$ ......(5)

From Ex. 8.1, we see:
$R(\theta', \theta) \le R(\theta, \theta)$ ......(6)

$-R(\theta, \theta) \le -R(\theta', \theta)$ ......(7)

$Q(\theta', \theta) - R(\theta, \theta) \le Q(\theta', \theta) - R(\theta', \theta)$ ......(8)

Finally, comparing (4) with (5) through (8) establishes (2). Moreover, at $\theta' = \theta$ both sides of (2) equal $\log \Pr(Z \mid \theta)$, so $g(\theta', \theta)$ minorizes $\ell(\theta'; Z)$ with equality at the current iterate; since the EM step maximizes $Q(\theta', \theta)$, and hence $g(\theta', \theta)$, over $\theta'$, EM is an MM algorithm. This finishes our demonstration.
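To make (2) concrete, here is a numerical check on a toy model of my own construction (a single observation from a two-component Bernoulli mixture; the emission probabilities and the current iterate are arbitrary): the minorizer never exceeds $\ell(\theta')$ on a grid, and touches it at $\theta' = \theta$.

```python
import numpy as np

# Toy model: latent zm ~ Bernoulli(theta), observed z | zm ~ Bernoulli(p[zm]).
p = np.array([0.2, 0.8])   # emission probabilities for zm = 0 and zm = 1
z = 1                      # the single observed value

def joint(theta):
    """Pr(z, zm | theta) for zm = 0, 1."""
    lik = p**z * (1 - p)**(1 - z)          # Pr(z | zm)
    return np.array([1 - theta, theta]) * lik

def ell(theta):
    """Observed-data log-likelihood: log Pr(z | theta)."""
    return np.log(joint(theta).sum())

def Q(t_new, t_old):
    """Q(theta', theta) = E[log Pr(z, zm | theta') | z, theta], as in (8.46)."""
    post = joint(t_old) / joint(t_old).sum()   # Pr(zm | z, theta)
    return np.sum(post * np.log(joint(t_new)))

theta = 0.3                                    # current iterate
for t_new in np.linspace(0.01, 0.99, 99):
    g = Q(t_new, theta) + ell(theta) - Q(theta, theta)   # l.h.s. of (2)
    assert g <= ell(t_new) + 1e-12             # the minorization (2)

# Equality at theta' = theta: g(theta, theta) = ell(theta).
print(Q(theta, theta) + ell(theta) - Q(theta, theta), ell(theta))
```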

Reference

[1] Neal, Radford M., and G. E. Hinton. "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants." Learning in Graphical Models. Springer Netherlands, 2000, pp. 355-368.
