Solutions to The Elements of Statistical Learning (Ch. 8 Exercises)


Preface

If you find any errata or have a good idea, please contact me via tongust@163.com.

Ex. 8.1

First of all, we need the non-negativity of the Kullback-Leibler divergence; here I give a brief derivation.
To prove it, we use Jensen's inequality:

$\int p(x)\,f(x)\,dx \ge f\!\left(\int p(x)\,x\,dx\right)$ ......(1)

The constraint formula (1) must satisfy is that $f(x)$ be a convex function on a convex set.
In this case, we take the convex function $f(x) = -\ln(x)$ (note that $\ln(x)$ itself is concave), which, after multiplying through by $-1$, gives the following property:
$\int p(x)\ln(x)\,dx \le \ln\!\left(\int p(x)\,x\,dx\right)$ ......(2)

Substituting $x = \frac{q(x)}{p(x)}$ into (2):
$\int p(x)\ln\!\left(\frac{q(x)}{p(x)}\right)dx \le \ln\!\left(\int p(x)\,\frac{q(x)}{p(x)}\,dx\right)$ ......(3)

$\int p(x)\ln\!\left(\frac{q(x)}{p(x)}\right)dx \le \ln\!\left(\int q(x)\,dx\right)$ ......(4)

Since $q(x)$ is a probability density, $\int q(x)\,dx = 1$.
Therefore, since $\ln\!\left(\int q(x)\,dx\right) = \ln 1 = 0$, negating (4) yields the non-negativity of the KL divergence:
$D_{KL}(p\,\|\,q) = -\int p(x)\ln\!\left(\frac{q(x)}{p(x)}\right)dx = \int p(x)\ln\!\left(\frac{p(x)}{q(x)}\right)dx \ge 0$ ......(5)

We have thus shown that (8.61) is maximized as a function of $r(y)$ when $r(y) = q(y)$. Now take $q(y) = \Pr(Z^m \mid Z, \theta)$ and $r(y) = \Pr(Z^m \mid Z, \theta')$, and recall that $R(\theta', \theta) = E\!\left[\log \Pr(Z^m \mid Z, \theta') \mid Z, \theta\right]$. Then (5) gives
$R(\theta, \theta) - R(\theta', \theta) = D_{KL}\!\left(\Pr(Z^m \mid Z, \theta)\,\|\,\Pr(Z^m \mid Z, \theta')\right) \ge 0,$
hence $R(\theta', \theta) \le R(\theta, \theta)$.
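As a quick sanity check of (5), the following sketch draws random pairs of discrete distributions and confirms that the divergence is never negative (the 10-point support and the Dirichlet sampling are arbitrary choices of mine, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """Discrete D_KL(p || q) = sum_x p(x) * ln(p(x) / q(x))."""
    return np.sum(p * np.log(p / q))

# Random pairs of distributions on 10 points: (5) says D_KL >= 0,
# with equality exactly when p == q.
for _ in range(1000):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    assert kl(p, q) >= 0.0

print(kl(p, q))  # strictly positive for p != q
print(kl(p, p))  # 0.0: equality holds at q = p
```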

Ex. 8.2

Off topic

Since this exercise is based on a single paper [1], I would say that our lovely ESL authors simply overestimated poor readers like me, taking us for excellent mathematicians. ;)

Proof

I will use the notation of [1] instead of ESL's, which I find a bit confusing (denote $Z^m$ by $y$).
We want to prove: for a fixed value $\theta$, there is a unique distribution, $P_\theta$, given by $P_\theta(y) = P(y \mid z, \theta)$, which maximizes the log-likelihood (8.48).
Following the hint, we use a Lagrange multiplier to enforce normalization and write the Lagrangian:

$L(P(y), \lambda) = \sum_{i=1}^{n} P(y_i)\ln P_\theta(z, y_i) - \sum_{i=1}^{n} P(y_i)\ln P(y_i) + \lambda\!\left(1 - \sum_{i=1}^{n} P(y_i)\right)$ ......(1)

To find the stationary points, we set the derivative of $L(P(y), \lambda)$ with respect to each $P(y_i)$ $(i = 1, 2, \dots, n)$ to zero:
$\frac{\partial L}{\partial P(y_i)} = \ln P_\theta(z, y_i) - \ln P(y_i) - 1 - \lambda = 0$ ......(2)

Rearranging:
$\lambda = \ln P(z, y_i \mid \theta) - \ln P(y_i) - 1$ ......(3)

$P(y_i) = \exp(-1 - \lambda)\,P(z, y_i \mid \theta)$ ......(4)

for $i = 1, 2, \dots, n$.

From (4), it follows that $P(y)$ must be proportional to $P_\theta(z, y) = P(y, z \mid \theta)$. We also require $\sum_y P(y) = 1$.
Summing (4) over the $y_i$, we see:
$1 = \sum_y P(y) = \exp(-1 - \lambda)\sum_y P(y, z \mid \theta)$ ......(5)

$\sum_y P(y, z \mid \theta) = P(z \mid \theta) = \frac{1}{\exp(-1 - \lambda)}$ ......(6)

$\exp(-1 - \lambda) = \frac{1}{P(z \mid \theta)}$ ......(7)

Substituting (7) into (4):
$P(y_i) = \frac{P(z, y_i \mid \theta)}{P(z \mid \theta)} = P(y_i \mid z, \theta)$ ......(8)

which is exactly the posterior $P_\theta$, completing the proof.
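The following sketch checks (8) numerically: for a tiny discrete model it evaluates the objective in (1) (without the multiplier term) on random distributions over $y$ and confirms that none beats the posterior. The three-point support and the joint probabilities are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint P(z, y_i | theta) over three values of y, with z and theta
# fixed; the numbers are arbitrary and need not sum to one over y.
joint = np.array([0.10, 0.25, 0.05])
posterior = joint / joint.sum()          # P(y_i | z, theta), as in (8)

def F(P):
    """Objective (1) without the multiplier term:
    sum_i P(y_i) ln P_theta(z, y_i) - sum_i P(y_i) ln P(y_i)."""
    return np.sum(P * np.log(joint)) - np.sum(P * np.log(P))

# No random distribution over y exceeds the posterior's objective value.
best = F(posterior)
for _ in range(1000):
    P = rng.dirichlet(np.ones(3))
    assert F(P) <= best + 1e-12

print(best, np.log(joint.sum()))  # both equal ln P(z | theta)
```

At the optimum the objective equals $\ln P(z \mid \theta)$, which is Neal and Hinton's observation that the free-energy objective touches the log-likelihood when $P(y)$ is the posterior.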

Ex. 8.3

Ex. 8.4

Ex. 8.5

Ex. 8.6

Ex. 8.7

Proof that f(x) is non-decreasing under update (8.63)

From (8.62), $g(x^{s+1}, x^s) \le f(x^{s+1})$ and $g(x^s, x^s) = f(x^s)$; and since update (8.63) chooses $x^{s+1}$ to maximize $g(\cdot, x^s)$, we also have $g(x^s, x^s) \le g(x^{s+1}, x^s)$. Chaining these:

$f(x^{s+1}) \ge g(x^{s+1}, x^s) \ge g(x^s, x^s) = f(x^s)$ ......(1)
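Here is a minimal MM illustration of (1), on an example of my own (not from ESL): maximizing $f(x) = \cos(x)$ with the quadratic minorizer obtained from Taylor's theorem and $|\cos''| \le 1$.

```python
import numpy as np

# Maximize f(x) = cos(x).  By Taylor's theorem with |cos''| <= 1,
#   cos(x) >= cos(y) - sin(y) * (x - y) - 0.5 * (x - y)**2 = g(x, y)
# for all x, with equality at x = y, so g minorizes f as in (8.62).
# Maximizing g(x, y) over x (update 8.63) gives x_{s+1} = x_s - sin(x_s).
f = np.cos

x = 2.0                      # arbitrary starting point
values = [f(x)]
for _ in range(20):
    x = x - np.sin(x)        # x_{s+1} = argmax_x g(x, x_s)
    values.append(f(x))

# f(x_s) is non-decreasing, exactly as (1) predicts.
assert all(a <= b + 1e-12 for a, b in zip(values, values[1:]))
print(x, values[-1])         # x -> 0, f(x) -> 1, a maximum of cos
```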

Proof that the EM algorithm (Sec. 8.5.2) is an example of an MM algorithm

This exercise asks us to show the following:

$Q(\theta', \theta) + \log \Pr(Z \mid \theta) - Q(\theta, \theta) \le \ell(\theta'; Z)$ ......(2)

i.e. that the left-hand side, call it $g(\theta', \theta)$, minorizes the observed-data log-likelihood $\ell(\theta'; Z) = \log \Pr(Z \mid \theta')$.

On one hand, applying (8.46) at $\theta' = \theta$, we can write:
$\log \Pr(Z \mid \theta) = Q(\theta, \theta) - R(\theta, \theta)$ ......(3)

Hence, the left-hand side (l.h.s.) of (2) simplifies to:
$Q(\theta', \theta) + Q(\theta, \theta) - R(\theta, \theta) - Q(\theta, \theta) = Q(\theta', \theta) - R(\theta, \theta)$ ......(4)

On the other hand, also from (8.46), the r.h.s. of (2) can be written as:
$\ell(\theta'; Z) = Q(\theta', \theta) - R(\theta', \theta)$ ......(5)

From Ex. 8.1, we see:
$R(\theta', \theta) \le R(\theta, \theta)$ ......(6)

$-R(\theta, \theta) \le -R(\theta', \theta)$ ......(7)

$Q(\theta', \theta) - R(\theta, \theta) \le Q(\theta', \theta) - R(\theta', \theta)$ ......(8)

Finally, comparing (4) with (5) through (8) establishes (2). Moreover, at $\theta' = \theta$ both sides of (2) equal $\log \Pr(Z \mid \theta)$, so $g(\theta', \theta)$ minorizes $\ell(\theta'; Z)$ with equality at the current iterate; since the EM step maximizes $Q(\theta', \theta)$, and hence $g(\theta', \theta)$, over $\theta'$, EM is an MM algorithm. This finishes our demonstration.
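To make (2) concrete, here is a numerical check on a toy model of my own construction (a single observation from a two-component Bernoulli mixture; the emission probabilities and the current iterate are arbitrary): the minorizer never exceeds $\ell(\theta')$ on a grid, and touches it at $\theta' = \theta$.

```python
import numpy as np

# Toy model: latent zm ~ Bernoulli(theta), observed z | zm ~ Bernoulli(p[zm]).
p = np.array([0.2, 0.8])   # emission probabilities for zm = 0 and zm = 1
z = 1                      # the single observed value

def joint(theta):
    """Pr(z, zm | theta) for zm = 0, 1."""
    lik = p**z * (1 - p)**(1 - z)          # Pr(z | zm)
    return np.array([1 - theta, theta]) * lik

def ell(theta):
    """Observed-data log-likelihood: log Pr(z | theta)."""
    return np.log(joint(theta).sum())

def Q(t_new, t_old):
    """Q(theta', theta) = E[log Pr(z, zm | theta') | z, theta], as in (8.46)."""
    post = joint(t_old) / joint(t_old).sum()   # Pr(zm | z, theta)
    return np.sum(post * np.log(joint(t_new)))

theta = 0.3                                    # current iterate
for t_new in np.linspace(0.01, 0.99, 99):
    g = Q(t_new, theta) + ell(theta) - Q(theta, theta)   # l.h.s. of (2)
    assert g <= ell(t_new) + 1e-12             # the minorization (2)

# Equality at theta' = theta: g(theta, theta) = ell(theta).
print(Q(theta, theta) + ell(theta) - Q(theta, theta), ell(theta))
```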

Reference

[1] Neal, Radford M., and G. E. Hinton. "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants." Learning in Graphical Models. Springer Netherlands, 2000, pp. 355-368.
