# Maximum likelihood in the multivariate Gaussian distribution (1)


The multivariate Gaussian distribution is a well-known probability distribution, with density function

$$P(x) = \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} e^{-\frac{(x-\mu)^T \Sigma^{-1} (x-\mu)}{2}}$$

As we all know, if we assume that we have $m$ sample data points $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, the values of $\mu$ and $\Sigma$ can be determined from the samples by maximum likelihood estimation via the following equations:

$$\mu = \frac{x^{(1)} + x^{(2)} + \cdots + x^{(m)}}{m}, \qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)} - \mu)(x^{(i)} - \mu)^T$$
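To make the formulas concrete, here is a minimal numpy sketch that computes both estimates exactly as written above. The sample data, the true parameters, and all variable names are made up purely for illustration.

```python
import numpy as np

# Hypothetical sample data: m draws from an n-dimensional Gaussian (made-up parameters).
rng = np.random.default_rng(0)
m, n = 5000, 3
true_mu = np.array([1.0, -2.0, 0.5])
true_Sigma = np.array([[2.0, 0.3, 0.0],
                       [0.3, 1.0, 0.2],
                       [0.0, 0.2, 0.5]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=m)   # shape (m, n)

# Maximum-likelihood estimates, exactly as in the formulas above.
mu_hat = X.mean(axis=0)                # (x(1) + ... + x(m)) / m
diff = X - mu_hat                      # each row is x(i) - mu_hat
Sigma_hat = diff.T @ diff / m          # (1/m) * sum of outer products (x(i)-mu)(x(i)-mu)^T

print(mu_hat)      # close to true_mu for large m
print(Sigma_hat)   # close to true_Sigma for large m
```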

However, how are these formulas derived?

## Prerequisites

To understand the points made in this article, you need a basic grasp of calculus, linear algebra, and statistics. In particular, you should be familiar with derivatives, matrices, and maximum-likelihood estimation.

## Derivation

We start by writing out the likelihood function of $\mu$ and $\Sigma$. We denote the likelihood function as $L(\mu, \Sigma)$, and

$$L(\mu, \Sigma) = \prod_{i=1}^{m} \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} e^{-\frac{(x^{(i)}-\mu)^T \Sigma^{-1} (x^{(i)}-\mu)}{2}}$$

So, the log-likelihood function $l(\mu, \Sigma)$ is

$$\begin{aligned}
l(\mu, \Sigma) &= \log \prod_{i=1}^{m} \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} e^{-\frac{(x^{(i)}-\mu)^T \Sigma^{-1} (x^{(i)}-\mu)}{2}} \\
&= \sum_{i=1}^{m} \log \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} e^{-\frac{(x^{(i)}-\mu)^T \Sigma^{-1} (x^{(i)}-\mu)}{2}} \\
&= \sum_{i=1}^{m} \left( \log \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} - \frac{(x^{(i)}-\mu)^T \Sigma^{-1} (x^{(i)}-\mu)}{2} \right) \\
&= \sum_{i=1}^{m} \left( -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{(x^{(i)}-\mu)^T \Sigma^{-1} (x^{(i)}-\mu)}{2} \right) \\
&= -\frac{mn}{2}\log(2\pi) - \frac{m}{2}\log|\Sigma| - \sum_{i=1}^{m} \frac{(x^{(i)}-\mu)^T \Sigma^{-1} (x^{(i)}-\mu)}{2}
\end{aligned}$$
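The last line of this expansion can be checked numerically. The sketch below uses made-up data and relies on `scipy.stats.multivariate_normal` purely as an independent reference for the log-density; all variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, mu, Sigma):
    """l(mu, Sigma), written exactly as the last line of the expansion above."""
    m, n = X.shape
    Sigma_inv = np.linalg.inv(Sigma)
    diff = X - mu
    # quad[i] = (x(i) - mu)^T Sigma^{-1} (x(i) - mu)
    quad = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)
    return (-m * n / 2 * np.log(2 * np.pi)
            - m / 2 * np.log(np.linalg.det(Sigma))
            - quad.sum() / 2)

# Sanity check against scipy's summed log-density on made-up data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)
print(np.isclose(log_likelihood(X, mu, Sigma),
                 multivariate_normal(mu, Sigma).logpdf(X).sum()))   # True
```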

Then, the derivative of $l$ with respect to $\mu$ is given by

$$\frac{\partial l}{\partial \mu} = \frac{\partial}{\partial \mu}\left( -\sum_{i=1}^{m} \frac{(x^{(i)}-\mu)^T \Sigma^{-1} (x^{(i)}-\mu)}{2} \right)$$

First, we expand the equation and get

$$\frac{\partial l}{\partial \mu} = \frac{1}{2} \frac{\partial}{\partial \mu}\left( \sum_{i=1}^{m} \left( (x^{(i)})^T \Sigma^{-1} \mu + \mu^T \Sigma^{-1} x^{(i)} - \mu^T \Sigma^{-1} \mu - (x^{(i)})^T \Sigma^{-1} x^{(i)} \right) \right)$$

Then, we move the partial derivative into the summation

$$\begin{aligned}
\frac{\partial l}{\partial \mu} &= \frac{1}{2} \sum_{i=1}^{m} \frac{\partial}{\partial \mu}\left( (x^{(i)})^T \Sigma^{-1} \mu + \mu^T \Sigma^{-1} x^{(i)} - \mu^T \Sigma^{-1} \mu - (x^{(i)})^T \Sigma^{-1} x^{(i)} \right) \\
&= \frac{1}{2} \sum_{i=1}^{m} \frac{\partial}{\partial \mu}\left( (x^{(i)})^T \Sigma^{-1} \mu + \mu^T \Sigma^{-1} x^{(i)} - \mu^T \Sigma^{-1} \mu \right)
\end{aligned}$$

This holds because the last term $(x^{(i)})^T \Sigma^{-1} x^{(i)}$ does not depend on $\mu$.
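As a quick sanity check on the expansion of the quadratic form, the following sketch uses arbitrary made-up values and a random symmetric positive-definite matrix standing in for $\Sigma^{-1}$; it confirms that $-(x-\mu)^T \Sigma^{-1} (x-\mu)$ equals the expanded sum of terms used above.

```python
import numpy as np

# Arbitrary made-up values for a numerical check of the expansion.
rng = np.random.default_rng(2)
n = 3
x, mu = rng.normal(size=n), rng.normal(size=n)
A = rng.normal(size=(n, n))
S_inv = A @ A.T + np.eye(n)          # symmetric positive-definite stand-in for Sigma^{-1}

lhs = -(x - mu) @ S_inv @ (x - mu)
rhs = x @ S_inv @ mu + mu @ S_inv @ x - mu @ S_inv @ mu - x @ S_inv @ x
print(np.isclose(lhs, rhs))          # True
```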

## Analysis

We analyse each term in the formula. The first point I want to make is that $(x^{(i)})^T \Sigma^{-1} \mu = \mu^T \Sigma^{-1} x^{(i)}$. Why? Let's consider a general form of this identity, which I will prove in the next few paragraphs.

Theorem: Let $a$, $b$ be $n$-dimensional vectors and let $X$ be an $n \times n$ symmetric matrix. Then the following identity holds:

$$a^T X b = b^T X a$$

Proof: You can expand the expressions on both sides of this identity, and you will get

$$a^T X b = \sum_{i=1,j=1}^{n} a_i X_{ij} b_j, \qquad b^T X a = \sum_{i=1,j=1}^{n} b_i X_{ij} a_j = \sum_{i=1,j=1}^{n} a_i X_{ji} b_j$$

As a result, $a^T X b = b^T X a$ holds for all $a$, $b$ if and only if $X_{ij} = X_{ji}$ for all $i, j \in \{1, 2, \dots, n\}$, i.e. if and only if $X$ is symmetric.

We can interpret this result in terms of inner products. Recall that the inner product of vectors $a$ and $b$ is defined as $\langle a, b \rangle = a^T b$, and it is obvious that $b^T a = \langle b, a \rangle = \langle a, b \rangle = a^T b = \sum_{i=1}^{n} a_i b_i$. In fact, we can also define the weighted inner product of $a$ and $b$ as $\langle a, b \rangle_W = \sum_{i=1,j=1}^{n} a_i W_{ij} b_j = a^T W b$, where the matrix $W$ is a symmetric matrix called the weight matrix. The term "weighted inner product" is used to emphasize that we "insert" a "weight" into the product of every pair of entries of $a$ and $b$, namely $a_i$ and $b_j$. The inner product of two vectors $a$ and $b$ is commutative, and so is its weighted counterpart with respect to a particular weight matrix. As a result, the identity in the previous theorem, $a^T X b = b^T X a$, is now almost obvious: it is just the commutative rule.
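The role of symmetry is easy to check numerically. The snippet below, with arbitrary made-up vectors, verifies that the weighted inner product is commutative for a symmetric weight matrix, and that it generally is not for a non-symmetric one.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
a, b = rng.normal(size=n), rng.normal(size=n)

M = rng.normal(size=(n, n))          # generic (non-symmetric) matrix
W = (M + M.T) / 2                    # a symmetric "weight" matrix

# Commutative when the weight matrix is symmetric ...
print(np.isclose(a @ W @ b, b @ W @ a))   # True

# ... but generally not for a non-symmetric matrix.
print(np.isclose(a @ M @ b, b @ M @ a))   # almost surely False
```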

We continue to analyse $\frac{\partial l}{\partial \mu}$. Using this identity, we now get

$$\frac{\partial l}{\partial \mu} = \frac{1}{2} \sum_{i=1}^{m} \frac{\partial}{\partial \mu}\left( 2\mu^T \Sigma^{-1} x^{(i)} - \mu^T \Sigma^{-1} \mu \right)$$

Calculating $\frac{\partial l}{\partial \mu}$ directly is not so easy, so we instead calculate $\frac{\partial l}{\partial \mu_j}$, the derivative of $l$ with respect to the $j$-th entry of $\mu$.

$$\frac{\partial l}{\partial \mu_j} = \frac{1}{2} \sum_{i=1}^{m} \left( 2\frac{\partial \mu^T}{\partial \mu_j} \Sigma^{-1} x^{(i)} - \frac{\partial \mu^T}{\partial \mu_j} \Sigma^{-1} \mu - \mu^T \Sigma^{-1} \frac{\partial \mu}{\partial \mu_j} \right)$$

We denote by $e_j$ the column vector with a $1$ in its $j$-th entry and $0$ in all other entries. It can be shown that $\frac{\partial \mu}{\partial \mu_j} = e_j$, because only the $j$-th entry of $\mu$ depends on $\mu_j$. So we can get

$$\frac{\partial l}{\partial \mu_j} = \frac{1}{2} \sum_{i=1}^{m} \left( 2 e_j^T \Sigma^{-1} x^{(i)} - e_j^T \Sigma^{-1} \mu - \mu^T \Sigma^{-1} e_j \right) = \frac{1}{2} \sum_{i=1}^{m} \left( 2 e_j^T \Sigma^{-1} x^{(i)} - 2 e_j^T \Sigma^{-1} \mu \right)$$

Note that we have again made use of the theorem that $a^T X b = b^T X a$ when $X$ is symmetric.

$e_j^T \Sigma^{-1} x^{(i)} = e_j^T (\Sigma^{-1} x^{(i)})$ is just the $j$-th entry of $\Sigma^{-1} x^{(i)}$ (which is simply an $n$-dimensional vector), so we denote it as $(\Sigma^{-1} x^{(i)})_j$ (following the convention of numerical packages such as numpy). As a result, the equation becomes

$$\frac{\partial l}{\partial \mu_j} = \frac{1}{2} \sum_{i=1}^{m} \left( 2 (\Sigma^{-1} x^{(i)})_j - 2 (\Sigma^{-1} \mu)_j \right) = \sum_{i=1}^{m} \left( \Sigma^{-1} x^{(i)} - \Sigma^{-1} \mu \right)_j = \left( \sum_{i=1}^{m} \left( \Sigma^{-1} x^{(i)} - \Sigma^{-1} \mu \right) \right)_j$$

That is, $\frac{\partial l}{\partial \mu_j}$ is just the $j$-th entry of $\sum_{i=1}^{m} (\Sigma^{-1} x^{(i)} - \Sigma^{-1} \mu)$. As a result, $\frac{\partial l}{\partial \mu}$ is just $\sum_{i=1}^{m} (\Sigma^{-1} x^{(i)} - \Sigma^{-1} \mu)$.
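Before setting the gradient to zero, we can sanity-check this closed form against a finite-difference approximation. The sketch below uses made-up data and keeps only the $\mu$-dependent part of $l$, since the remaining terms do not affect the gradient with respect to $\mu$; all names and values are illustrative assumptions.

```python
import numpy as np

def l_mu_part(X, mu, Sigma_inv):
    # Only the mu-dependent part of l: -0.5 * sum_i (x(i)-mu)^T Sigma^{-1} (x(i)-mu).
    diff = X - mu
    return -0.5 * np.einsum('ij,jk,ik->', diff, Sigma_inv, diff)

def grad_mu(X, mu, Sigma_inv):
    # Closed form derived above: sum_i (Sigma^{-1} x(i) - Sigma^{-1} mu).
    return Sigma_inv @ (X - mu).sum(axis=0)

rng = np.random.default_rng(4)
m, n = 50, 3
X = rng.normal(size=(m, n))
mu = rng.normal(size=n)
A = rng.normal(size=(n, n))
Sigma_inv = A @ A.T + n * np.eye(n)       # symmetric positive-definite Sigma^{-1}

# Central finite-difference approximation of the gradient, one coordinate at a time.
eps = 1e-6
num_grad = np.array([
    (l_mu_part(X, mu + eps * e, Sigma_inv) -
     l_mu_part(X, mu - eps * e, Sigma_inv)) / (2 * eps)
    for e in np.eye(n)
])
print(np.allclose(grad_mu(X, mu, Sigma_inv), num_grad, atol=1e-4))   # True
```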

Recall that we want to maximize $l$ with respect to $\mu$, so we should set $\frac{\partial l}{\partial \mu}$ to $0$. Thus, we will get

$$\sum_{i=1}^{m} \left( \Sigma^{-1} x^{(i)} - \Sigma^{-1} \mu \right) = 0$$

The matrix $\Sigma^{-1}$ is obviously non-singular, so multiplying both sides by $\Sigma$ gives

$$\sum_{i=1}^{m} \left( x^{(i)} - \mu \right) = 0$$

and the expression for $\mu$ is thus just

$$\mu = \frac{\sum_{i=1}^{m} x^{(i)}}{m} = \frac{x^{(1)} + x^{(2)} + \cdots + x^{(m)}}{m}$$
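As a final check, the gradient derived above should vanish (up to floating-point error) when $\mu$ is set to the sample mean. The data below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 200, 3
X = rng.normal(size=(m, n))
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))   # any symmetric PD Sigma^{-1} works

mu_hat = X.mean(axis=0)                              # the closed-form solution above
grad_at_mu_hat = Sigma_inv @ (X - mu_hat).sum(axis=0)
print(np.allclose(grad_at_mu_hat, 0))                # True: the gradient vanishes at the sample mean
```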
