# Maximum likelihood in the multivariate Gaussian distribution (1)


The multivariate Gaussian distribution is a well-known probability distribution, with density function

$$P(x) = \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} e^{-\frac{(x-\mu)^T \Sigma^{-1} (x-\mu)}{2}}$$

As we all know, if we assume that we have $m$ sample data points $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, the values of $\mu$ and $\Sigma$ can be determined from the samples by maximum likelihood estimation via the following equations:

$$\mu = \frac{x^{(1)} + x^{(2)} + \cdots + x^{(m)}}{m}, \qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)} - \mu)(x^{(i)} - \mu)^T$$
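To make the formulas concrete, here is a minimal numpy sketch that computes both estimates exactly as written above. The sample data, the true parameters, and all variable names are made up purely for illustration.

```python
import numpy as np

# Hypothetical sample data: m draws from an n-dimensional Gaussian (made-up parameters).
rng = np.random.default_rng(0)
m, n = 5000, 3
true_mu = np.array([1.0, -2.0, 0.5])
true_Sigma = np.array([[2.0, 0.3, 0.0],
                       [0.3, 1.0, 0.2],
                       [0.0, 0.2, 0.5]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=m)   # shape (m, n)

# Maximum-likelihood estimates, exactly as in the formulas above.
mu_hat = X.mean(axis=0)                # (x(1) + ... + x(m)) / m
diff = X - mu_hat                      # each row is x(i) - mu_hat
Sigma_hat = diff.T @ diff / m          # (1/m) * sum of outer products (x(i)-mu)(x(i)-mu)^T

print(mu_hat)      # close to true_mu for large m
print(Sigma_hat)   # close to true_Sigma for large m
```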

However, how are these formulas derived?

## Prerequisites

To understand the points made in this article, you need a basic grasp of calculus, linear algebra, and statistics. In particular, you should be familiar with derivatives, matrices, and maximum-likelihood estimation.

## Derivation

We start by writing out the likelihood function of $\mu$ and $\Sigma$. We denote the likelihood function as $L(\mu, \Sigma)$, and

$$L(\mu, \Sigma) = \prod_{i=1}^{m} \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} e^{-\frac{(x^{(i)}-\mu)^T \Sigma^{-1} (x^{(i)}-\mu)}{2}}$$

So, the log-likelihood function $l(\mu, \Sigma)$ is

$$\begin{aligned}
l(\mu, \Sigma) &= \log \prod_{i=1}^{m} \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} e^{-\frac{(x^{(i)}-\mu)^T \Sigma^{-1} (x^{(i)}-\mu)}{2}} \\
&= \sum_{i=1}^{m} \log \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} e^{-\frac{(x^{(i)}-\mu)^T \Sigma^{-1} (x^{(i)}-\mu)}{2}} \\
&= \sum_{i=1}^{m} \left( \log \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} - \frac{(x^{(i)}-\mu)^T \Sigma^{-1} (x^{(i)}-\mu)}{2} \right) \\
&= \sum_{i=1}^{m} \left( -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{(x^{(i)}-\mu)^T \Sigma^{-1} (x^{(i)}-\mu)}{2} \right) \\
&= -\frac{mn}{2}\log(2\pi) - \frac{m}{2}\log|\Sigma| - \sum_{i=1}^{m} \frac{(x^{(i)}-\mu)^T \Sigma^{-1} (x^{(i)}-\mu)}{2}
\end{aligned}$$
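The last line of this expansion can be checked numerically. The sketch below uses made-up data and relies on `scipy.stats.multivariate_normal` purely as an independent reference for the log-density; all variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, mu, Sigma):
    """l(mu, Sigma), written exactly as the last line of the expansion above."""
    m, n = X.shape
    Sigma_inv = np.linalg.inv(Sigma)
    diff = X - mu
    # quad[i] = (x(i) - mu)^T Sigma^{-1} (x(i) - mu)
    quad = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)
    return (-m * n / 2 * np.log(2 * np.pi)
            - m / 2 * np.log(np.linalg.det(Sigma))
            - quad.sum() / 2)

# Sanity check against scipy's summed log-density on made-up data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)
print(np.isclose(log_likelihood(X, mu, Sigma),
                 multivariate_normal(mu, Sigma).logpdf(X).sum()))   # True
```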

Then, the derivative of $l$ with respect to $\mu$ is given by

$$\frac{\partial l}{\partial \mu} = \frac{\partial}{\partial \mu}\left( -\sum_{i=1}^{m} \frac{(x^{(i)}-\mu)^T \Sigma^{-1} (x^{(i)}-\mu)}{2} \right)$$

First, we expand the equation and get

$$\frac{\partial l}{\partial \mu} = \frac{1}{2} \frac{\partial}{\partial \mu}\left( \sum_{i=1}^{m} \left( (x^{(i)})^T \Sigma^{-1} \mu + \mu^T \Sigma^{-1} x^{(i)} - \mu^T \Sigma^{-1} \mu - (x^{(i)})^T \Sigma^{-1} x^{(i)} \right) \right)$$

Then, we move the partial derivative into the summation

$$\begin{aligned}
\frac{\partial l}{\partial \mu} &= \frac{1}{2} \sum_{i=1}^{m} \frac{\partial}{\partial \mu}\left( (x^{(i)})^T \Sigma^{-1} \mu + \mu^T \Sigma^{-1} x^{(i)} - \mu^T \Sigma^{-1} \mu - (x^{(i)})^T \Sigma^{-1} x^{(i)} \right) \\
&= \frac{1}{2} \sum_{i=1}^{m} \frac{\partial}{\partial \mu}\left( (x^{(i)})^T \Sigma^{-1} \mu + \mu^T \Sigma^{-1} x^{(i)} - \mu^T \Sigma^{-1} \mu \right)
\end{aligned}$$

This holds because the last term $(x^{(i)})^T \Sigma^{-1} x^{(i)}$ does not depend on $\mu$.
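As a quick sanity check on the expansion of the quadratic form, the following sketch uses arbitrary made-up values and a random symmetric positive-definite matrix standing in for $\Sigma^{-1}$; it confirms that $-(x-\mu)^T \Sigma^{-1} (x-\mu)$ equals the expanded sum of terms used above.

```python
import numpy as np

# Arbitrary made-up values for a numerical check of the expansion.
rng = np.random.default_rng(2)
n = 3
x, mu = rng.normal(size=n), rng.normal(size=n)
A = rng.normal(size=(n, n))
S_inv = A @ A.T + np.eye(n)          # symmetric positive-definite stand-in for Sigma^{-1}

lhs = -(x - mu) @ S_inv @ (x - mu)
rhs = x @ S_inv @ mu + mu @ S_inv @ x - mu @ S_inv @ mu - x @ S_inv @ x
print(np.isclose(lhs, rhs))          # True
```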

## Analysis

We analyse each term in the formula. The first point I want to make is that $(x^{(i)})^T \Sigma^{-1} \mu = \mu^T \Sigma^{-1} x^{(i)}$. Why? Let's consider a general form of this identity, which I will prove in the next few paragraphs.

Theorem: Let $a$, $b$ be $n$-dimensional vectors and let $X$ be an $n \times n$ symmetric matrix. Then the following identity holds:

$$a^T X b = b^T X a$$

Proof: You can expand the expressions on both sides of this identity, and you will get

$$a^T X b = \sum_{i=1,j=1}^{n} a_i X_{ij} b_j, \qquad b^T X a = \sum_{i=1,j=1}^{n} b_i X_{ij} a_j = \sum_{i=1,j=1}^{n} a_i X_{ji} b_j$$

As a result, $a^T X b = b^T X a$ holds for all $a$, $b$ if and only if $X_{ij} = X_{ji}$ for all $i, j \in \{1, 2, \dots, n\}$, i.e. if and only if $X$ is symmetric.

We can interpret this result in terms of inner products. Recall that the inner product of vectors $a$ and $b$ is defined as $\langle a, b \rangle = a^T b$, and it is obvious that $b^T a = \langle b, a \rangle = \langle a, b \rangle = a^T b = \sum_{i=1}^{n} a_i b_i$. In fact, we can also define the weighted inner product of $a$ and $b$ as $\langle a, b \rangle_W = \sum_{i=1,j=1}^{n} a_i W_{ij} b_j = a^T W b$, where the matrix $W$ is a symmetric matrix called the weight matrix. The term "weighted inner product" is used to emphasize that we "insert" a "weight" into the product of every pair of entries of $a$ and $b$, namely $a_i$ and $b_j$. The inner product of two vectors $a$ and $b$ is commutative, and so is its weighted counterpart with respect to a particular weight matrix. As a result, the identity in the previous theorem, $a^T X b = b^T X a$, is now almost obvious: it is just the commutative rule.
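The role of symmetry is easy to check numerically. The snippet below, with arbitrary made-up vectors, verifies that the weighted inner product is commutative for a symmetric weight matrix, and that it generally is not for a non-symmetric one.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
a, b = rng.normal(size=n), rng.normal(size=n)

M = rng.normal(size=(n, n))          # generic (non-symmetric) matrix
W = (M + M.T) / 2                    # a symmetric "weight" matrix

# Commutative when the weight matrix is symmetric ...
print(np.isclose(a @ W @ b, b @ W @ a))   # True

# ... but generally not for a non-symmetric matrix.
print(np.isclose(a @ M @ b, b @ M @ a))   # almost surely False
```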

We continue to analyse $\frac{\partial l}{\partial \mu}$. Using this identity, we now get

$$\frac{\partial l}{\partial \mu} = \frac{1}{2} \sum_{i=1}^{m} \frac{\partial}{\partial \mu}\left( 2\mu^T \Sigma^{-1} x^{(i)} - \mu^T \Sigma^{-1} \mu \right)$$

Calculating $\frac{\partial l}{\partial \mu}$ directly is not so easy, so we instead calculate $\frac{\partial l}{\partial \mu_j}$, the derivative of $l$ with respect to the $j$-th entry of $\mu$.

$$\frac{\partial l}{\partial \mu_j} = \frac{1}{2} \sum_{i=1}^{m} \left( 2\frac{\partial \mu^T}{\partial \mu_j} \Sigma^{-1} x^{(i)} - \frac{\partial \mu^T}{\partial \mu_j} \Sigma^{-1} \mu - \mu^T \Sigma^{-1} \frac{\partial \mu}{\partial \mu_j} \right)$$

We denote by $e_j$ the column vector with a $1$ in its $j$-th entry and $0$ in all other entries. It can be shown that $\frac{\partial \mu}{\partial \mu_j} = e_j$, because only the $j$-th entry of $\mu$ depends on $\mu_j$. So we can get

$$\frac{\partial l}{\partial \mu_j} = \frac{1}{2} \sum_{i=1}^{m} \left( 2 e_j^T \Sigma^{-1} x^{(i)} - e_j^T \Sigma^{-1} \mu - \mu^T \Sigma^{-1} e_j \right) = \frac{1}{2} \sum_{i=1}^{m} \left( 2 e_j^T \Sigma^{-1} x^{(i)} - 2 e_j^T \Sigma^{-1} \mu \right)$$

Note that we have again made use of the theorem that $a^T X b = b^T X a$ when $X$ is symmetric.

$e_j^T \Sigma^{-1} x^{(i)} = e_j^T (\Sigma^{-1} x^{(i)})$ is just the $j$-th entry of $\Sigma^{-1} x^{(i)}$ (which is simply an $n$-dimensional vector), so we denote it as $(\Sigma^{-1} x^{(i)})_j$ (following the convention of numerical packages such as numpy). As a result, the equation becomes

$$\frac{\partial l}{\partial \mu_j} = \frac{1}{2} \sum_{i=1}^{m} \left( 2 (\Sigma^{-1} x^{(i)})_j - 2 (\Sigma^{-1} \mu)_j \right) = \sum_{i=1}^{m} \left( \Sigma^{-1} x^{(i)} - \Sigma^{-1} \mu \right)_j = \left( \sum_{i=1}^{m} \left( \Sigma^{-1} x^{(i)} - \Sigma^{-1} \mu \right) \right)_j$$

That is, $\frac{\partial l}{\partial \mu_j}$ is just the $j$-th entry of $\sum_{i=1}^{m} (\Sigma^{-1} x^{(i)} - \Sigma^{-1} \mu)$. As a result, $\frac{\partial l}{\partial \mu}$ is just $\sum_{i=1}^{m} (\Sigma^{-1} x^{(i)} - \Sigma^{-1} \mu)$.
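Before setting the gradient to zero, we can sanity-check this closed form against a finite-difference approximation. The sketch below uses made-up data and keeps only the $\mu$-dependent part of $l$, since the remaining terms do not affect the gradient with respect to $\mu$; all names and values are illustrative assumptions.

```python
import numpy as np

def l_mu_part(X, mu, Sigma_inv):
    # Only the mu-dependent part of l: -0.5 * sum_i (x(i)-mu)^T Sigma^{-1} (x(i)-mu).
    diff = X - mu
    return -0.5 * np.einsum('ij,jk,ik->', diff, Sigma_inv, diff)

def grad_mu(X, mu, Sigma_inv):
    # Closed form derived above: sum_i (Sigma^{-1} x(i) - Sigma^{-1} mu).
    return Sigma_inv @ (X - mu).sum(axis=0)

rng = np.random.default_rng(4)
m, n = 50, 3
X = rng.normal(size=(m, n))
mu = rng.normal(size=n)
A = rng.normal(size=(n, n))
Sigma_inv = A @ A.T + n * np.eye(n)       # symmetric positive-definite Sigma^{-1}

# Central finite-difference approximation of the gradient, one coordinate at a time.
eps = 1e-6
num_grad = np.array([
    (l_mu_part(X, mu + eps * e, Sigma_inv) -
     l_mu_part(X, mu - eps * e, Sigma_inv)) / (2 * eps)
    for e in np.eye(n)
])
print(np.allclose(grad_mu(X, mu, Sigma_inv), num_grad, atol=1e-4))   # True
```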

Recall that we want to maximize $l$ with respect to $\mu$, so we should set $\frac{\partial l}{\partial \mu}$ to $0$. Thus, we will get

$$\sum_{i=1}^{m} \left( \Sigma^{-1} x^{(i)} - \Sigma^{-1} \mu \right) = 0$$

The matrix $\Sigma^{-1}$ is obviously non-singular, so multiplying both sides by $\Sigma$ gives

$$\sum_{i=1}^{m} \left( x^{(i)} - \mu \right) = 0$$

and the expression for $\mu$ is thus just

$$\mu = \frac{\sum_{i=1}^{m} x^{(i)}}{m} = \frac{x^{(1)} + x^{(2)} + \cdots + x^{(m)}}{m}$$
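As a final check, the gradient derived above should vanish (up to floating-point error) when $\mu$ is set to the sample mean. The data below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 200, 3
X = rng.normal(size=(m, n))
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))   # any symmetric PD Sigma^{-1} works

mu_hat = X.mean(axis=0)                              # the closed-form solution above
grad_at_mu_hat = Sigma_inv @ (X - mu_hat).sum(axis=0)
print(np.allclose(grad_at_mu_hat, 0))                # True: the gradient vanishes at the sample mean
```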
