A Brief Introduction to Logistic Regression and Least-Squares Probabilistic Classification, with Examples

Source: Internet | Published by: 电脑音效软件 | Editor: 程序博客网 | Time: 2024/06/06 09:43

Logistic Regression & Least Square Probability Classification

1. Logistic Regression

The likelihood function, as described on Wikipedia (https://en.wikipedia.org/wiki/Likelihood_function), plays a key role in statistical inference, especially in methods of estimating a parameter from a set of statistics. This article relies on it throughout.
Pattern recognition can be approached by learning the posterior probability p(y|x) that a pattern x belongs to class y. Given a pattern x, we assign it to the class y whose posterior probability is the largest, i.e.

$$\hat{y} = \arg\max_{y = 1, \dots, c} p(y \mid x)$$
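As a small illustration of this decision rule, the sketch below picks the class with the largest posterior for a few patterns. The matrix `P` and its values are made up for illustration; each row holds hypothetical posteriors p(y|x) of one pattern for c = 3 classes.

```matlab
% Hypothetical posterior probabilities: one row per pattern, one column per class.
P = [0.7 0.2 0.1;
     0.1 0.3 0.6;
     0.2 0.5 0.3];
% The predicted class is the column index of the largest entry in each row.
[~, y_hat] = max(P, [], 2);
disp(y_hat');   % predicted classes of the three patterns: 1, 3 and 2
```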

The posterior probability can be interpreted as the confidence that pattern x belongs to class y.
Logistic regression models the posterior probability with a log-linear function:

$$q(y \mid x, \theta) = \frac{\exp\left(\sum_{j=1}^{b} \theta_j^{(y)} \phi_j(x)\right)}{\sum_{y'=1}^{c} \exp\left(\sum_{j=1}^{b} \theta_j^{(y')} \phi_j(x)\right)}$$

Note that the denominator is a normalization term: it makes the model outputs sum to one over all classes. Logistic regression is then defined by the following optimization problem:

$$\max_{\theta} \sum_{i=1}^{n} \log q(y_i \mid x_i, \theta)$$
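To make the model and the objective concrete, the following sketch evaluates the log-linear posterior and the log-likelihood on a tiny random problem. All sizes, values, and variable names (`Phi`, `Theta`, `Q`, `loglik`) are assumptions chosen for illustration; subtracting the row maximum before exponentiating is a common numerical safeguard and does not change the result.

```matlab
% Illustrative sizes and random values only: b basis functions, c classes, n samples.
b = 4; c = 3; n = 5;
Phi   = randn(n, b);          % Phi(i,j) = phi_j(x_i), hypothetical basis outputs
Theta = randn(b, c);          % column y holds theta^(y)
y     = [1 2 3 1 2]';         % hypothetical class labels

S = Phi * Theta;                              % S(i,y) = theta^(y)' * phi(x_i)
S = S - repmat(max(S, [], 2), 1, c);          % shift each row for numerical stability
Q = exp(S) ./ repmat(sum(exp(S), 2), 1, c);   % Q(i,y) = q(y | x_i, theta)

idx    = sub2ind(size(Q), (1:n)', y);         % linear indices of the entries Q(i, y_i)
loglik = sum(log(Q(idx)));                    % objective value to be maximized
```

Each row of `Q` sums to one, and `loglik` is never positive because every `Q(i, y_i)` is at most one.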

We can solve this optimization problem by stochastic gradient ascent:

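1. Initialize θ.
2. Pick a training sample $(x_i, y_i)$ at random.
3. Update $\theta = (\theta^{(1)T}, \dots, \theta^{(c)T})^T$ along the direction of gradient ascent: $\theta^{(y)} \leftarrow \theta^{(y)} + \epsilon \nabla_y J_i(\theta), \; y = 1, \dots, c$, where $\epsilon > 0$ is the step size, $J_i(\theta) = \log q(y_i \mid x_i, \theta)$, and $\nabla_y J_i(\theta) = -\frac{\exp\left(\theta^{(y)T} \phi(x_i)\right) \phi(x_i)}{\sum_{y'=1}^{c} \exp\left(\theta^{(y')T} \phi(x_i)\right)} + \begin{cases} \phi(x_i) & (y = y_i) \\ 0 & (y \neq y_i) \end{cases}$
4. Repeat steps 2 and 3 until θ converges to the desired precision.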
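The gradient used in the update step follows directly from the log-linear model. As a short derivation, writing $J_i(\theta)$ for the log-likelihood of the i-th sample and $\mathbb{1}[\cdot]$ for the indicator function (notation introduced here for brevity):

```latex
\begin{aligned}
J_i(\theta) &= \log q(y_i \mid x_i, \theta)
  = \theta^{(y_i)T} \phi(x_i) - \log \sum_{y'=1}^{c} \exp\left(\theta^{(y')T} \phi(x_i)\right), \\
\nabla_y J_i(\theta) &= \mathbb{1}[y = y_i]\, \phi(x_i)
  - \frac{\exp\left(\theta^{(y)T} \phi(x_i)\right)}{\sum_{y'=1}^{c} \exp\left(\theta^{(y')T} \phi(x_i)\right)}\, \phi(x_i)
  = \left(\mathbb{1}[y = y_i] - q(y \mid x_i, \theta)\right) \phi(x_i).
\end{aligned}
```

In words, each update moves the parameters of the true class toward $\phi(x_i)$ and moves the parameters of every class away from it in proportion to the current predicted probability.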

Take the Gaussian kernel model as an example:

$$q(y \mid x, \theta) \propto \exp\left(\sum_{j=1}^{n} \theta_j^{(y)} K(x, x_j)\right)$$

If you are not familiar with the Gaussian kernel model, refer to this article:

http://blog.csdn.net/philthinker/article/details/65628280
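For quick reference, here is a minimal sketch of the Gaussian kernel, assuming the form K(x, c) = exp(-(x - c)^2 / (2h^2)) with bandwidth h, which is the form the MATLAB listings in this article use for one-dimensional inputs. The sample values and the bandwidth are illustrative.

```matlab
x  = [-1; 0; 2];      % hypothetical one-dimensional training samples
h  = 1;               % assumed kernel bandwidth
hh = 2 * h^2;
% K(i,j) = exp(-(x_i - x_j)^2 / (2 h^2)), built from the expanded square
% x_i^2 + x_j^2 - 2 x_i x_j so that no explicit loop is needed.
x2 = x.^2;
n  = numel(x);
K  = exp(-(repmat(x2, 1, n) + repmat(x2', n, 1) - 2 * (x * x')) / hh);
```

The resulting matrix is symmetric with ones on the diagonal, since every sample has zero distance to itself.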

Here is the corresponding MATLAB code for logistic regression with the Gaussian kernel model:

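% Logistic regression with a Gaussian kernel model (stochastic gradient ascent).
n=90; c=3; y=ones(n/c,1)*(1:c); y=y(:); % n/c samples per class, labels 1..c
x=randn(n/c,c)+repmat(linspace(-3,3,c),n/c,1); x=x(:); % 1-D samples around -3, 0, 3
hh=2*1^2; t0=randn(n,c); % hh = 2*h^2 with h = 1; random initial parameters
for o=1:n*1000
    i=ceil(rand*n); yi=y(i); % pick a random training sample
    ki=exp(-(x-x(i)).^2/hh); % kernel vector phi(x_i)
    ci=exp(ki'*t0); % unnormalized class scores (1 x c)
    t=t0-0.1*(ki*ci)/sum(ci); % normalized by sum(ci), as in the gradient formula
    t(:,yi)=t(:,yi)+0.1*ki; % extra term for the true class
    if norm(t-t0)<0.000001
        break; % stop when the update is negligible
    end
    t0=t;
end
N=100; X=linspace(-5,5,N)'; % test points
K=exp(-(repmat(X.^2,1,n)+repmat((x.^2)',N,1)-2*X*x')/hh);
C=exp(K*t); C=C./repmat(sum(C,2),1,c); % normalized posteriors at the test points
figure(1); clf; hold on; axis([-5,5,-0.3,1.8]);
plot(X,C(:,1),'b-');
plot(X,C(:,2),'r--');
plot(X,C(:,3),'g:');
plot(x(y==1),-0.1*ones(n/c,1),'bo');
plot(x(y==2),-0.2*ones(n/c,1),'rx');
plot(x(y==3),-0.1*ones(n/c,1),'gv');
legend('q(y=1|x)','q(y=2|x)','q(y=3|x)');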

[Figure: class posteriors q(y|x) plotted by the code above]

2. Least Square Probability Classification

In the least-squares (LS) probabilistic classifier, a linear-in-parameter model is used to express the posterior probability:

$$q(y \mid x, \theta^{(y)}) = \sum_{j=1}^{b} \theta_j^{(y)} \phi_j(x) = \theta^{(y)T} \phi(x), \quad y = 1, \dots, c$$

Each of these models depends only on the parameter vector $\theta^{(y)} = (\theta_1^{(y)}, \dots, \theta_b^{(y)})^T$ of its own class y, which differs from the logistic classifier, where the normalization term couples the parameters of all classes. Learning these models means minimizing the following squared error:

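$$\begin{aligned} J_y(\theta^{(y)}) &= \frac{1}{2} \int \left(q(y \mid x, \theta^{(y)}) - p(y \mid x)\right)^2 p(x)\, dx \\ &= \frac{1}{2} \int q(y \mid x, \theta^{(y)})^2 p(x)\, dx - \int q(y \mid x, \theta^{(y)})\, p(y \mid x)\, p(x)\, dx + \frac{1}{2} \int p(y \mid x)^2 p(x)\, dx \end{aligned}$$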

where p(x) is the probability density of the training samples $\{x_i\}_{i=1}^{n}$.
By Bayes' formula,

$$p(y \mid x)\, p(x) = p(x, y) = p(x \mid y)\, p(y)$$

Hence $J_y$ can be reformulated as

$$J_y(\theta^{(y)}) = \frac{1}{2} \int q(y \mid x, \theta^{(y)})^2 p(x)\, dx - \int q(y \mid x, \theta^{(y)})\, p(x \mid y)\, p(y)\, dx + \frac{1}{2} \int p(y \mid x)^2 p(x)\, dx$$

Note that the first and second terms in the equation above are expectations with respect to p(x) and p(x|y), respectively, which usually cannot be computed directly. The last term is independent of θ and can therefore be omitted.
Since p(x|y) is the probability density of samples x belonging to class y, the first and second terms can be estimated by the following sample averages:

$$\frac{1}{n} \sum_{i=1}^{n} q(y \mid x_i, \theta^{(y)})^2, \qquad \frac{1}{n_y} \sum_{i: y_i = y} q(y \mid x_i, \theta^{(y)})\, p(y)$$

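Next, estimating the class prior p(y) by $n_y / n$, where $n_y$ is the number of training samples in class y, and adding a regularization term, we obtain the following learning criterion:

$$\hat{J}_y(\theta^{(y)}) = \frac{1}{2n} \sum_{i=1}^{n} q(y \mid x_i, \theta^{(y)})^2 - \frac{1}{n} \sum_{i: y_i = y} q(y \mid x_i, \theta^{(y)}) + \frac{\lambda}{2n} \|\theta^{(y)}\|^2$$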

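Let $\pi^{(y)} = (\pi_1^{(y)}, \dots, \pi_n^{(y)})^T$ with

$$\pi_i^{(y)} = \begin{cases} 1 & (y_i = y) \\ 0 & (y_i \neq y) \end{cases}$$

and let $\Phi$ be the $n \times b$ design matrix with entries $\Phi_{i,j} = \phi_j(x_i)$. Then

$$\hat{J}_y(\theta^{(y)}) = \frac{1}{2n} \theta^{(y)T} \Phi^T \Phi\, \theta^{(y)} - \frac{1}{n} \theta^{(y)T} \Phi^T \pi^{(y)} + \frac{\lambda}{2n} \|\theta^{(y)}\|^2 .$$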
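As a sanity check, the sketch below evaluates the criterion once as a sum over samples and once in matrix form, using the 1/n weighting of the matrix expression, and the two values should agree up to rounding error. The data are random and the names (`Phi`, `theta`, `p`, `J_sum`, `J_mat`) are assumptions made for this illustration.

```matlab
% Illustrative random data: n samples, b basis functions, class of interest yy.
n = 6; b = 3; lambda = 0.1; yy = 2;
Phi   = randn(n, b);            % Phi(i,j) = phi_j(x_i)
theta = randn(b, 1);            % theta^(y) for the class yy
y     = [1 2 2 3 1 2]';         % hypothetical labels
p     = double(y == yy);        % indicator vector pi^(y)

q = Phi * theta;                % q(i) = q(yy | x_i, theta^(y))
% Criterion written as sums over the samples.
J_sum = sum(q.^2) / (2*n) - sum(q(y == yy)) / n + lambda / (2*n) * (theta' * theta);
% The same criterion written with the design matrix and the indicator vector.
J_mat = (theta' * (Phi' * Phi) * theta) / (2*n) - (theta' * Phi' * p) / n ...
        + lambda / (2*n) * (theta' * theta);
```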
This criterion is a convex quadratic function of $\theta^{(y)}$, so the analytic solution is obtained by setting its first-order derivative (gradient) to zero:

$$\hat{\theta}^{(y)} = \left(\Phi^T \Phi + \lambda I\right)^{-1} \Phi^T \pi^{(y)} .$$
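A minimal sketch of this closed-form solution on random data follows. The design matrix and indicator vector are made up, and the linear system is solved with the backslash operator rather than an explicit inverse, which is the usual MATLAB practice. The gradient of the regularized criterion, (Φ'Φθ - Φ'π + λθ)/n, is evaluated at the solution to confirm that it vanishes.

```matlab
n = 8; b = 3; lambda = 0.1;
Phi = randn(n, b);                     % hypothetical design matrix
p   = double([1 0 0 1 0 1 0 0]');      % hypothetical indicator vector pi^(y)

% Solve (Phi'*Phi + lambda*I) * theta = Phi'*p with backslash instead of inv().
theta_hat = (Phi' * Phi + lambda * eye(b)) \ (Phi' * p);

% Gradient of the criterion at theta_hat; it should be zero up to rounding error.
grad = (Phi' * Phi * theta_hat - Phi' * p + lambda * theta_hat) / n;
```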
Because this estimate of the posterior probability can be negative, negative outputs are rounded up to zero and the result is normalized so that the estimates sum to one over the classes:

$$\hat{p}(y \mid x) = \frac{\max\left(0, \hat{\theta}^{(y)T} \phi(x)\right)}{\sum_{y'=1}^{c} \max\left(0, \hat{\theta}^{(y')T} \phi(x)\right)}$$
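The clipping and normalization can be illustrated on a small matrix of raw model outputs. The values in `F` are hypothetical; each row corresponds to one pattern and each column to one class.

```matlab
% Hypothetical raw outputs theta_hat^(y)' * phi(x): rows are patterns, columns are classes.
F = [ 0.8  0.3 -0.1;
     -0.2  0.5  0.5];
F_pos = max(0, F);                                        % round negative outputs up to zero
P_hat = F_pos ./ repmat(sum(F_pos, 2), 1, size(F, 2));    % normalize each row to sum to one
```

Note that a row whose raw outputs are all non-positive would produce a division by zero; this sketch does not handle that case.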

Again, we take the Gaussian kernel model as an example:

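% Least-squares probabilistic classification with a Gaussian kernel model.
n=90; c=3; y=ones(n/c,1)*(1:c); y=y(:);
x=randn(n/c,c)+repmat(linspace(-3,3,c),n/c,1); x=x(:);
hh=2*1^2; x2=x.^2; l=0.1; N=100; X=linspace(-5,5,N)';
k=exp(-(repmat(x2,1,n)+repmat(x2',n,1)-2*x*(x'))/hh); % train-train kernel matrix
K=exp(-(repmat(X.^2,1,n)+repmat(x2',N,1)-2*X*(x'))/hh); % test-train kernel matrix
Kt=zeros(N,c);
for yy=1:c
    yk=(y==yy); ky=k(:,yk); % kernels centered at the samples of class yy
    ty=(ky'*ky+l*eye(sum(yk)))\(ky'*yk); % regularized analytic solution
    Kt(:,yy)=max(0,K(:,yk)*ty); % round negative outputs up to zero
end
ph=Kt./repmat(sum(Kt,2),1,c); % normalize over the classes
figure(1); clf; hold on; axis([-5,5,-0.3,1.8]);
% Plot the estimated posteriors ph (the original listing plotted an undefined C).
plot(X,ph(:,1),'b-');
plot(X,ph(:,2),'r--');
plot(X,ph(:,3),'g:');
plot(x(y==1),-0.1*ones(n/c,1),'bo');
plot(x(y==2),-0.2*ones(n/c,1),'rx');
plot(x(y==3),-0.1*ones(n/c,1),'gv');
legend('p(y=1|x)','p(y=2|x)','p(y=3|x)');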

[Figure: estimated class posteriors plotted by the code above]

3. Summary

Logistic regression is well suited to small training sets because it works in a simple way. When the number of samples becomes large, however, it is better to turn to the least-squares probabilistic classifier, whose parameters are obtained analytically rather than by iterative updates.

0 0