CS229 Lecture Notes (2): Logistic Regression


Logistic Regression

  • Binary classification problem

  • Why OLS (linear) regression fails on binary classification problems:

    • it is hard to define a sensible classification threshold
    • predictions $h_\theta(x) > 1$ or $h_\theta(x) < 0$ make no sense when $y \in \{0, 1\}$
  • Hypothesis:

    $h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$

    where
    $g(z) = \frac{1}{1 + e^{-z}}$
    is called the logistic function or the sigmoid function.
    A useful property of the sigmoid function:
    $g'(z) = g(z)\,(1 - g(z))$

    In theory, it seems that any smooth, monotonically increasing function whose range is $[0, 1]$ could serve as $g(z)$ in the hypothesis. However, after studying GLMs and generative learning algorithms, we will see the reason for choosing the sigmoid function here. (A quick numerical check of the derivative property follows this list.)
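As a sanity check of the derivative property, here is a minimal NumPy sketch (the helper name `sigmoid` and the test grid are mine, not from the notes) comparing $g(z)(1-g(z))$ against a central finite-difference estimate of $g'(z)$:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# Compare the analytic identity g'(z) = g(z) * (1 - g(z))
# against a central finite-difference approximation of g'(z).
z = np.linspace(-5.0, 5.0, 11)
analytic = sigmoid(z) * (1.0 - sigmoid(z))
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
print(np.max(np.abs(analytic - numeric)))  # tiny, so the identity checks out
```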

Maximum Likelihood Estimation

  • Probabilistic assumption: Bernoulli distribution

    $p(y \mid x; \theta) = (h_\theta(x))^{y}\,(1 - h_\theta(x))^{1 - y}$

  • Likelihood function:

    $L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} (h_\theta(x^{(i)}))^{y^{(i)}}\,(1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}$

    log likelihood:
    $\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big)$

  • Gradient ascent (since we’re maximizing rather than minimizing a function now):

    $\theta := \theta + \alpha \nabla_\theta \ell(\theta)$

    where, for a single training example $(x, y)$,
    $\frac{\partial}{\partial \theta_j} \ell(\theta) = (y - h_\theta(x))\,x_j$

    For logistic regression we thus obtain an update rule that looks identical to the one for linear regression, except that here $h_\theta(x)$ is a nonlinear function of $\theta^T x$. Is this merely a coincidence, or is there some deeper reason behind it? We will answer this question when we study GLM models. (A runnable sketch of the gradient-ascent update follows.)
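Below is a minimal NumPy sketch of batch gradient ascent on $\ell(\theta)$ using the update rule above; the function name `log_likelihood_ascent` and the toy data are illustrative, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_ascent(X, y, alpha=0.01, n_iters=2000):
    """Batch gradient ascent on l(theta) for logistic regression.

    X: (m, n) design matrix (first column of ones for the intercept);
    y: (m,) array of 0/1 labels.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)   # h_theta(x^(i)) for every example
        grad = X.T @ (y - h)     # grad_j = sum_i (y^(i) - h_theta(x^(i))) x_j^(i)
        theta = theta + alpha * grad   # theta := theta + alpha * grad l(theta)
    return theta

# Toy usage on noisy (non-separable) 1-D data.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = (x + rng.normal(size=100) > 0).astype(float)
X = np.column_stack([np.ones(100), x])
print(log_likelihood_ascent(X, y))
```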

Digression: The perceptron learning algorithm

  • Hypothesis:

    $h_\theta(x) = g(\theta^T x)$

    where
    $g(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}$

    Note that this $g(z)$ is not differentiable at $z = 0$, so it is hard to give the perceptron a probabilistic interpretation and to derive it via maximum likelihood.

  • Perceptron learning algorithm (a toy implementation follows this list):

    $\theta_j := \theta_j + \alpha\,\big(y^{(i)} - h_\theta(x^{(i)})\big)\,x_j^{(i)}$
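Here is a minimal NumPy sketch of this update rule on a linearly separable toy set; the name `perceptron_step` and the data are mine. With 0/1 labels, a correctly classified example leaves $\theta$ unchanged, while a mistake moves $\theta$ by $\pm\alpha x$:

```python
import numpy as np

def perceptron_step(theta, x_i, y_i, alpha=1.0):
    """One perceptron update: theta_j := theta_j + alpha (y - h(x)) x_j,
    with 0/1 labels and h(x) = 1{theta^T x >= 0}, as in the notes."""
    h = 1.0 if theta @ x_i >= 0 else 0.0
    return theta + alpha * (y_i - h) * x_i

# Sweep repeatedly over a linearly separable toy set (bias feature first).
X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
theta = np.zeros(2)
for _ in range(10):
    for x_i, y_i in zip(X, y):
        theta = perceptron_step(theta, x_i, y_i)
print(theta)  # a separating theta; updates stop once all points are correct
```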

Newton’s method for maximizing l(θ)

  • Newton’s method: to find a value of $\theta$ such that $f(\theta) = 0$, we perform the following update:

    $\theta := \theta - \frac{f(\theta)}{f'(\theta)}$

  • Using Newton’s method to maximize $\ell(\theta)$ by letting $f(\theta) = \ell'(\theta) = 0$:

    $\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$

  • Newton-Raphson method (also called Fisher scoring when applied to the logistic regression problem): a vectorized generalization of Newton’s method:

    $\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)$

    where
    $H_{ij} = \frac{\partial^2 \ell(\theta)}{\partial \theta_i \, \partial \theta_j}$
    is called the Hessian matrix.

Although computing (and inverting) the Hessian matrix is relatively expensive, Newton’s method incorporates second-order derivative information and therefore typically converges to the maximum of the log likelihood in far fewer iterations than gradient descent. A minimal sketch is given below.
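The sketch below implements the Newton-Raphson update for logistic regression, assuming the standard closed-form Hessian $H = -X^T \operatorname{diag}\!\big(h_\theta(x^{(i)})(1 - h_\theta(x^{(i)}))\big) X$ (a standard result for logistic regression, not derived in these notes); names and data are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logreg(X, y, n_iters=10):
    """Newton-Raphson for the logistic regression log likelihood.

    Uses grad l = X^T (y - h) and the negative-definite Hessian
    H = -X^T diag(h * (1 - h)) X, then theta := theta - H^{-1} grad.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)
        H = -(X * (h * (1 - h))[:, None]).T @ X   # X^T diag(w) X, negated
        theta = theta - np.linalg.solve(H, grad)  # theta := theta - H^{-1} grad
    return theta

# Same toy data as the gradient-ascent sketch; Newton's method
# typically converges in a handful of iterations here.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = (x + rng.normal(size=100) > 0).astype(float)
X = np.column_stack([np.ones(100), x])
print(newton_logreg(X, y))
```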
