Machine Learning Notes 1 - Supervised Learning


1.1 Classification and logistic regression
The classification problem is just like the regression problem, except that the values $y$ we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem, in which $y$ can take on only two values, 0 and 1. For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$ may be some features of a piece of email, and $y$ may be 1 if it is spam and 0 otherwise. 0 is also called the negative class and 1 the positive class, and they are sometimes also denoted by the symbols "-" and "+". Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the label for the training example.
1.2 Logistic regression
The logistic (sigmoid) function gives the hypothesis

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}},$$

where

$$g(z) = \frac{1}{1 + e^{-z}}.$$

[Figure: plot of the sigmoid function $g(z)$]
A useful property of the sigmoid is that its derivative can be expressed in terms of the function itself:

$$\begin{aligned}
g'(z) &= \frac{d}{dz}\,\frac{1}{1+e^{-z}} \\
&= \frac{1}{(1+e^{-z})^2}\left(e^{-z}\right) \\
&= \frac{1}{1+e^{-z}}\cdot\left(1 - \frac{1}{1+e^{-z}}\right) \\
&= g(z)\left(1-g(z)\right).
\end{aligned}$$
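A small NumPy check (variable names are my own, not from the notes) of the sigmoid and the identity $g'(z) = g(z)(1-g(z))$ against a finite-difference approximation:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -1.0, 0.0, 0.5, 2.0])
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # finite-difference g'(z)
analytic = sigmoid(z) * (1 - sigmoid(z))                      # g(z)(1 - g(z))
print(np.max(np.abs(numeric - analytic)))                     # tiny (~1e-11)
```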

Given the logistic regression model, how do we fit $\theta$ to it?
Let us assume that

$$P(y=1\mid x;\theta) = h_\theta(x)$$

$$P(y=0\mid x;\theta) = 1 - h_\theta(x)$$

This can be written more compactly as

$$p(y\mid x;\theta) = \left(h_\theta(x)\right)^{y}\left(1 - h_\theta(x)\right)^{1-y}.$$

Assuming the $m$ training examples were generated independently, the likelihood of the parameters is

$$\begin{aligned}
L(\theta) &= p(Y\mid X;\theta) \\
&= \prod_{i=1}^{m} p\left(y^{(i)}\mid x^{(i)};\theta\right) \\
&= \prod_{i=1}^{m} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1-y^{(i)}}.
\end{aligned}$$

It will be easier to maximize the log likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} y^{(i)}\log h_\theta\left(x^{(i)}\right) + \left(1-y^{(i)}\right)\log\left(1 - h_\theta\left(x^{(i)}\right)\right).$$
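As a quick sanity check, here is a minimal NumPy sketch (function and variable names are my own, not from the notes) that evaluates this log likelihood for a given $\theta$:

```python
import numpy as np

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ].

    X is the (m, n) design matrix, y is an (m,) vector of 0/1 labels.
    """
    h = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta(x^{(i)}) for every example
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```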

To maximize the likelihood we can use gradient ascent; written in vector notation, the update is

$$\theta := \theta + \alpha\,\nabla_\theta\,\ell(\theta).$$

Working with a single training example $(x, y)$ and taking derivatives:

$$\begin{aligned}
\frac{\partial}{\partial\theta_j}\ell(\theta)
&= \left( y\,\frac{1}{g(\theta^T x)} - (1-y)\,\frac{1}{1-g(\theta^T x)} \right)\frac{\partial}{\partial\theta_j}\, g(\theta^T x) \\
&= \left( y\,\frac{1}{g(\theta^T x)} - (1-y)\,\frac{1}{1-g(\theta^T x)} \right) g(\theta^T x)\left(1-g(\theta^T x)\right)\frac{\partial}{\partial\theta_j}\,\theta^T x \\
&= \left( y\left(1 - g(\theta^T x)\right) - (1-y)\,g(\theta^T x) \right) x_j \\
&= \left( y - h_\theta(x) \right) x_j,
\end{aligned}$$

where the second step uses the property $g'(z) = g(z)\left(1-g(z)\right)$ derived above.

This therefore gives us the stochastic gradient ascent rule:
$$\theta_j := \theta_j + \alpha\left( y^{(i)} - h_\theta\left(x^{(i)}\right) \right) x_j^{(i)}$$
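A minimal sketch of this stochastic gradient ascent rule (the learning rate, epoch count, and function names are my own choices, not from the notes):

```python
import numpy as np

def logistic_regression_sga(X, y, alpha=0.1, epochs=100):
    """Fit theta by stochastic gradient ascent on the log likelihood.

    X is the (m, n) design matrix (include a column of ones for the intercept);
    y is an (m,) vector of 0/1 labels.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in np.random.permutation(m):            # one example at a time
            h = 1.0 / (1.0 + np.exp(-X[i] @ theta))   # h_theta(x^{(i)})
            theta += alpha * (y[i] - h) * X[i]        # theta_j += alpha (y^{(i)} - h) x_j^{(i)}
    return theta
```

Updating on one example at a time, rather than summing the gradient over the whole training set, is what makes this the stochastic version of the rule.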

1.3 Digression: The perceptron learning algorithm
Consider modifying the logistic regression method to "force" it to output values that are either 0 or 1 exactly. To do so, it seems natural to change the definition of $g$ to be the threshold function:

$$g(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$

If we then let $h_\theta(x) = g(\theta^T x)$ as before but using this modified definition of $g$, and if we use the update rule

$$\theta_j := \theta_j + \alpha\left( y^{(i)} - h_\theta\left(x^{(i)}\right) \right) x_j^{(i)},$$

then we have the perceptron learning algorithm.
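A minimal sketch of the perceptron learning algorithm as described above (function names and the default learning rate are my own choices):

```python
import numpy as np

def perceptron(X, y, alpha=1.0, epochs=10):
    """Perceptron learning: threshold hypothesis with the same-looking update rule.

    X is the (m, n) design matrix (include a column of ones for the intercept);
    y is an (m,) vector of 0/1 labels.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in range(m):
            h = 1.0 if X[i] @ theta >= 0 else 0.0  # threshold g(theta^T x), not the sigmoid
            theta += alpha * (y[i] - h) * X[i]     # only changes theta when the prediction is wrong
    return theta
```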
