Machine Learning - Logistic Regression - Two-class Classification
This series of articles is the study notes of "Machine Learning" by Prof. Andrew Ng, Stanford University. This article is the notes of week 3, Logistic Regression, part I. It covers classification, the hypothesis representation, the decision boundary, the cost function, and gradient descent.
Logistic Regression
1. Classification
In this and the next few sections, we'll start to talk about classification problems, where the variable y that you want to predict takes on discrete values. We'll develop an algorithm called logistic regression, which is one of the most popular and most widely used learning algorithms today. Here are some examples of classification problems.
Examples of Classification
Two-class (or binary) classification:
- Email: Spam / Not Spam?
- Online Transactions: Fraudulent (Yes / No)?
- Tumor: Malignant / Benign?
Multi-class classification:
How to develop a classification algorithm?
One thing we could try is to apply linear regression to this problem and threshold the classifier output hθ(x) at 0.5 (a small code sketch follows the two rules below):
- If hθ(x) ≥0.5, predict “y = 1”
- If hθ(x) <0.5, predict “y = 0”
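As a minimal sketch (Python, not part of the original notes), the thresholding rule is just:

```python
def predict_label(h_x, threshold=0.5):
    """Map a hypothesis output to a class label: 1 if h(x) >= 0.5, else 0."""
    return 1 if h_x >= threshold else 0
```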
Add an extra training example (on the far right)

Let me extend the horizontal axis a little bit, and say we get one more training example way out there on the right. Notice that this additional training example doesn't actually change anything: looking at the training set, it's still pretty clear what a good hypothesis is. Everything to the right of a certain tumor size should be predicted positive, and everything to the left should be predicted negative, because in this training set all the tumors larger than that value are malignant and all the tumors smaller than it are benign. But once we've added that extra example, if we now run linear regression, we get a different straight-line fit to the data, which might look like the blue line. If we threshold that hypothesis at 0.5, the threshold point moves to the right, so some malignant examples are now predicted as negative. That seems a pretty bad thing for linear regression to have done. (The code sketch after the summary below reproduces this effect.)
So applying linear regression to a classification problem often isn't a great idea.
Classification: y=1 or 0
But for linear regression: hθ(x) can be >1 or <0
Logistic regression is a classification algorithm, despite the word "regression" in its name.
Logistic Regression: 0 ≤ hθ(x) ≤ 1
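To make the failure mode concrete, here is a small illustrative sketch in Python/NumPy (the data values are hypothetical, chosen only to mimic the tumor-size example from the lecture):

```python
import numpy as np

def fit_linear(x, y):
    """Ordinary least squares fit of h(x) = theta0 + theta1 * x."""
    X = np.column_stack([np.ones_like(x), x])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def threshold_point(theta):
    """Tumor size where h(x) = 0.5, i.e. the decision threshold."""
    return (0.5 - theta[0]) / theta[1]

# Hypothetical tumor sizes and labels (1 = malignant, 0 = benign)
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
print(threshold_point(fit_linear(x, y)))    # ~4.5: separates the classes well

# One extra malignant example far to the right flattens the line and
# shifts the threshold to ~6.2, so the malignant tumor at size 6 is
# now misclassified as benign.
x2, y2 = np.append(x, 50.0), np.append(y, 1.0)
print(threshold_point(fit_linear(x2, y2)))
```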
2. Hypothesis representation
Let's start talking about logistic regression. In this section, I'd like to show you the hypothesis representation.
Logistic Regression Model
That is, what is the function we're going to use to represent our hypothesis when we have a classification problem. We want 0 ≤ hθ(x) ≤ 1.
The sigmoid function g(z)
The logistic regression hypothesis is hθ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(−z)).
The sigmoid function is also called the logistic function; it maps any real number into the interval (0, 1), so 0 < hθ(x) < 1.
Interpretation of Hypothesis Output
hθ(x) = estimated probability that y = 1 on input x; that is, hθ(x) = P(y = 1 | x; θ). Since y must be either 0 or 1, P(y = 0 | x; θ) = 1 − hθ(x).
Example:
If x = [x0; x1] = [1; tumorSize] and hθ(x) = 0.7, tell the patient there is a 70% chance of the tumor being malignant.
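A minimal sketch of the hypothesis in Python/NumPy (illustrative code, not from the course; the parameter values are made up):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): estimated probability that y = 1."""
    return sigmoid(theta @ x)

theta = np.array([-3.0, 0.5])   # hypothetical fitted parameters
x = np.array([1.0, 7.0])        # x0 = 1 (intercept term), x1 = tumor size
print(hypothesis(theta, x))     # ~0.62, i.e. about a 62% chance of y = 1
```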
3. Decision boundary
Logistic Regression
Suppose we predict "y = 1" if hθ(x) ≥ 0.5, and predict "y = 0" if hθ(x) < 0.5.
Since the sigmoid satisfies g(z) ≥ 0.5 exactly when z ≥ 0, this is the same as predicting "y = 1" whenever θᵀx ≥ 0, and "y = 0" whenever θᵀx < 0.
Decision boundary
Example 1: hθ(x) = g(θ0 + θ1x1 + θ2x2) with θ = [−3; 1; 1]
Predict "y = 1" if −3 + x1 + x2 ≥ 0, that is, if x1 + x2 ≥ 3
Predict "y = 0" if −3 + x1 + x2 < 0, that is, if x1 + x2 < 3
The line x1 + x2 = 3 is the decision boundary.
Example 2: hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1² + θ4x2²) with θ = [−1; 0; 0; 1; 1]
Predict "y = 1" if −1 + x1² + x2² ≥ 0, that is, if x1² + x2² ≥ 1
Predict "y = 0" if −1 + x1² + x2² < 0, that is, if x1² + x2² < 1
If we plot the curve x1² + x2² = 1, you will recognize that as the equation of a circle of radius one, centered at the origin. That is the decision boundary: everything outside the circle we predict as y = 1, and everything inside the circle we predict as y = 0.
So by adding these more complex polynomial terms to my features, I can get more complex decision boundaries that don't just try to separate the positive and negative examples with a straight line; in this example, the decision boundary is a circle.
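The circular boundary from example 2, sketched the same way (illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters over the polynomial features [1, x1, x2, x1^2, x2^2]
theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

def predict(x1, x2):
    """Predict 1 iff -1 + x1^2 + x2^2 >= 0, i.e. outside the unit circle."""
    x = np.array([1.0, x1, x2, x1**2, x2**2])
    return int(sigmoid(theta @ x) >= 0.5)

print(predict(0.0, 0.0))   # 0: inside the circle
print(predict(1.0, 1.0))   # 1: 1 + 1 = 2 >= 1, outside the circle
```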
The decision boundary is a property of the hypothesis and its parameters
The decision boundary is a property, not of the training set, but of the hypothesis and its parameters. As soon as we're given the parameter vector θ, the decision boundary is defined (here, the circle). The training set is not what we use to define the decision boundary; the training set may be used to fit the parameters θ, and we'll talk about how to do that later. But once we have the parameters θ, they are what define the decision boundary.
4. Cost function
Logistic Regression Model
Training set: {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))} of m examples, with y ∈ {0, 1}
Hypothesis: hθ(x) = 1 / (1 + e^(−θᵀx))
How to choose parameters θ?
Cost Function
Linear Regression
In linear regression we used the squared-error cost, J(θ) = (1/m) Σ_{i=1}^{m} Cost(hθ(x^(i)), y^(i)), where the cost of a single example can be written as
Cost(hθ(x), y) = (1/2)(hθ(x) − y)²
This cost function worked fine for linear regression, but it turns out that if we use this particular cost function for logistic regression, J(θ) becomes a non-convex function of the parameters θ. Here's what I mean by non-convex: for logistic regression, the hypothesis h contains the nonlinearity 1 / (1 + e^(−θᵀx)), and plugging this pretty complicated nonlinear function into the squared-error cost produces a J(θ) with many local optima.
(Figure: on the left, a non-convex J(θ) with many local optima; on the right, a convex, bow-shaped function.)
Whereas, in contrast, what we would like is a cost function J(θ) that is convex, a single bow-shaped function like the curve on the right, so that if we run gradient descent on it, we are guaranteed to converge to the global minimum.
Logistic Regression Cost Function of a Single Example
Cost(hθ(x), y) = −log(hθ(x)) if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x)) if y = 0
Here is the plot of −log(z) for z ∈ (0, 1]; this is the shape of the cost when y = 1.
First, if y = 1 and hθ(x) = 1, in other words if the hypothesis predicts exactly hθ(x) = 1 and y is exactly equal to what it predicted, then the cost is 0. That's where we'd like it to be, because if we correctly predict the output y, then the cost is 0.
Cost = 0, if y = 1 and hθ(x) = 1
But as hθ(x) → 0, Cost → ∞
This captures the intuition that if hθ(x) = 0 (predicting P(y = 1 | x; θ) = 0) but y = 1, we'll penalize the learning algorithm by a very large cost.
Cost = 0, if y = 0 and hθ(x) = 0
But as hθ(x) → 1, Cost → ∞
If y = 0, the cost is −log(1 − hθ(x)). If you plot −log(1 − z) for z ∈ [0, 1), you get a curve that is 0 at z = 0 and goes to positive infinity as z → 1. So when y = 0, the cost blows up as hθ(x) goes to 1. This captures the intuition that if the hypothesis predicts hθ(x) = 1 with certainty, saying y is absolutely going to be 1, but y turns out to be 0, then it makes sense to charge the learning algorithm a very large cost. Conversely, if hθ(x) = 0 and y = 0, then the hypothesis predicted exactly what happened, and the cost is 0.
In this section, we defined the cost function for a single training example. The topic of convexity analysis is beyond the scope of this course, but it can be shown that with this particular choice of cost function, we get a convex optimization problem: the overall cost function J(θ) will be convex and free of local optima.
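A small sketch of this per-example cost in Python (illustrative; the eps clipping is my own addition to keep log() finite near 0 and 1):

```python
import numpy as np

def cost_single(h_x, y, eps=1e-12):
    """Logistic regression cost for a single example:
    -log(h)     if y = 1  (cost -> infinity as h -> 0)
    -log(1 - h) if y = 0  (cost -> infinity as h -> 1)
    """
    h = np.clip(h_x, eps, 1.0 - eps)
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

print(cost_single(0.99, 1))   # small cost: confident and correct
print(cost_single(0.99, 0))   # large cost: confident but wrong
```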
5. Simplified cost function and gradient descent
Logistic Regression Cost Function
J(θ) = (1/m) Σ_{i=1}^{m} Cost(hθ(x^(i)), y^(i))
Note that y = 1 or y = 0 always, so the two cases can be combined into a single expression, and the cost function can be written as
J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]
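A vectorized sketch of J(θ) in Python/NumPy (illustrative; X is assumed to be the m × (n+1) design matrix with a leading column of ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum(y*log(h) + (1 - y)*log(1 - h))."""
    m = len(y)
    h = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)
    return -(y @ np.log(h) + (1.0 - y) @ np.log(1.0 - h)) / m
```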
To fit parameters θ
Given this cost function J(θ), in order to fit the parameters, what we're going to do is try to find the parameters θ that minimize J(θ). Minimizing it gives us some set of parameters θ.
Finally, if we're given a new example with some set of features x, we can take the θ we fit to our training set and output our prediction:
To make a prediction given a new x, output hθ(x) = 1 / (1 + e^(−θᵀx))
And just to remind you, we interpret the output of the hypothesis as the probability that y equals one, given the input x and parameterized by θ: hθ(x) = P(y = 1 | x; θ). You can think of the hypothesis as estimating the probability that y = 1.
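Prediction with fitted parameters, as a short sketch (illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X, threshold=0.5):
    """0/1 class predictions: 1 wherever P(y = 1 | x; theta) >= 0.5."""
    return (sigmoid(X @ theta) >= threshold).astype(int)
```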
Gradient Descent
Want min_θ J(θ):
Repeat { θj := θj − α ∂J(θ)/∂θj } (simultaneously update all θj)
If we compute the partial derivative term in the update equation above, we get ∂J(θ)/∂θj = (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) xj^(i). So if you take this partial derivative term and plug it back in, we can write out our gradient descent algorithm as follows:
Repeat { θj := θj − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) xj^(i) }
In Linear Regression: hθ(x) = θᵀx
In Logistic Regression: hθ(x) = 1 / (1 + e^(−θᵀx))
Notice that the update rule is cosmetically identical to the one for linear regression; what has changed is the definition of the hypothesis hθ(x).
Make sure your gradient descent works correctly

In an earlier video, when we were talking about gradient descent for linear regression, we talked about how to monitor gradient descent to make sure that it is converging. I usually apply that same method to logistic regression too, to monitor gradient descent and make sure it's converging correctly. Hopefully, you can figure out how to apply that technique to logistic regression yourself.
Plot J(θ) as a function of the number of iterations and make sure J(θ) is decreasing on every iteration.
When implementing logistic regression with gradient descent, we have all of these different parameter values, θ0 down to θn, that we need to update using this expression. One thing we could do is use a for loop, for j = 0 to n, updating each of these parameter values in turn. But rather than using a for loop, ideally we would use a vectorized implementation, which can update all n + 1 parameters in one fell swoop. To check your own understanding, you might see if you can figure out how to write the vectorized implementation of this algorithm yourself; a sketch is given below.
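Here is a vectorized sketch of the full training loop in Python/NumPy (illustrative, not the course's Octave code; it also records J(θ) each iteration so you can check that it decreases, as suggested above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=1000, eps=1e-12):
    """Vectorized logistic regression by batch gradient descent.

    X: (m, n+1) design matrix with a leading column of ones.
    y: (m,) labels in {0, 1}.
    Update rule: theta := theta - (alpha/m) * X^T (g(X theta) - y).
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    history = []                                  # J(theta) per iteration
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta -= (alpha / m) * (X.T @ (h - y))    # all n+1 updates at once
        hc = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)
        history.append(-(y @ np.log(hc) + (1 - y) @ np.log(1 - hc)) / m)
    return theta, history

# Tiny hypothetical dataset: y = 1 roughly when x1 > 2
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta, history = gradient_descent(X, y)
assert all(a >= b for a, b in zip(history, history[1:]))  # J never increases
```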
Feature Scaling

The idea of feature scaling also applies to gradient descent for logistic regression: if your features are on very different ranges, scaling them can help gradient descent converge faster.
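A mean-normalization sketch (illustrative; apply it to the feature columns, not the column of ones):

```python
import numpy as np

def scale_features(X):
    """Mean-normalize each column: (x - mean) / std."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0   # guard against division by zero for constant columns
    return (X - mu) / sigma, mu, sigma
```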