CalTech machine learning, video 9 review note (The Linear Model II)

start review CalTech machine learning,


video 09, the Linear Model II


9:08 2014-09-28
generalization analysis


9:16 2014-09-28
linear classification:


perceptron algorithm, pocket algorithm


9:17 2014-09-28
linear surface => quadratic surface


9:20 2014-09-28
think of the VC inequality as a promise of providing you with a warranty; in order for the warranty to be valid, you cannot look at the data before you choose the model, because that will forfeit the warranty.


9:43 2014-09-28
If I do the analysis correctly, I'm going to charge you not the VC dimension of the final hypothesis you got, but the VC dimension of the entire hypothesis space that you explored in your mind in order to get there.


9:44 2014-09-28
you have acted as a learning algorithm unknowingly.


9:45 2014-09-28
now you look at the data and realize that some of the coefficients are zero: I don't need this, I don't need this, ... You did it very quickly in your mind.


9:46 2014-09-28
so the hypothesis set you ended up with came from a hierarchical learning process.


9:46 2014-09-28
first you learned, then you passed it to the algorithm to learn.


9:47 2014-09-28
the entire hypothesis set is what you start with.


9:47 2014-09-28
Lesson learned:


looking at the data before choosing the model.


9:48 2014-09-28
this can be hazardous to your health; not to your health, but to your "generalization health".


9:48 2014-09-28
if you look at the data, we say that you did the learning.


9:49 2014-09-28
this is a manifestation of the biggest trap that practitioners fall into.


9:50 2014-09-28
when you do machine learning, I want you to learn from the data, and choosing the model is the tricky part.


9:51 2014-09-28
let me look at the data, and just pick something suitable.


9:52 2014-09-28
you're allowed to do that; I'm not saying that this is against the law. You can do it, just charge accordingly.


9:52 2014-09-28
remember, if you do this and end up with a small hypothesis set with a small VC dimension, you have already forfeited the warranty given to you by the VC inequality according to that VC dimension.


9:54 2014-09-28
you snoop into the data


9:54 2014-09-28
data snooping


9:54 2014-09-28
you look at the data before you choose the model.


9:54 2014-09-28
but there are other forms of data snooping that are so subtle that even a smart person may fall into them.


9:55 2014-09-28
I'm not dismissing the question of how to choose a model. There will be ways to choose the model; when I talk about validation, model selection will be the order of the day.


9:56 2014-09-28
it's a model selection that does not contaminate 


the data


9:57 2014-09-28
the data here was used for choosing the model, therefore it's contaminated; it can no longer be trusted to reflect the real performance, because you already used it in learning.


9:58 2014-09-28
the linear model is an economy car; the nonlinear transformation gives you a truck. You see the truck is very strong: I can go to a high-dimensional space, I can have a very sophisticated surface. But I warned you to be careful when you drive the truck.


9:59 2014-09-28 
logistic regression: outline


* the model


* error measure


* learning algorithm


10:00 2014-09-28
this will be very representative of what machine learning is at large.


10:02 2014-09-28
the learning algorithm we use here will be the same learning algorithm we'll use for neural networks next time.


10:02 2014-09-28
A third linear model; all are based on the linear signal s = w'x:


* linear classification: h(x) = sign(s)


* linear regression: h(x) = s


* logistic regression: h(x) = θ(s)


10:04 2014-09-28
so let's put it into a picture: here are your inputs x0, x1, ..., xd; x1 through xd are your genuine inputs, and x0 takes care of the threshold.


10:06 2014-09-28
weights go with these guys, and then they're summed in order to give me s; then one linear model or the other will do different things to s. The 1st model takes s and passes it through a hard threshold in order to get plus or minus one. // +1 or -1


10:07 2014-09-28
what did we do to the signal in the case of linear regression?


10:08 2014-09-28
now when you go to the 3rd guy, which is called logistic 


regression: h(x) = θ(s)


10:09 2014-09-28
take s and apply a nonlinearity to it.


10:10 2014-09-28
it's not as harsh as the hard-threshold nonlinearity; it's somewhere between that and leaving the signal alone (the identity).


10:11 2014-09-28
and it looks like this.


10:11 2014-09-28
0 is the least it can report and 1 is the most it can report; it is bounded like this.


10:12 2014-09-28
much like the hard threshold, except for the softening of it.


10:12 2014-09-28
but it's real-valued: it can return any real value between 0 and 1.


10:13 2014-09-28
so it has something of the linear regression,


10:14 2014-09-28
and the main utility of logistic regression is that the output is going to be interpreted as a probability,


10:14 2014-09-28
and that will cover a lot of problems where we 


want to estimate the probability of something.


10:15 2014-09-28
so let's be specific, let's look at the logistic function θ


10:16 2014-09-28
θ(s) = e^s / (1 + e^s)   // s == the signal (weighted sum)


10:16 2014-09-28
it can serve as a probability, because it goes from 0 to 1.


10:16 2014-09-28
and if you look at the signal: if the signal is very, very negative, you get close to probability 0.


10:17 2014-09-28
if the signal is very very positive, you get close to 1.


10:17 2014-09-28
and signal zero corresponds to probability one half.
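
To make this concrete, here is a minimal Python sketch of the logistic function (the formula θ(s) = e^s / (1 + e^s) is the one from the lecture; the sample signal values are just illustrative):

import math

def theta(s):
    # logistic (sigmoid) function: theta(s) = e^s / (1 + e^s) = 1 / (1 + e^-s)
    return 1.0 / (1.0 + math.exp(-s))

# very negative signal -> near 0, zero signal -> 1/2, very positive signal -> near 1
for s in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(s, theta(s))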


10:17 2014-09-28
so the signal corresponds to the level of certainty of something.


10:18 2014-09-28
if I have a huge signal, I'm pretty sure that 


will eventually happen.


10:19 2014-09-28
now there are many formulas I could use to give you this shape; the shape is what I'm interested in.


10:19 2014-09-28
and I'm going to choose a particular formula.


10:20 2014-09-28
it will be a very friendly formula.


10:21 2014-09-28
so this thing is called a soft threshold, for obvious reasons; the hard version would decide one way or the other outright.


10:22 2014-09-28
so this softens it, and gives you the reliability of the decision.


10:22 2014-09-28
so think about the credit card application: it used to be deciding whether the customer is good or bad. Instead of deciding whether the customer is good or bad, which is a binary classification,


10:23 2014-09-28
what is the probability that this customer will 


be good or bad?


what is the probability of default?


10:25 2014-09-28
let the bank decide what to do according to this probability.


10:25 2014-09-28
the soft threshold reflects uncertainty. Seldom do we know a binary classification with certainty, and it is more informative to get the level of certainty as part of the deal, reflected in this soft threshold.


10:27 2014-09-28
it's also called a sigmoid, for a simple reason: it looks like a flattened-out 's'.


10:28 2014-09-28
sigmoid function, or softened threshold


10:28 2014-09-28
when we get to neural networks, there will be another closely related formula. You can invent other formulas if you will.


10:28 2014-09-28
so this is the logistic function, and here is the model,


so we know what the model does.


10:29 2014-09-28
the main idea is the probability interpretation.


10:29 2014-09-28
so we have the model: h(x) = θ(s). The model is: you take the linear signal s, pass it through this logistic function, and that will be the value of the hypothesis function at the point x that gives rise to this signal.


10:32 2014-09-28
so we think there is a probability sitting out there generating the examples; say, a probability of default based on credit information.


10:33 2014-09-28
example: unfortunate prediction of heart attack


10:34 2014-09-28
predicting heart attacks based on a number of factors


10:34 2014-09-28
the kind of input you'll have is:


input x: cholesterol level, age, weight, etc.


10:35 2014-09-28
probability of heart attack


10:40 2014-09-28
what is the probability that you will get a


heart attack within the next 5 months?


10:40 2014-09-28
the signal s = w'x,


it's a linear sum of these guys


10:41 2014-09-28
2 things to observe:


* this remains linear 


* you can think of this as a "risk score" // credit score


10:42 2014-09-28
you just give each factor an importance weight, and sum them up.


10:42 2014-09-28
although it gets translated into a probability to make it meaningful.
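
As a small illustration of the "risk score" idea, a sketch with made-up weights and normalized feature values (the factor names and numbers below are purely hypothetical, not from the lecture):

import math

# hypothetical patient features: x0 = 1 (threshold), then cholesterol, age, weight (normalized)
x = [1.0, 0.8, 0.5, 0.3]
# hypothetical importance weights
w = [-1.0, 2.0, 1.5, 0.5]

s = sum(wi * xi for wi, xi in zip(w, x))  # risk score s = w'x
p = 1.0 / (1.0 + math.exp(-s))            # translated into a probability via theta(s)
print("risk score:", s, "probability of heart attack:", p)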


10:53 2014-09-28
I'd like to make the point that this is a genuine probability.


10:54 2014-09-28
you have the hypothesis that goes from zero to one,


I'm interpreting it as a probability.


but you could think of it as a function between 0 & 1


10:55 2014-09-28
the main point here is that the output of logistic regression is treated genuinely as a probability, even during learning.


10:56 2014-09-28
this is because the data that gives to you does not


tells you the probability.


10:57 2014-09-28
Data (x, y) with binary y


10:58 2014-09-28
I don't give you the probability; I give you: here is the 1st patient, and here is their data, and since this is supervised learning, I have to give you the label.


10:59 2014-09-28
so "the probability of getting a heart attack within 12 months is 25 percent"; how the hell could I know that?


11:00 2014-09-28
I can only observe that someone got a heart attack or didn't get a heart attack. That outcome is affected by the probability, but you don't get access to the probability itself.


11:02 2014-09-28
I give you a binary output which is affected by 


the probability, 


11:03 2014-09-28
so this is a noisy case: the data is generated by a noisy target. Let's write down the noisy target in order to understand where these examples come from.


11:05 2014-09-28
Data(x, y) with binary y, generated by a noisy target.


11:05 2014-09-28
P(y|x) = f(x)       for y = +1
         1 - f(x)   for y = -1


11:05 2014-09-28
this is generated by the target that I want to learn.


11:06 2014-09-28
you want to learn a final hypothesis, which is called g(x), and which has the form of logistic regression:

g(x) = θ(w'x)

the claim you're going to end up with is that this approximates the target: g(x) ≈ f(x)


11:07 2014-09-28
you're trying to make it as true as possible 


according to some error measure we have.


11:11 2014-09-28
what is under your control are the parameters: w // the weights


11:11 2014-09-28
the question now becomes: how do I choose the weights such that the logistic regression hypothesis reflects the target function, knowing that the target function is the way the examples were generated?


11:13 2014-09-28
so let's talk about the error measure, 


11:25 2014-09-28
Error measure


11:25 2014-09-28
it's a very popular error measure


11:26 2014-09-28
we have the following plausible error measure,


which is based on likelihood, 


11:27 2014-09-28
likelihood is a well-established notion in statistics; not without controversy, but widely applied.


11:28 2014-09-28
I'm going to grade different hypotheses according to the likelihood that they are actually the target that generated the data.


11:29 2014-09-28
so I can use this in a comparative way to say that one hypothesis is more plausible than another, because the data becomes more likely under the scenario of this hypothesis, rather than that hypothesis, being the real target function.


11:30 2014-09-28
so this is the idea: you ask how likely it is to get y from x if h == f.


11:31 2014-09-28
what is the most probable hypothesis given the data?


11:32 2014-09-28
but here you ask: what is the most probable data given the hypothesis? which is backwards.


11:33 2014-09-28
this is never a completely clean thing, 


but we will sort of swallow that because it looks 


rather reasonable.


11:35 2014-09-28
under the assumption that h == f, how likely 


to get y from x?


11:36 2014-09-28
so let's use this to derive a full-fledged version of the error measure.


11:37 2014-09-28
it's already crying out for a simplification, and the simplification is this:

P(y|x) = θ(y w'x)
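
Why this simplification works (a short derivation in the lecture's notation): under the assumption h(x) = θ(w'x), the likelihood of one label is P(y = +1 | x) = θ(w'x) and P(y = -1 | x) = 1 - θ(w'x). Since the logistic function satisfies θ(-s) = 1 - θ(s), the second case equals θ(-w'x), so both cases collapse into the single formula P(y|x) = θ(y w'x).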


11:45 2014-09-28
Maximizing the likelihood, which can be 


transformed to minimizing an error measure.


11:47 2014-09-28
we're maximizing the likelihood of the hypothese


under the data set that we're given.


11:48 2014-09-28
what is the probability of the data set, under the


assumption that the hypothesis is indeed the target?


11:49 2014-09-28
maximizing respect to what?  the parameter // weight


11:50 2014-09-28
one final thing, can I do this?


11:51 2014-09-28
all you do is instead of maximizing, you minimize


11:52 2014-09-28
Ok, we're cool, so this is the problem.


11:52 2014-09-28
very sophisticated problem, we end up with something


which is rather suspicious familiar.


11:55 2014-09-28
something that involves the value of the example (xn, yn) & 


the parameters I'm trying to learn. // weight


11:55 2014-09-28
I'd like to reduce this further


11:55 2014-09-28
SGD == Stochastic Gradient Descent


11:56 2014-09-28
I'm going to officially declare it as the 


in-sample error of the logistic regression.


// Ein(w)


11:57 2014-09-28
so I minimize it, it's legitimate


11:57 2014-09-28
Ein(w) // in-sample error, error measure


11:58 2014-09-28
// e(h(xn), yn)

I'm going to call it the error measure between my hypothesis (which depends on w) applied to xn, and the value you gave me as the label for that example, which is yn. That is the way we define the error measure on individual points.


12:00 2014-09-28
label


12:00 2014-09-28
and under that definition, maximizing the likelihood is like minimizing the in-sample error.


12:02 2014-09-28
there is an interesting interpretation here,


w'xn, this is what we call a risk score,


12:03 2014-09-28
let's see agreement or disagreement between the risk score and the label, and how they affect the error measure.


12:03 2014-09-28
now if the signal is very positive // w'xn


and this guy (yn) is plus one (+1)  // unfortunately you got a heart attack


agreement => contribution to the error measure is small


12:06 2014-09-28
disagreement => error is huge


12:06 2014-09-28
this will be an error measure that we're trying to minimize


12:06 2014-09-28
it's called "cross-entropy" error
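
For reference, the cross-entropy in-sample error that falls out of the likelihood maximization (take the log of the likelihood, negate, and average over the N examples) is

Ein(w) = (1/N) Σ_{n=1..N} ln(1 + e^(-yn w'xn))

and a minimal Python sketch of it, assuming X is a list of input vectors (each with x0 = 1 prepended) and y is a list of ±1 labels:

import math

def cross_entropy_error(w, X, y):
    # Ein(w) = (1/N) * sum over n of ln(1 + exp(-yn * w'xn))
    total = 0.0
    for xn, yn in zip(X, y):
        s = sum(wi * xi for wi, xi in zip(w, xn))   # signal w'xn
        total += math.log(1.0 + math.exp(-yn * s))  # small when yn agrees with s, large when it disagrees
    return total / len(X)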


12:07 2014-09-28
now we have defined the model, and we have defined 


the error measure, the remaining order is to do the 


learning algorithm


12:08 2014-09-28
remember linear regression, we also have an error function


12:09 2014-09-28
to minimize the linear regression error 


=> pseudo-inverse  // normal equation


// projection onto the column space C(A)
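
For contrast, the one-step linear regression solution mentioned here can be sketched in a couple of lines of NumPy (assuming a data matrix X with a leading column of ones and a real-valued target vector y):

import numpy as np

def linear_regression(X, y):
    # one-step learning: w = pseudo-inverse(X) @ y, i.e. (X'X)^-1 X'y when X'X is invertible
    return np.linalg.pinv(X) @ y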


12:10 2014-09-28
but here we're out of luck: you cannot find a closed-form solution.


12:11 2014-09-28
in the absence of a closed-form solution, we usually go for an iterative solution.


12:12 2014-09-28
we just improve, improve, ... and finally we get a good solution.


12:12 2014-09-28
this is not a foreign concept to us; this is what we did with perceptrons.


12:12 2014-09-28
what we're going to do here is based on calculus; the minimization method we're going to use can be applied to any error measure, even a nonlinear one, with just a little smoothness assumed.


12:14 2014-09-28
iterative method: gradient descent


12:14 2014-09-28
a function that goes like this (bowl-shaped) is called convex, and it goes with "convex optimization".


12:14 2014-09-28
very simple, because wherever you start, you'll


get to the valley.


12:15 2014-09-28
imagine the most sophisticated nonlinear surface; then, depending on where you start, you slide down.


12:16 2014-09-28
error measure for neural networks


12:16 2014-09-28
statistical inference


12:16 2014-09-28
so what do you do with gradient descent?

* general method for nonlinear optimization

What you do is start at a point w(0); then you take a step, trying to make an improvement with that step.


12:17 2014-09-28
and the step is: take a step along the steepest slope


12:18 2014-09-28
the steepest slope is not an easy notion once you go beyond one dimension; it's no longer just "do I go left or right?", there are too many directions.


12:18 2014-09-28
let's do the following: say I'm in 3D space, in this room, and I have a very nonlinear surface going around, up & down, up & down ...


12:19 2014-09-28
I'm going to assume one thing: that the surface is twice differentiable. That is what you need to invoke gradient descent.


12:21 2014-09-28
you don't have a bird's-eye view, you only have local information around you. So the best thing to imagine is that you are sitting on the surface, you close your eyes, and all you do is feel around you and decide that this direction is more promising than that one. That's all you do in one step; then you go to the new point, and repeat, repeat ...


12:23 2014-09-28
until you get to the minimum


12:23 2014-09-28
this is the kind of iterative method you're going to use.


12:23 2014-09-28
we look at a fixed step size


12:24 2014-09-28
I'm going to do local approximations based on calculus (Taylor series), and I know this approximation will be good if the step size is not that big. If I move far, the higher-order terms kick in, and I'm not sure the conclusion ... will apply.


12:24 2014-09-28
I'm moving along a unit vector v̂


12:26 2014-09-28
and I'm going to modulate the amount of the move by a step size, which I'm going to call η.


12:27 2014-09-28
so OK, this is the amount of the move; I've already decided on the size, but I don't know which direction to go.


12:28 2014-09-28
w(1) = w(0) + ηv̂

so under this condition, you're trying to derive what v̂ is.


12:29 2014-09-28
so let's actually try to solve for it.


12:29 2014-09-28
so we're really talking about the change in the error when we move in a given direction.


12:30 2014-09-28
ΔEin   // change in in-sample error, one step


12:30 2014-09-28
what I want to do is that I want this guy(ΔEin) to be 


negative, as negative as possible.


12:31 2014-09-28
ΔEin = Ein(w(1)) - Ein(w(0)) // by the proper choice of w(1)


12:32 2014-09-28
ΔEin = Ein(w(0) + ηv̂) - Ein(w(0))


12:32 2014-09-28
using the Taylor series expansion with one term.
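
Filling in that step in the lecture's notation: to first order,

ΔEin ≈ η ∇Ein(w(0))' v̂ ≥ -η ||∇Ein(w(0))||

since v̂ is a unit vector, and the most negative change (equality) is attained when

v̂ = -∇Ein(w(0)) / ||∇Ein(w(0))||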


12:33 2014-09-28
conjugate gradient??


12:33 2014-09-28
so now you can see why it's called gradient descent: because you descend along the (negative) gradient of your error.

12:36 2014-09-28
if η is too small, it will take me forever to get there.


12:36 2014-09-28
but if η is too large, the linear approximation may not apply, so there is a compromise.


12:37 2014-09-28
so if you look at it, the best compromise is to initially have a large η, and just be more careful when you get close to the minimum.


12:39 2014-09-28
it's not a mathematical formula, it's an observation about the surface.


12:40 2014-09-28
the idea is to have η increase with the slope


12:40 2014-09-28
easy implementation: instead of taking a fixed-size step along the chosen direction, I now make the step proportional to the size of the gradient, so the step is bigger when the slope is bigger.


12:41 2014-09-28
now it's not a fixed step anymore, it's a fixed 


learning rate
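
Putting the pieces together: with v̂ = -∇Ein(w(0)) / ||∇Ein(w(0))|| and a step size proportional to ||∇Ein(w(0))||, the move ηv̂ simplifies to

Δw = -η ∇Ein(w(0)),   i.e.   w(1) = w(0) - η ∇Ein(w(0))

where η is now a fixed learning rate rather than a fixed step size.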


12:42 2014-09-28
learning rate


12:43 2014-09-28
summary of the logistic regression algorithm: just use "gradient descent" to update the weights w.
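
A minimal NumPy sketch of that algorithm (an assumption-laden illustration, not the lecture's code: X is an (N, d+1) array with a leading column of ones, y holds ±1 labels, and the learning rate and iteration count are arbitrary choices):

import numpy as np

def logistic_regression(X, y, eta=0.1, n_iters=1000):
    # gradient descent on the cross-entropy error Ein(w) = (1/N) sum ln(1 + exp(-yn w'xn))
    w = np.zeros(X.shape[1])                     # start at w(0) = 0
    for _ in range(n_iters):
        s = X @ w                                # signals w'xn for all examples
        grad = -(y[:, None] * X / (1.0 + np.exp(y * s))[:, None]).mean(axis=0)
        w -= eta * grad                          # fixed-learning-rate step along the negative gradient
    return w

def predict_proba(w, X):
    # h(x) = theta(w'x), interpreted as P(y = +1 | x)
    return 1.0 / (1.0 + np.exp(-(X @ w)))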


12:44 2014-09-28
summary of the linear models:


* perceptron // linear classification      // accept or deny


* linear regression                        // credit line


* logistic regression                      // probability of default


12:45 2014-09-28
if you apply each of these to "credit analysis", what type of thing do you implement?


12:45 2014-09-28
if you use logistic regression, you just determine the probability of default, and then let the bank decide what to do.


12:47 2014-09-28
so that was from the application domain's point of view; now let's look at it from the tool's point of view.


12:48 2014-09-28
they have different error measures:

perceptron:           binary classification error   // PLA, pocket

linear regression:    squared error                 // pseudo-inverse

logistic regression:  cross-entropy error           // gradient descent


12:49 2014-09-28
let's see the linear regression, that is the easiest,


you have pseudo-inverse, and you have one-step learning

