Machine Learning - Solving the Problem of Overfitting: Regularization


This series of articles is the study notes of "Machine Learning", by Prof. Andrew Ng, Stanford University. This article is the notes of week 3, Solving the Problem of Overfitting. It covers regularization: what overfitting is, and how to implement linear regression and logistic regression with regularization to address overfitting.


Solving the Problem of Overfitting


1. The problem of overfitting

By now, we've seen a couple of different learning algorithms, linear regression and logistic regression. They work well for many problems, but when you apply them to certain machine learning applications, they can run into a problem called overfitting that can cause them to perform very poorly. In this section, we're going to explain what this overfitting problem is, and in the next few sections we'll talk about a technique called regularization that will allow us to ameliorate or reduce the overfitting problem and get these learning algorithms to work much better.

What is overfitting?

Let's keep using our running example of predicting housing prices with linear regression where we want to predict the price as a function of the size of the house.

Example: Linear regression (housing prices)


Underfit (High bias)

One thing we could do is fit a linear function to this data, and if we do that, maybe we get that sort of straight-line fit to the data. But this isn't a very good model. Looking at the data, it seems pretty clear that as the size of the house increases, the housing prices plateau, or kind of flatten out, as we move to the right. So this algorithm does not fit the training data well; we call this problem underfitting, and another term for it is that the algorithm has high bias. Both of these roughly mean that it's just not fitting the training data very well.
The term is kind of a historical or technical one, but the idea is that if we fit a straight line to the data, it's as if the algorithm has a very strong preconception, or a very strong bias, that housing prices are going to vary linearly with their size. Despite the data to the contrary, this preconception, this bias, still causes it to fit a straight line, and this ends up being a poor fit to the data.

Just right

Now, in the middle, we could fit a quadratic function to this data set, and if we do that, maybe we get that kind of curve, and it works pretty well. At the other extreme would be if we were to fit, say, a fourth-order polynomial to the data.
For the middle case there isn't really a name, but I'm just going to call it "just right": a second-degree polynomial, a quadratic function, seems to be just right for fitting this data.

Overfitting (High variance)
So here, with the fourth-order polynomial, we have five parameters, θ0 through θ4, and with that we can actually fit a curve that passes through all five of our training examples. You might get a curve that, on the one hand, seems to do a very good job of fitting the training set, since it passes through all of my data points. But this is still a very wiggly curve, right? It goes up and down all over the place, and we don't actually think that's a good model for predicting housing prices. So this problem we call overfitting, and another term for it is that this algorithm has high variance.
The term high variance is another historical or technical one. But the intuition is that, if we're fitting such a high-order polynomial, then the hypothesis can fit almost any function, and this space of possible hypotheses is just too large, too variable. We don't have enough data to constrain it to give us a good hypothesis, and that's called overfitting.

Overfitting:

If we have too many features, the learned hypothesis may fit the training set very well (the cost function J(θ) may be very close to 0, or even exactly 0), but fail to generalize to new examples (i.e., fail to predict prices on new examples).

To recap: the problem of overfitting arises when we have too many features. The learned hypothesis may then fit the training set very well, so your cost function may actually be very close to zero, or maybe even exactly zero, but you may end up with a curve that tries too hard to fit the training set, so that it fails to generalize to new examples and fails to predict prices on new examples. Here the term "generalize" refers to how well a hypothesis applies even to new examples.
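To make this concrete, here is a small sketch (the house sizes and prices below are made-up numbers, not the lecture's data) that fits degree-1, degree-2, and degree-4 polynomials to the same five training points with NumPy and compares their training errors. The degree-4 curve matches the training set essentially perfectly, which is exactly the situation in which it is least likely to generalize.

```python
import numpy as np

# Made-up training set: house size (in 100s of square feet) and price (in $1000s)
size  = np.array([10.0, 15.0, 20.0, 30.0, 45.0])
price = np.array([200.0, 280.0, 330.0, 390.0, 410.0])

for degree in (1, 2, 4):
    coeffs = np.polyfit(size, price, degree)        # least-squares polynomial fit
    predictions = np.polyval(coeffs, size)
    train_error = np.mean((predictions - price) ** 2) / 2
    print(f"degree {degree}: training error = {train_error:.4f}")

# Typical output: the straight line (degree 1) has the largest training error
# (underfitting), the quadratic fits well, and the degree-4 polynomial passes
# through all five points, driving the training error to essentially zero --
# a "perfect" fit to the training set that need not generalize to new houses.
```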

Example: Logistic regression

A similar thing can apply to logistic regression as well. Here is a logistic regression example with two features x1 and x2.

Underfit (High bias)

One thing we could do is fit logistic regression with just a simple hypothesis like this, where, as usual, g is the sigmoid function. If you do that, you end up with a hypothesis that tries to use just a straight line to separate the positive and the negative examples. And this doesn't look like a very good fit to the data. So, once again, this is an example of underfitting, or of the hypothesis having high bias.

Just right

In contrast, if you were to add quadratic terms to your features, then you could get a decision boundary that might look more like the middle one. And that's a pretty good fit to the data, probably about as good as we could get on this training set.

Overfitting (High variance)

Finally, at the other extreme, if you were to fit a very high-order polynomial, if you were to generate lots of high-order polynomial terms of the features, then logistic regression may contort itself, may try really hard to find a decision boundary that fits your training data, or go to great lengths to contort itself to fit every single training example well.

This doesn't look like a very good hypothesis for making predictions. And so, once again, this is an instance of overfitting and of a hypothesis having high variance, and being unlikely to generalize well to new examples.
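The overfitting here comes from the feature mapping rather than from logistic regression itself. Below is a minimal sketch of such a mapping (the function name map_features and the example inputs are my own, not from the lecture): it expands two features x1, x2 into all monomial terms up to a given degree, so a degree-6 mapping turns 2 features into 28 terms (including the constant), which gives the decision boundary enough flexibility to contort around individual training examples.

```python
import numpy as np

def map_features(x1, x2, degree):
    """Expand two features into all monomial terms x1^i * x2^j with i + j <= degree.
    degree=1 gives the straight-line boundary; degree=2 adds the quadratic terms;
    degree=6 gives the highly flexible boundary that can overfit."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    terms = []
    for total in range(degree + 1):
        for j in range(total + 1):
            terms.append((x1 ** (total - j)) * (x2 ** j))
    return np.column_stack(terms)   # first column is the constant term (all ones)

# Three hypothetical examples, expanded to degree 6
X = map_features([1.0, 0.5, -0.3], [2.0, -1.0, 0.7], degree=6)
print(X.shape)   # (3, 28): 28 terms per example, including the constant term
```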

How to address overfitting?

Let's talk about what we can do to address overfitting if we think it is occurring.

Plotting the hypothesis

In the previous examples, we had one- or two-dimensional data, so we could just plot the hypothesis, see what was going on, and use figures like these to select an appropriate degree of polynomial. So plotting the hypothesis can be one way to try to decide what degree of polynomial to use.


But plotting the hypothesis doesn't always work. In fact, more often we may have learning problems where we simply have a lot of features, and it is not just a matter of selecting the degree of a polynomial. And when we have so many features, it also becomes much harder to plot the data and to visualize it in order to decide which features to keep or not.

Options to deal with overfitting:

1. Reduce the number of features.
   • Manually select which features to keep.
   • Model selection algorithms (later in the course).

2. Regularization.
   • Keep all the features, but reduce the magnitude/values of the parameters θj.
   • Works well when we have a lot of features, each of which contributes a bit to predicting y.

Try to reduce the number of features

But if we have a lot of features and very little training data, then overfitting can become a problem. In order to address overfitting, there are two main options for things that we can do.
The first option is to try to reduce the number of features. Concretely, one thing we could do is manually look through the list of features and use that to try to decide which are the more important features, and therefore which features we should keep and which we should throw out. Later in this course, we'll also talk about model selection algorithms, which are algorithms for automatically deciding which features to keep and which to throw out.
This idea of reducing the number of features can work well and can reduce overfitting.
But the disadvantage is that, by throwing away some of the features, you are also throwing away some of the information you have about the problem. For example, maybe all of those features are actually useful for predicting the price of a house, so maybe we don't actually want to throw some of our information, or some of our features, away.

Regularization

The second option, which we'll talk about in the next few sections, is regularization. Here, we're going to keep all the features, but we're going to reduce the magnitude or the values of the parameters θj. And this method works well, as we'll see, when we have a lot of features, each of which contributes a little bit to predicting the value of y, like we saw in the housing price prediction example.

2. Cost Function

In this section, I'd like to convey to you the main intuitions behind how regularization works. We'll also write down the cost function that we'll use when we are using regularization.

Intuition

In the previous section, we saw that if we were to fit a quadratic function to this data, it gives us a pretty good fit. Whereas if we were to fit an overly high-order polynomial, we end up with a curve that may fit the training set very well, but overfits the data and does not generalize well.


Suppose we penalize and make θ3, θ4 really small.


Penalize the parameters

Consider the following: suppose we were to penalize and make the parameters θ3 and θ4 really small. Here is our optimization problem, where we minimize our usual squared-error cost function. Let's say I take this objective and modify it, adding to it plus 1000·θ3 squared, plus 1000·θ4 squared; 1000 is just some huge number I am writing down. Now, if we were to minimize this function, the only way to make this new cost function small is if θ3 and θ4 are small, because otherwise, if you have a thousand times θ3 squared, this new cost function is going to be big. So when we minimize this new function, we end up with θ3 close to 0 and θ4 close to 0, as if we're getting rid of those two terms. Then we are left with a quadratic function, and so we end up with a fit to the data that is a quadratic function plus maybe tiny contributions from the small terms θ3 and θ4, which are very close to 0. And so we end up with essentially a quadratic function, which is good, because this is a much better hypothesis.
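Written out, the modified objective described above is the usual squared-error cost with the two penalty terms appended (1000 is just standing in for some huge number):

```latex
\min_{\theta}\; \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2
  \;+\; 1000\,\theta_3^2 \;+\; 1000\,\theta_4^2
```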

Get a simpler hypothesis

In this particular example, we looked at the effect of penalizing two of the parameter values for being large. More generally, here is the idea behind regularization: if we have small values for the parameters, that will usually correspond to having a simpler hypothesis. In our last example, we penalized just θ3 and θ4, and when both of these were close to zero, we ended up with a much simpler hypothesis that was essentially a quadratic function. But more broadly, if we penalize all the parameters, we can think of that as trying to give us a simpler hypothesis as well, because driving these parameters close to zero, as in this example, gave us a quadratic function. More generally, it is possible to show that having smaller values of the parameters usually corresponds to smoother, simpler functions, which are therefore also less prone to overfitting.

Regularization Cost Function

Suppose we have a hundred features, and so a hundred and one parameters, and we don't know which ones to pick, which parameters to try to shrink. So in regularization, what we're going to do is take our cost function (here is my cost function for linear regression) and modify it to shrink all of my parameters, because I don't know which one or two to try to shrink. So I am going to modify my cost function by adding a term at the end.

I add an extra regularization term at the end to shrink every single parameter, and so this term tends to shrink all of my parameters θ1, θ2, θ3, up to θ100. By the way, by convention the summation here starts from one, so I am not actually going to penalize θ0 for being large. That is the convention: the sum runs from 1 through n (here, 100), rather than from 0 through n. But in practice it makes very little difference; whether or not you include θ0 makes very little difference to the results. By convention, though, we regularize only θ1 through θ100.
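As a sketch of what this regularized cost looks like in code (a plain NumPy translation, not the course's Octave implementation; the variable names are my own), note that the penalty sum starts at θ1 and leaves θ0 alone:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear regression cost J(theta).
    X is the (m, n+1) design matrix whose first column is all ones,
    theta is the (n+1,) parameter vector, y the (m,) targets,
    lam the regularization parameter lambda."""
    m = len(y)
    errors = X @ theta - y
    fit_term = (errors @ errors) / (2 * m)
    reg_term = lam * np.sum(theta[1:] ** 2) / (2 * m)   # skip theta[0] by convention
    return fit_term + reg_term

# Tiny check: theta = [1, 2] fits y = 1 + 2x exactly, so only the penalty remains.
X = np.column_stack([np.ones(4), np.arange(4.0)])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(regularized_cost(np.array([1.0, 2.0]), X, y, lam=1.0))   # 0.5 = 1 * 2**2 / (2*4)
```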

Choosing the regularization parameter λ

In this cost function J(θ), the extra term we just added is the regularization term, and λ here is called the regularization parameter. What λ does is control a trade-off between two different goals.

The first goal, captured by the first term in the objective, is that we would like to fit the training data well.

The second goal is that we want to keep the parameters small, and that is captured by the second term, the regularization term. What λ, the regularization parameter, does is control the trade-off between these two goals: between fitting the training set well and keeping the parameters small, and therefore keeping the hypothesis relatively simple to avoid overfitting.

In regularized linear regression, we choose θ to minimize:

J(θ) = (1/2m) [ Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ Σⱼ θⱼ² ],

where the first sum runs over the m training examples and the second over the parameters θ1 through θn.
What if λ is set to an extremely large value (perhaps too large for our problem, say λ = 10^10)?

  • Algorithm works fine; setting λ to be very large can’t hurt it
  • Algorithm fails to eliminate overfitting.
  • Algorithm results in underfitting. (Fails to fit even training data well).
  • Gradient descent will fail to converge. 


θ1 ≈ 0, θ2 ≈ 0, θ3 ≈ 0, θ4 ≈ 0

hθ(x) = θ0

In regularized linear regression, if the regularization parameter λ is set to be very large, then we will end up penalizing the parameters θ1, θ2, θ3, θ4 very highly. That is, if our hypothesis is the fourth-order polynomial hθ(x) = θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴, and we penalize θ1, θ2, θ3, θ4 very heavily, then we end up with all of these parameters close to zero: θ1 will be close to zero, θ2 will be close to zero, and θ3 and θ4 will end up close to zero. And if we do that, it's as if we're getting rid of those terms in the hypothesis, so that we're just left with the hypothesis hθ(x) = θ0.
It says that housing prices are simply equal to θ0, which is akin to fitting a flat horizontal straight line to the data. This is an example of underfitting, and in particular this hypothesis, this straight line, just fails to fit the training set well. It's just a flat line; it doesn't go anywhere near most of the training examples. Another way of saying this is that the hypothesis has too strong a preconception, or too high a bias, that housing prices are just equal to θ0, and despite the clear data to the contrary, it chooses to fit a flat horizontal line.
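To see this effect numerically, here is a small one-feature sketch (made-up data; it solves the penalized least-squares problem in closed form rather than running gradient descent): with λ = 10^10 the slope θ1 is driven to essentially zero, leaving hθ(x) ≈ θ0, a flat line near the mean training price.

```python
import numpy as np

# Made-up training data: house size (single feature) and price
x = np.array([10.0, 15.0, 20.0, 30.0, 45.0])
y = np.array([200.0, 280.0, 330.0, 390.0, 410.0])
X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept column

def fit_regularized(X, y, lam):
    """Exactly minimize the regularized squared-error cost by solving the
    penalized least-squares system; theta_0 (the intercept) is not penalized."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                          # do not regularize theta_0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

for lam in (0.0, 1e10):
    theta = fit_regularized(X, y, lam)
    print(f"lambda = {lam:g}: theta_0 = {theta[0]:.2f}, theta_1 = {theta[1]:.6f}")

# With lambda = 0 the slope theta_1 is clearly positive (~5.6); with lambda = 1e10
# it is driven to essentially 0, so h_theta(x) ~= theta_0 ~= 322, the mean price:
# a flat horizontal line that underfits the training data.
```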
So for regularization to work well, some care should be taken to choose a good value for the regularization parameter λ. When we talk about model selection later in this course, we'll discuss a variety of ways for automatically choosing the regularization parameter λ as well.