[Machine Learning][Linear Regression] Feature Scaling


Introduction

When I used gradient descent to fit an h(x) close to 'x^2 + 2*x + 1', I found that the learning rate alpha had to be tiny, around 0.000001, otherwise the parameters would not converge. With such a small alpha the training became unbearably slow, so I turned to feature scaling.
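
To make the setup concrete, here is a minimal Octave sketch (not the original code; the vectors x and y, the initial theta, and the iteration count are assumptions for illustration):

m = length(y);                     % number of training examples
X = [ones(m, 1), x, x .^ 2];       % design matrix for h(x) = theta0 + theta1*x + theta2*x^2
theta = zeros(3, 1);
alpha = 0.000001;                  % without scaling, alpha must be this small or theta diverges
for iter = 1:1000000
  h = X * theta;                   % current predictions
  grad = (X' * (h - y)) / m;       % gradient of the squared-error cost
  theta = theta - alpha * grad;    % gradient descent update
end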

Concept

For example, there are two features.
The first feature ranges from [1,100], and the second feature ranges from [1,10000].
So if we draw the contour map of the cost function, the contours are extremely elongated ellipses, and gradient descent zig-zags slowly across them.

If we instead scale each feature to roughly the range [-1,1] (a slightly smaller or larger range is also acceptable, e.g. up to about [-3,3] or down to about [-1/3,1/3]), the contours become much rounder and gradient descent takes far fewer steps.

To do this, for each feature column we find the maximum and minimum values, take their difference as the column's range, and divide every value in that column by the range.
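
As a sketch, the same scaling can be written for an arbitrary number of feature columns (assuming X is an m-by-n matrix whose first column is the all-ones intercept column; Octave broadcasting handles the per-column division):

ranges = max(X, [], 1) - min(X, [], 1);        % range (max - min) of each column
X(:, 2:end) = X(:, 2:end) ./ ranges(2:end);    % divide every feature column by its range, skip the intercept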

For example, suppose we have a data matrix X like this:

X =

1 90 8100
1 3 9
1 68 4624
1 43 1849
1 4 16
1 88 7744
1 76 5776
1 21 441
1 12 144
1 60 3600
1 5 25
1 35 1225
1 24 576
1 5 25
1 90 8100
1 62 3844
1 6 36
1 82 6724
1 77 5929
1 15 225
1 38 1444
1 48 2304
1 46 2116
1 92 8464
1 21 441
1 45 2025

Applying the feature scaling (with m = size(X, 1), the number of rows):

ranges = max(X, [], 1) - min(X, [], 1);                        % range of each column
X = [ones(m, 1), X(:, 2) ./ ranges(2), X(:, 3) ./ ranges(3)];  % divide each feature column by its range

We obtain the scaled matrix:

X =

1.0000e+000 9.0909e-001 8.1008e-001
1.0000e+000 3.0303e-002 9.0009e-004
1.0000e+000 6.8687e-001 4.6245e-001
1.0000e+000 4.3434e-001 1.8492e-001
1.0000e+000 4.0404e-002 1.6002e-003
1.0000e+000 8.8889e-001 7.7448e-001
1.0000e+000 7.6768e-001 5.7766e-001
1.0000e+000 2.1212e-001 4.4104e-002
1.0000e+000 1.2121e-001 1.4401e-002
1.0000e+000 6.0606e-001 3.6004e-001
1.0000e+000 5.0505e-002 2.5003e-003
1.0000e+000 3.5354e-001 1.2251e-001
1.0000e+000 2.4242e-001 5.7606e-002
1.0000e+000 5.0505e-002 2.5003e-003
1.0000e+000 9.0909e-001 8.1008e-001
1.0000e+000 6.2626e-001 3.8444e-001
1.0000e+000 6.0606e-002 3.6004e-003
1.0000e+000 8.2828e-001 6.7247e-001
1.0000e+000 7.7778e-001 5.9296e-001
1.0000e+000 1.5152e-001 2.2502e-002
1.0000e+000 3.8384e-001 1.4441e-001
1.0000e+000 4.8485e-001 2.3042e-001
1.0000e+000 4.6465e-001 2.1162e-001
1.0000e+000 9.2929e-001 8.4648e-001
1.0000e+000 2.1212e-001 4.4104e-002
1.0000e+000 4.5455e-001 2.0252e-001
1.0000e+000 1.1111e-001 1.2101e-002
1.0000e+000 1.9192e-001 3.6104e-002
1.0000e+000 4.5455e-001 2.0252e-001
1.0000e+000 9.5960e-001 9.0259e-001
1.0000e+000 4.4444e-001 1.9362e-001
1.0000e+000 9.3939e-001 8.6499e-001
1.0000e+000 7.9798e-001 6.2416e-001
1.0000e+000 8.7879e-001 7.5698e-001
1.0000e+000 1.0000e+000 9.8020e-001
1.0000e+000 2.1212e-001 4.4104e-002
1.0000e+000 7.1717e-001 5.0415e-001
1.0000e+000 9.7980e-001 9.4099e-001
1.0000e+000 6.4646e-001 4.0964e-001
1.0000e+000 7.4747e-001 5.4765e-001
1.0000e+000 8.7879e-001 7.5698e-001

With the scaled features, training takes only thousands of gradient-descent steps, compared with the millions of steps needed before.

Extension

Besides feature scaling, there is another way to speed up gradient descent.
Some features may range over [0, 5000], while others may range over [-200, 200].
A range centred on zero, such as [-1, 1], works much better than an off-centre one such as [0, 2].
So we can shift every feature so that it lies in a symmetric range [-range, range].
Guided by this idea, we get the following expression (mean normalization):

X = (X - average(X)) / standardDeviation(X)

This will also make the training faster.
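
A minimal Octave sketch of this mean normalization, assuming the all-ones intercept column of X is left untouched:

mu = mean(X(:, 2:end));                        % mean of each feature column
sigma = std(X(:, 2:end));                      % standard deviation of each feature column
X(:, 2:end) = (X(:, 2:end) - mu) ./ sigma;     % subtract the mean, divide by the standard deviation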
