# Hsuan-Tien Lin's Machine Learning Foundations, Homework 4: Python Source Code
An experiment with regularized linear regression and validation.
For classification with regularized linear regression, we solve Wreg = argmin_W ( λ/N * ||W||^2 + 1/N * ||XW-Y||^2 ).
All of the problems below are based on the following data:

- Training set: http://www.csie.ntu.edu.tw/~htlin/course/ml15fall/hw4/hw4_train.dat
- Testing set: http://www.csie.ntu.edu.tw/~htlin/course/ml15fall/hw4/hw4_test.dat
Since all of the problems below are classification problems, Eout is computed with the 0/1 error.
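Concretely, for a weight vector W the 0/1 error over N examples is

$$E_{0/1}(W) = \frac{1}{N}\sum_{n=1}^{N}\big[\mathrm{sign}(W^{T}x_{n}) \neq y_{n}\big]$$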
Question 13: with λ = 11.26, what are the corresponding Ein and Eout?
From the derivation in class we have Wreg = (X^T X + λI)^-1 X^T Y, so this problem only requires substituting the given λ into that formula to get Wreg, then computing Ein and Eout.
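As a quick sanity check, this closed form follows from setting the gradient of the objective above to zero:

$$\nabla E(W) = \frac{2}{N}\big(X^{T}XW - X^{T}Y + \lambda W\big) = 0 \;\Longrightarrow\; W_{reg} = (X^{T}X + \lambda I)^{-1}X^{T}Y$$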
```python
import numpy as np
import matplotlib.pyplot as plt

### load training data or testing data from a whitespace-separated file
def data_load(path):
    ### open the data file and read all lines
    f = open(path)
    try:
        lines = f.readlines()
    finally:
        f.close()
    ### each line holds d features plus one label, so the token count
    ### equals the dimension of X once we prepend the bias coordinate x0
    example_num = len(lines)
    feature_dimension = len(lines[0].strip().split())
    X = np.zeros((example_num, feature_dimension))
    Y = np.zeros((example_num, 1))
    X[:, 0] = 1  # x0 = 1 for the bias term
    ### parse every line
    for index, line in enumerate(lines):
        items = line.strip().split()
        X[index, 1:] = [float(_) for _ in items[:-1]]
        Y[index] = float(items[-1])
    return X, Y
```
Since the features are two-dimensional, let's satisfy a bit of curiosity and draw a scatter plot of the samples (for a clearer view I drew two figures; the second one zooms in on the region where the positive and negative classes meet).
We can use plt.scatter(x, y, s=area, marker='^', c=c) to draw the scatter plot, where

- x: the horizontal coordinates
- y: the vertical coordinates
- s: the size of each point
- marker: the marker shape
- c: the marker color
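A minimal sketch of these parameters on a few made-up points (purely illustrative, not part of the homework data):

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical points, only to demonstrate the scatter parameters
x = np.array([0.2, 0.4, 0.6, 0.8])
y = np.array([0.3, 0.7, 0.2, 0.9])
area = np.full(len(x), 20)  # s: one size value per point

plt.scatter(x, y, s=area, marker='^', c='g')  # green triangles
plt.show()
```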
```python
def data_visual(X, Y, title="Default", xmin=0, xmax=1, ymin=0, ymax=1, func=False, w=[]):
    x1 = X[:, 1]
    x2 = X[:, 2]
    labels = Y[:, 0]
    ### one size entry per point
    dot_size = 20
    size = np.ones((len(x1))) * dot_size
    ### masked size array for the positive points (mask out the negative points)
    s_x1 = np.ma.masked_where(labels < 0, size)
    ### masked size array for the negative points (mask out the positive points)
    s_x2 = np.ma.masked_where(labels > 0, size)
    ### plot positive points as x
    plt.scatter(x1, x2, s_x1, marker='x', c='r', label="positive")
    ### plot negative points as o
    plt.scatter(x1, x2, s_x2, marker='o', c='b', label="negative")
    ### plot the decision boundary if requested
    if func:
        x1_dot = np.arange(xmin, xmax, 0.01)
        x2_dot = np.arange(ymin, ymax, 0.01)
        x1_dot, x2_dot = np.meshgrid(x1_dot, x2_dot)
        f = w[0, 0] + w[1, 0] * x1_dot + w[2, 0] * x2_dot
        plt.contour(x1_dot, x2_dot, f, levels=[0])  # zero level set of w0 + w1*x1 + w2*x2
    ### add labels, limits and a legend
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.title(title)
    plt.legend()
    plt.xlim(xmin, xmax)
    plt.ylim(ymin, ymax)
    plt.show()

### load and plot the training data
X_train, Y_train = data_load('hw4_train.dat')
data_visual(X_train, Y_train, "Example data")
data_visual(X_train, Y_train, "Partial Example data", 0.45, 0.55, 0.45, 0.55)
```
Computing Ein and Eout gives Ein = 0.055 and Eout = 0.052.
The first time I saw this result I was rather surprised. Judging from the scatter plots above, a plain linear-regression fit should, in principle, give a line that runs right between the positive and negative points (linear regression minimizes a squared-error cost function, and such a separating line should have the smallest squared error), not the line in the figure below. But on reflection: ah, this is exactly what regularization does. The L2 regularizer constrains ||W||^2, and λ = 11.26 (presumably some memorable date; only Professor Lin knows) is a large value, which means a very tight constraint on W, and that is why we get this 'unexpected' line. I then tried a few more values of λ (readers can copy the code and run it themselves):
- With λ = 11.26, the fitted weight vector is [-0.88765371, 1.00531461, 1.00530077]
- With λ = 1, the fitted weight vector is [-1.42004196, 1.48896417, 1.48882476]
- With λ = 0.01, the fitted weight vector is [-1.5033133, 1.63252051, 1.49578124]
- With λ = 0, the fitted weight vector is [-1.49776048e+00, 3.67237446e+03, -3.66924721e+03]
So as λ goes from large to small, the algorithm's constraint on the vector W gets looser and looser (visible in ||W||^2 growing), and at λ = 0 there is no constraint on W at all: we recover the line we intuitively felt was most reasonable, the one passing right between the sample points.
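A short sketch that makes the shrinking-norm observation concrete by printing ||W||^2 for each of the λ values above (it reuses data_load and plugs into the closed-form solution directly):

```python
### print ||W||^2 for several lambdas to watch the constraint tighten and loosen
X, Y = data_load('hw4_train.dat')
d = X.shape[1]
for lamb in [11.26, 1, 0.01, 0]:
    w = np.linalg.inv(X.T.dot(X) + lamb * np.eye(d)).dot(X.T).dot(Y)
    print("lambda = %6.2f  ->  ||W||^2 = %.4f" % (lamb, (w ** 2).sum()))
```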
```python
class LinearRegressionReg:
    def __init__(self):
        self._dimension = 0

    def fit(self, X, Y, lamb):
        ### closed-form ridge solution: W = (X^T X + lambda*I)^-1 X^T Y
        self._dimension = len(X[0])
        self._lamb = lamb
        self._w = np.linalg.inv(np.dot(X.T, X) + lamb * np.eye(self._dimension)).dot(X.T).dot(Y)

    def predict(self, X):
        ### classify by the sign of the linear score
        result = np.dot(X, self._w)
        return np.array([(1 if _ >= 0 else -1) for _ in result]).reshape(len(X), 1)

    def score(self, X, Y):
        ### 0/1 error rate (returned as a length-1 array, hence the brackets in the output)
        Y_predict = self.predict(X)
        return sum(Y_predict != Y) / (len(Y) * 1.0)

    def get_w(self):
        return self._w

    def print_val(self):
        print("w: ", self._w)

### Error in
lr = LinearRegressionReg()
lr.fit(X_train, Y_train, 11.26)
Ein = lr.score(X_train, Y_train)
data_visual(X_train, Y_train, title="Training Data", xmin=0.4, xmax=0.6,
            ymin=0.35, ymax=0.6, func=True, w=lr.get_w())
lr.print_val()
print("Ein : ", Ein)

### Error out
X_test, Y_test = data_load('hw4_test.dat')
Eout = lr.score(X_test, Y_test)
data_visual(X_test, Y_test, title="Test Data", xmin=0.4, xmax=0.6,
            ymin=0.35, ymax=0.6, func=True, w=lr.get_w())
print("Eout : ", Eout)
```
```
w:  [[-0.88765371]
 [ 1.00531461]
 [ 1.00530077]]
Ein :  [ 0.055]
Eout :  [ 0.052]
```
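As an optional cross-check (assuming scikit-learn is available): multiplying our objective by N gives ||XW-Y||^2 + λ||W||^2, which is exactly scikit-learn's Ridge with alpha = λ and no separate intercept, since our X already carries the bias column:

```python
### optional sanity check against scikit-learn's Ridge
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=11.26, fit_intercept=False)  # the bias is handled by the x0 = 1 column
ridge.fit(X_train, Y_train)
print(ridge.coef_)  # should closely match lr.get_w().T
```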
Questions 14/15: plot the Ein curve as λ takes the values {10^2, 10^1, 10^0, ..., 10^-8, 10^-9, 10^-10}, and find the λ with the smallest Ein and the λ with the smallest Eout (when several λ give the same Ein or Eout, take the largest λ).
```python
### lambdas to try: 10^2, 10^1, ..., 10^-10
log_lambs = range(2, -11, -1)
lambs = [10**_ for _ in log_lambs]
Ein = []
Eout = []
lr = LinearRegressionReg()

### compute the Ein and Eout arrays
for lamb in lambs:
    ### fit the model
    lr.fit(X_train, Y_train, lamb)
    ### get Ein
    Ein.append(lr.score(X_train, Y_train))
    ### get Eout
    Eout.append(lr.score(X_test, Y_test))

### plot the Ein and Eout curves
plt.plot(log_lambs, Ein, label="Error In", marker='o')
plt.plot(log_lambs, Eout, label="Error Out", marker='o')
plt.title("Curve of Error")
plt.xlabel("Log lambda")
plt.ylabel("Error")
plt.legend()
plt.show()

print("λ = %e with minimal Ein: %f" % (lambs[Ein.index(min(Ein))], min(Ein)))
print("λ = %e with minimal Eout: %f" % (lambs[Eout.index(min(Eout))], min(Eout)))
```
```
λ = 1.000000e-08 with minimal Ein: 0.000000
λ = 1.000000e-02 with minimal Eout: 0.021000
```
The next problems are about validation. We split our samples into a training set (the first 120 examples) and a validation set (the remaining 80), train all the models (one per λ value) on the training set, and then validate the resulting hypotheses on the validation set.
Questions 16/17: single validation
Compute Etrain, Eval and Eout for every λ, and find the λ with the smallest Etrain and the λ with the smallest Eval.
From the results and the figure below we can see that:

1. The three curves Etrain, Eval and Eout follow the same overall trend, in line with the conclusions from the lectures.
2. The λ selected by Etrain does a tiny bit better than the one selected by Eval, although the two are really quite close. λ = 1.000000e-08 gives the smallest Eout here, so does that mean selecting by Eval is not always better than selecting by Etrain??
```python
### split into a training set (120) and a validation set (80)
X, Y = data_load('hw4_train.dat')
X_tra = X[0:120, :]
Y_tra = Y[0:120, :]
X_val = X[120:, :]
Y_val = Y[120:, :]

log_lambs = range(2, -11, -1)
lambs = [10**_ for _ in log_lambs]
Etrain = []
Eout = []
Eval = []
lr = LinearRegressionReg()

### compute the Etrain, Eval and Eout arrays
for lamb in lambs:
    ### fit the model on the 120-example training set
    lr.fit(X_tra, Y_tra, lamb)
    ### get Etrain
    Etrain.append(lr.score(X_tra, Y_tra))
    ### get Eval
    Eval.append(lr.score(X_val, Y_val))
    ### get Eout
    Eout.append(lr.score(X_test, Y_test))

### plot the Etrain, Eval and Eout curves
plt.plot(log_lambs, Etrain, label="Error Training", marker='o')
plt.plot(log_lambs, Eval, label="Error Validation", marker='o')
plt.plot(log_lambs, Eout, label="Error Out", marker='o')
plt.title("Curve of Error")
plt.xlabel("Log lambda")
plt.ylabel("Error")
plt.legend()
plt.show()

print("λ = %e with minimal Etrain: %f, Eout: %f" % (lambs[Etrain.index(min(Etrain))],
                                                    min(Etrain), Eout[Etrain.index(min(Etrain))]))
print("λ = %e with minimal Eval: %f, Eout: %f" % (lambs[Eval.index(min(Eval))],
                                                  min(Eval), Eout[Eval.index(min(Eval))]))
```
```
λ = 1.000000e-08 with minimal Etrain: 0.000000, Eout: 0.025000
λ = 1.000000e+00 with minimal Eval: 0.037500, Eout: 0.028000
```
Question 18: take the λ selected in Questions 16/17 (1.000000e-08), train the corresponding model on all of the samples, and compute Ein and Eout.
The results show that the Eout of the model retrained on the full sample is somewhat smaller than the Eout of the same model trained only on the 120-example training portion (0.02 versus 0.025).
```python
### reload the full training data
X_train, Y_train = data_load('hw4_train.dat')

### fit the selected model on all 200 examples
lr = LinearRegressionReg()
lr.fit(X_train, Y_train, 1e-08)
Ein = lr.score(X_train, Y_train)
Eout = lr.score(X_test, Y_test)
print("Error in : ", Ein)
print("Error out: ", Eout)
```
```
Error in :  [ 0.015]
Error out:  [ 0.02]
```
The remaining problems use cross validation. We split the samples evenly into 5 folds of 40 examples each.
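Following the definition from the lectures, the cross-validation error averages the validation error over the V = 5 folds, where each hypothesis g_v^- is trained on the other four folds:

$$E_{cv} = \frac{1}{V}\sum_{v=1}^{V} E_{val}^{(v)}(g_v^-)$$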
Question 19: use cross validation to find the λ with the smallest Ecv.
```python
### 5-fold cross validation over the same grid of lambdas
X, Y = data_load('hw4_train.dat')
log_lambs = range(2, -11, -1)
lambs = [10**_ for _ in log_lambs]
lr = LinearRegressionReg()
num = len(Y)
fold = 5
size = num // fold  # 200 / 5 = 40 examples per fold
Ecv = []

### compute the Ecv array
for lamb in lambs:
    Err = []
    for index in range(fold):
        start = index * size
        end = (index + 1) * size if index + 1 < fold else num
        ### fold `index` is held out for validation, the other four folds are for training
        X_hold = X[start:end]
        Y_hold = Y[start:end]
        X_rest = np.concatenate((X[:start], X[end:]))
        Y_rest = np.concatenate((Y[:start], Y[end:]))
        ### fit on the four training folds
        lr.fit(X_rest, Y_rest, lamb)
        ### validate on the held-out fold
        Err.append(lr.score(X_hold, Y_hold))
    ### Ecv is the average validation error over the five folds
    Ecv.append(sum(Err) / (len(Err) * 1.0))

plt.plot(log_lambs, Ecv, label="Ecv", marker='o')
plt.title("Error Cross Validation")
plt.xlabel("Log lambda")
plt.ylabel("Error")
plt.legend()
plt.show()

print("λ = %e with minimal Ecv: %f" % (lambs[Ecv.index(min(Ecv))], min(Ecv)))
```
```
λ = 1.000000e-03 with minimal Ecv: 0.033750
```
Question 20: take the model selected in Question 19 (λ = 1.000000e-03), train it on the whole training set, then evaluate it on the testing set to get Ein and Eout.
```python
### retrain on the full training set with the lambda chosen by cross validation
X, Y = data_load('hw4_train.dat')
X_test, Y_test = data_load('hw4_test.dat')
lr = LinearRegressionReg()
lr.fit(X, Y, 1e-03)
Ein = lr.score(X, Y)
Eout = lr.score(X_test, Y_test)
print("Ein: ", Ein)
print("Eout: ", Eout)
```
```
Ein:  [ 0.03]
Eout:  [ 0.016]
```