Statistical Learning Methods — Chapter 6: Logistic Regression
Source: Internet · Editor: 程序博客网 · Date: 2024/06/09 23:57
Logistic regression is a classic classification method in statistics. Maximum entropy is a principle for learning probabilistic models; extending it to classification yields the maximum entropy model. Both the logistic regression model and the maximum entropy model are log-linear models.
【Logistic Regression — Algorithm】
Reference blogs:
Overview of Statistical Learning Methods: http://blog.csdn.net/BOBOyspa/article/details/78247157
Explanation plus code:
http://blog.csdn.net/zouxy09/article/details/20319673?locationNum=6&fps=1
Basic explanation
The figures below (from the book, not reproduced here) introduce the logistic regression model and its parameter estimation (maximum log-likelihood estimation and gradient descent):
Deciding the final class
For a given sample x, the binomial logistic regression model computes the probability that the sample belongs to class 1 and to class 0, and then assigns x to the class with the larger probability.
Equivalently, take h = 0.5 as a threshold: when the estimated probability exceeds 0.5, assign the sample to class 1; when it is below 0.5, assign it to class 0.
*Linear regression vs. logistic regression:
Logistic regression is essentially linear regression with an extra function mapping: a logistic (sigmoid) function applied to the output maps the unbounded linear score into the interval (0, 1).
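A minimal numeric sketch of this mapping and the 0.5 threshold; the weight vector and sample below are hypothetical, chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    # squashes any real-valued linear score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical weights and sample; the leading 1.0 in x is the bias feature
w = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 0.3, 0.8])

p = sigmoid(w @ x)            # P(y = 1 | x); here w @ x = 1.74, p ≈ 0.851
label = 1 if p > 0.5 else 0   # threshold at 0.5 → class 1
```

The linear score 1.74 could be any real number, but the sigmoid always returns a value strictly between 0 and 1, which is what lets it be read as a probability.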
The algorithm flow is as follows:
Algorithm from the book (figure not reproduced here):
Parameter estimation:
Maximize the log-likelihood function L(w); equivalently, take the negative log-likelihood as the loss and solve with stochastic gradient descent. At each step, randomly select one sample and update w with the gradient above.
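A single update step can be sketched as follows; the sample, label, and step size are made up for illustration. Starting from w = 0, the predicted probability is exactly 0.5, so the error term y − σ(w·x) equals 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one stochastic gradient step on the log-likelihood:
#   w <- w + alpha * (y - sigmoid(w.x)) * x
alpha = 0.1
w = np.zeros(3)
x = np.array([1.0, 2.0, -1.0])   # hypothetical sample with bias feature
y = 1.0                           # its label

error = y - sigmoid(w @ x)        # = 1 - 0.5 = 0.5 at w = 0
w = w + alpha * error * x         # w becomes [0.05, 0.1, -0.05]
```

The sign of the update follows the error: samples predicted too low push the weights up along x, and vice versa.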
【Code】 from Machine Learning in Action
Another blog worth consulting: http://blog.csdn.net/wds2006sdo/article/details/53084871
The logistic regression model's accuracy may be better than the perceptron's; small differences can also stem from floating-point precision and similar issues.
Step 1: Load the data
def loadData():
    train_x = []
    train_y = []
    with open('test_set.txt') as f:
        text = f.readlines()
    for line in text:
        train_x.append([1] + line.split()[:-1])   # prepend 1 as the bias feature
        train_y.append([line.split()[-1]])        # nested list so the labels become a column matrix
    # cast to float for the numeric computations that follow
    return np.mat(train_x).astype(float), np.mat(train_y).astype(float)
*Note 1:
When reduction operations are applied to an ndarray, a 2-D array often collapses into a 1-D array; broadcasting in later computations can then raise errors, so the matrix type is used here.
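A small demonstration of the difference the note refers to: indexing a column of a plain ndarray drops a dimension, while the matrix type keeps the column shape:

```python
import numpy as np

a = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]

col = a[:, 1]        # plain ndarray indexing drops the axis -> shape (3,)
m = np.mat(a)
mcol = m[:, 1]       # matrix indexing stays 2-D -> shape (3, 1)
```

With the (3,) array, a later expression like `train_y - output` can silently broadcast to an unintended shape; the (3, 1) matrix keeps every operand two-dimensional.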
Step 2: Training
1) Compute the sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
2) Train
def trainLogRegres(train_x, train_y, opts):
    numSample, numFeature = train_x.shape      # numSample samples, numFeature features
    weights = np.ones((numFeature, 1))
    maxIter = opts['maxIter']
    alpha = opts['alpha']
    for k in range(maxIter):
        if opts['optimizeType'] == 'gradDescent':            # batch gradient algorithm
            output = sigmoid(train_x * weights)              # train_x: (numSample, numFeature)
            error = train_y - output                         # error: (numSample, 1)
            weights += alpha * train_x.transpose() * error
        elif opts['optimizeType'] == 'stocGradDescent':      # stochastic gradient descent
            for i in range(numSample):
                alpha = 4 / (10 + k + i)                     # decaying step size
                output = sigmoid(train_x[i] * weights)       # train_x[i]: (1, numFeature), weights: (numFeature, 1)
                error = train_y[i] - output                  # error: (1, 1)
                weights += alpha * train_x[i].transpose() * error
        elif opts['optimizeType'] == 'min-batch-gradDescent':  # mini-batch, batch size b = 10
            b = 10
            for i in range(0, numSample, b):                 # bug fix: original looped range(0, b, numSample)
                output = sigmoid(train_x[i:i+b] * weights)   # train_x[i:i+b]: (b, numFeature)
                error = train_y[i:i+b] - output              # error: (b, 1)
                weights += alpha * train_x[i:i+b].transpose() * error
        else:
            raise NameError('Unsupported optimize method type!')
    return weights
Step 3: Testing
def testLogRegres(weights, test_x, test_y):
    output = sigmoid(test_x * weights)
    predict = output > 0.5
    accuracy = np.sum(predict == test_y) / len(test_y)  # np.sum avoids the builtin sum's row-wise matrix result
    return accuracy
Step 4: Visualization
def showLogRegres(weights, train_x, train_y):
    for i in range(len(train_y)):
        if train_y[i] == 1:
            plt.plot(train_x[i, 1], train_x[i, 2], 'or')  # positive samples in red
        else:
            plt.plot(train_x[i, 1], train_x[i, 2], 'ob')  # negative samples in blue
    min_x = float(min(train_x[:, 1]))
    max_x = float(max(train_x[:, 1]))
    # on the boundary w0 + w1*x1 + w2*x2 = 0, so x2 = -(w0 + w1*x1) / w2
    min_y = (-weights[0, 0] - weights[1, 0] * min_x) / weights[2, 0]
    max_y = (-weights[0, 0] - weights[1, 0] * max_x) / weights[2, 0]
    plt.plot([min_x, max_x], [min_y, max_y], 'g-')        # draw the decision boundary
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.show()
Main function
import time

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions

if __name__ == '__main__':
    ## step 1: load data
    print('step1:Start loading data...')
    time_1 = time.time()
    train_set, train_label = loadData()
    train_x, test_x, train_y, test_y = train_test_split(train_set, train_label, test_size=0.33, random_state=23323)
    time_2 = time.time()
    print(' Loading data costs %f seconds.\n' % (time_2 - time_1))

    ## step 2: training
    print('step2:Start training...')
    opts = {'alpha': 0.01, 'maxIter': 100, 'optimizeType': 'stocGradDescent'}
    optimalWeights = trainLogRegres(train_x, train_y, opts)
    time_3 = time.time()
    print(' Training data costs %f seconds.\n' % (time_3 - time_2))

    ## step 3: testing
    print('step3:Start testing...')
    accuracy = testLogRegres(optimalWeights, test_x, test_y)
    print(' accuracy:%f' % accuracy)
    time_4 = time.time()
    print(' Testing data costs %f seconds.\n' % (time_4 - time_3))

    ## step 4: plot the figure
    print('step4:Start plotting the figure...')
    showLogRegres(optimalWeights, train_x, train_y)
    time_5 = time.time()  # bug fix: original reused time_4 and measured the wrong interval
    print(' Plotting the figure costs %f seconds.\n' % (time_5 - time_4))
Output
step1:Start loading data...
Loading data costs 0.001000 seconds.
step2:Start training...
Training data costs 0.057000 seconds.
step3:Start testing...
accuracy:0.969697
Testing data costs 0.001000 seconds.
step4:Start plotting the figure...
Plotting the figure costs 0.112000 seconds.