Logistic回归
来源:互联网 发布:淘宝商城儿童睡袋 编辑:程序博客网 时间:2024/06/01 12:58
先不说算法基础,直接干。。。。
使用梯度上升找到最佳参数
# -*- coding: utf-8 -*-"""Created on Fri Aug 18 01:05:58 2017@author: LiLong"""from numpy import *def loadDataSet(): dataMat = []; labelMat = [] fr = open('testSet.txt') for line in fr.readlines(): #strip() 方法用于移除字符串头尾指定的字符(默认为空格) lineArr = line.strip().split() # 第一列是1,另外加的,即形如[1,x1,x2] dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])]) labelMat.append(int(lineArr[2])) return dataMat,labelMat # 此时是数组def sigmoid(inX): return 1.0/(1+exp(-inX)) # 返回一个0-1的100*1的数组# 梯度上升优化算法def gradAscent(dataMatIn, classLabels): dataMatrix = mat(dataMatIn) # 转换成numpy矩阵(matrix),dataMatrix:100*3 labelMat = mat(classLabels).transpose() # transpose转置为100*1 m,n = shape(dataMatrix) # 得到行和列数,0代表行数,1代表列数 print 'shape(dataMatrix):',shape(dataMatrix) alpha = 0.001 # 步长 maxCycles = 500 # 最大迭代次数 weights = ones((n,1)) # 3*1的矩阵,其权系数个数由给定训练集的维度值和类别值决定 print 'weights:',shape(weights) for k in range(maxCycles): h = sigmoid(dataMatrix*weights) # 利用对数几率回归,z = w0 +w1*X1+w2*X2,w0相当于b error = (labelMat - h) # 100*1 weights = weights + alpha * dataMatrix.transpose()* error # 一种极大似然估计的推到过程,但每次的权系数更新都要遍历整个数据集 return weights# 画出数据集和Logistic分类直线的函数def plotBestFit(weights): import matplotlib.pyplot as plt weights=weights.getA() # 必须转换为array dataMat,labelMat=loadDataSet() # 加载数据集,训练数据和标签 dataArr = array(dataMat) # 进行数据处理,必须转换为numpy中的array n = shape(dataArr)[0] # 样本个数 xcord1 = []; ycord1 = [] xcord2 = []; ycord2 = [] for i in range(n): if int(labelMat[i])== 1: xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2]) # 一类 else: xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2]) # 另一类 fig = plt.figure() ax = fig.add_subplot(111) ax.scatter(xcord1, ycord1, s=30, c='red', marker='s') # 散点图 ax.scatter(xcord2, ycord2, s=30, c='green') x = arange(-3.0, 3.0, 0.1) y = (-weights[0]-weights[1]*x)/weights[2] # 换算后的简化形式 ax.plot(x, y) plt.xlabel('X1'); plt.ylabel('X2') plt.show()#入口函数:模型参数y=wxdataArr,labelMat=loadDataSet()weights=gradAscent(dataArr, labelMat) # dataArr:100*3,labelMat是一个列表plotBestFit(weights)
运行结果:
首先说一点语法基础,免得代码调半天:
- numpy中的ones( )用法注意:
>>> ones(10)array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])>>> ones((10,1))array([[ 1.], [ 1.], [ 1.], [ 1.], [ 1.], [ 1.], [ 1.], [ 1.], [ 1.], [ 1.]])>>> ones(10,1)Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Program_software\Anaconda\lib\site-packages\numpy\core\numeric.py", line 190, in ones a = empty(shape, dtype, order)TypeError: data type not understood>>> ones([10,1])array([[ 1.], [ 1.], [ 1.], [ 1.], [ 1.], [ 1.], [ 1.], [ 1.], [ 1.], [ 1.]])>>>
- numpy.matrix.getA()用法注意:
matrix.getA()
Return self as an ndarray object.
Equivalent to np.asarray(self).
Parameters: None
Returns: ret : ndarray
weightsOut[16]: matrix([[ 4.12414349], [ 0.48007329], [-0.6168482 ]])weights.getA()Out[17]: array([[ 4.12414349], [ 0.48007329], [-0.6168482 ]])
必须转换为array,weights=weights.getA(),不转换的话回归直线出不来。。。
总结实现过程中遇到的问题:
- 在二项logistic回归中,为了方便会将权值向量和输入向量加以扩充,一般的形式比如:
这时,
在上面的代码中,是把
dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
输入数据的第一列全部是1,dataMat的第二,第三列是特征特征x2,x3,每一行形如[1,x2,x3]。
所以,代码中
h = sigmoid(dataMatrix*weights)
运算的是
- 算法用的是logistic回归:
即hw(x)=g(wTx)=11+e−wTxg(z)=11+e−z
函数g(z)即为sigmoid函数,其导数形式为:g′(z)=1(1+e−z)2(e−z)=1(1+e−z)(1−1(1+e−z))=g(z)(1−g(z))
logistic模型学习时。对于给定的训练数据集T,可以应用极大似然估计法估计模型参数,从而得到回归模型:
设:p(Y=1|x)=hw(x)p(Y=0|x)=1−hw(x)
似然函数为:L(w)=∏i(hw(xi))yi(1−(hw(xi))1−yi
此处的xi 代表的是第i个样本,对应的yi 是其分类标签
对数似然函数:
要使得最大化,则运用梯度上升法,求出最高点:
计算结果为:
其中
这就验证了代码中:
weights = weights + alpha * dataMatrix.transpose()* error
dataMatrix.transpose()是3*100的矩阵
error是100*1的向量
dataMatrix.transpose()* error得到3*1的向量,即W
其中w1,w2,w3分别是训练数据集中所有样本相应的特征列和差值的积,即
随机梯度上升
# -*- coding: utf-8 -*-"""Created on Fri Aug 18 01:05:58 2017@author: LiLong"""from numpy import *def loadDataSet(): dataMat = []; labelMat = [] fr = open('testSet.txt') for line in fr.readlines(): #strip() 方法用于移除字符串头尾指定的字符(默认为空格) lineArr = line.strip().split() # 第一列是1,另外加的,即形如[1,x1,x2] dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])]) labelMat.append(int(lineArr[2])) return dataMat,labelMat # 此时是数组def sigmoid(inX): return 1.0/(1+exp(-inX)) # 返回一个0-1的100*1的数组# 梯度上升优化算法def gradAscent(dataMatIn, classLabels): dataMatrix = mat(dataMatIn) # 转换成numpy矩阵(matrix),dataMatrix:100*3 labelMat = mat(classLabels).transpose() # transpose转置为100*1 m,n = shape(dataMatrix) # 得到行和列数,0代表行数,1代表列数 print 'shape(dataMatrix):',shape(dataMatrix) alpha = 0.001 # 步长 maxCycles = 500 # 最大迭代次数 weights = ones((n,1)) # 3*1的矩阵,其权系数个数由给定训练集的维度值和类别值决定 print 'weights:',shape(weights) for k in range(maxCycles): h = sigmoid(dataMatrix*weights) # 利用对数几率回归,z = w0 +w1*X1+w2*X2,w0相当于b error = (labelMat - h) # 100*1 weights = weights + alpha * dataMatrix.transpose()* error # 公式推导得到的,一种与推到是极大似然估计 return weights# 随机梯度上升算法def stocGradAscent0(dataMatrix, classLabels): m,n = shape(dataMatrix) # 100*3 alpha = 0.01 weights = ones(n) for i in range(m): # m样本个数 h = sigmoid(sum(dataMatrix[i]*weights)) error = classLabels[i] - h weights = weights + alpha * error * dataMatrix[i] return weights# 改进的随机梯度上升算法def stocGradAscent1(dataMatrix, classLabels, numIter=150): m,n = shape(dataMatrix) weights = ones(n) for j in range(numIter): dataIndex = range(m) # 产生[0,...m]的列表 for i in range(m): alpha = 4/(1.0+j+i)+0.0001 # 每次迭代的时候都会调整 randIndex = int(random.uniform(0,len(dataIndex))) # 随机选择 h = sigmoid(sum(dataMatrix[randIndex]*weights)) error = classLabels[randIndex] - h weights = weights + alpha * error * dataMatrix[randIndex] del(dataIndex[randIndex]) return weights# 画出数据集和Logistic分类直线的函数def plotBestFit(weights): import matplotlib.pyplot as plt #weights=weights.getA() # 随机梯度没有矩阵的转换过程,都是numpy数组 dataMat,labelMat=loadDataSet() # 加载数据集,训练数据和标签 dataArr = array(dataMat) # 进行数据处理,必须转换为numpy中的array n = shape(dataArr)[0] # 样本个数 xcord1 = []; ycord1 = [] xcord2 = []; ycord2 = [] for i in range(n): if int(labelMat[i])== 1: xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2]) # 一类 else: xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2]) # 另一类 fig = plt.figure() ax = fig.add_subplot(111) ax.scatter(xcord1, ycord1, s=30, c='red', marker='s') # 散点图 ax.scatter(xcord2, ycord2, s=30, c='green') x = arange(-3.0, 3.0, 0.1) y = (-weights[0]-weights[1]*x)/weights[2] # 换算后的简化形式 ax.plot(x, y) plt.xlabel('X1'); plt.ylabel('X2') plt.show()#入口函数:模型参数y=wxdataArr,labelMat=loadDataSet() # dataArr列表#weights=gradAscent(dataArr, labelMat) # dataArr:100*3,labelMat是一个列表#plotBestFit(weights)weights=stocGradAscent0(array(dataArr), labelMat) # array()是numpy的数组plotBestFit(weights)weights=stocGradAscent1(array(dataArr), labelMat) # 改进的随机梯度上升,默认150plotBestFit(weights)weights=stocGradAscent1(array(dataArr), labelMat,500) # 改进的随机梯度上升,500次迭代plotBestFit(weights)
运行结果:
随机梯度上升得到的效果,错分了三分之一的样本,但其仅仅把样本遍历了一次,而上面的批处理方法遍历了500次。。
改进的随机梯度上升算法,明显有所提高,改进后的遍历次数默认是150次,也可以改为500次,效果如上两图,其中,改进的地方有:
步长参数不再是固定值,是一个大致减小的趋势,该参数控制新样本的影响程度
随机选择样本,减小了权值系数的周期性的波动
改进的随机梯度上升算法使得收敛快,计算复杂度降低
案例(从疝气病症预测病马的死亡率)
# -*- coding: utf-8 -*-"""Created on Fri Aug 18 01:05:58 2017@author: LiLong"""from numpy import *def sigmoid(inX): return 1.0/(1+exp(-inX)) # 返回一个0-1的100*1的数组# 改进的随机梯度上升算法def stocGradAscent1(dataMatrix, classLabels, numIter=150): m,n = shape(dataMatrix) weights = ones(n) for j in range(numIter): dataIndex = range(m) # 产生[0,...m]的列表 for i in range(m): alpha = 4/(1.0+j+i)+0.0001 # 每次迭代的时候都会调整 randIndex = int(random.uniform(0,len(dataIndex))) # 随机选择 h = sigmoid(sum(dataMatrix[randIndex]*weights)) error = classLabels[randIndex] - h weights = weights + alpha * error * dataMatrix[randIndex] del(dataIndex[randIndex]) return weights# 分类def classifyVector(inX, weights): prob = sigmoid(sum(inX*weights)) if prob > 0.5: return 1.0 else: return 0.0# 数据处理def colicTest(): frTrain = open('horseColicTraining.txt'); frTest = open('horseColicTest.txt') trainingSet = []; trainingLabels = [] for line in frTrain.readlines(): currLine = line.strip().split('\t') # 默认为所有的空字符,包括空格、换行(\n)、制表符(\t)等 lineArr =[] for i in range(21): lineArr.append(float(currLine[i])) trainingSet.append(lineArr) trainingLabels.append(float(currLine[21])) trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000) #迭代1000次 errorCount = 0; numTestVec = 0.0 for line in frTest.readlines(): numTestVec += 1.0 currLine = line.strip().split('\t') # \t制表符 lineArr =[] for i in range(21): lineArr.append(float(currLine[i])) if int(classifyVector(array(lineArr), trainWeights))!= int(currLine[21]): errorCount += 1 errorRate = (float(errorCount)/numTestVec) print "the error rate of this test is: %f" % errorRate return errorRatedef multiTest(): numTests = 10; errorSum=0.0 for k in range(numTests): # 10次之后的平均 errorSum += colicTest() print "after %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests))#入口函数multiTest()
运行结果:
runfile('C:/Users/LiLong/Desktop/logistic_regression/logistics_re.py', wdir='C:/Users/LiLong/Desktop/logistic_regression')the error rate of this test is: 0.343284the error rate of this test is: 0.328358the error rate of this test is: 0.298507the error rate of this test is: 0.283582the error rate of this test is: 0.388060the error rate of this test is: 0.343284the error rate of this test is: 0.328358the error rate of this test is: 0.432836the error rate of this test is: 0.268657the error rate of this test is: 0.358209after 10 iterations the average error rate is: 0.337313
可以看出错误率为34%,可以通过调整迭代次数和步长来降低错误率,而数据的预处理,可以参考机器学习实战。。。
- Logistic回归
- Logistic回归
- logistic回归
- Logistic 回归
- Logistic 回归
- logistic回归
- Logistic回归
- Logistic回归
- Logistic回归
- Logistic回归
- logistic回归
- Logistic回归
- logistic回归
- Logistic 回归
- logistic回归
- Logistic回归
- logistic回归
- Logistic回归
- 傅立叶变换详解
- hadoop 和spark的基准测试(1)
- common errors while programming CUDA
- linux上anaconda的卸载
- 贪吃蛇源代码
- Logistic回归
- jsp页面转发到servlet
- 自定义软键盘,随机数字位置键盘
- C#中I/O操作
- UGUI研究院之新手引导事件上下分离
- Android 常用测试接口 视频,天气等
- 51nod 1605-棋盘游戏(博弈)
- 自学nginx(一): nginx的快速安裝
- spark-streaming 编程(二) word count单词计数统计