Logistic Regression


Let's skip the algorithmic background for a moment and go straight to the code.

Finding the best-fit parameters with gradient ascent

# -*- coding: utf-8 -*-
"""
Created on Fri Aug 18 01:05:58 2017
@author: LiLong
"""
from numpy import *

def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        # strip() removes the given characters (whitespace by default) from both ends
        lineArr = line.strip().split()
        # prepend a constant 1, so each row has the form [1, x1, x2]
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat  # returned as plain Python lists

def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))  # elementwise; maps every entry into (0, 1)

# gradient ascent optimization
def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)              # convert to a numpy matrix, 100x3
    labelMat = mat(classLabels).transpose()  # transpose to a 100x1 column
    m, n = shape(dataMatrix)                 # rows = samples, columns = features
    print('shape(dataMatrix):', shape(dataMatrix))
    alpha = 0.001     # step size
    maxCycles = 500   # maximum number of iterations
    weights = ones((n, 1))  # 3x1: one weight per input column
    print('weights:', shape(weights))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)  # logistic model: z = w0 + w1*x1 + w2*x2, w0 plays the role of b
        error = (labelMat - h)             # 100x1
        # gradient step from the maximum-likelihood derivation;
        # note that every update sweeps the whole data set
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights

# plot the data set and the fitted logistic decision line
def plotBestFit(weights):
    import matplotlib.pyplot as plt
    weights = weights.getA()           # must be converted to an ndarray
    dataMat, labelMat = loadDataSet()  # reload training data and labels
    dataArr = array(dataMat)           # convert to a numpy array for indexing
    n = shape(dataArr)[0]              # number of samples
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])  # one class
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])  # the other class
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')  # scatter plot
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    # the decision boundary is 0 = w0 + w1*x + w2*y, solved here for y
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()

# entry point: learn the model parameters of y = w^T x
dataArr, labelMat = loadDataSet()
weights = gradAscent(dataArr, labelMat)  # dataArr: 100x3; labelMat is a list
plotBestFit(weights)

Result:

[Figure: training data scatter plot with the decision boundary found by gradient ascent]
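One practical aside, not in the original code: for strongly negative inputs, exp(-inX) overflows float64 and numpy emits a RuntimeWarning (this can show up later with the unnormalized horse-colic features). A numerically stable variant, sketched here as an optional replacement, splits on the sign so it never exponentiates a large positive number:

import numpy as np

def stable_sigmoid(x):
    # sketch of a numerically stable sigmoid
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))  # safe: exponent is <= 0
    ex = np.exp(x[~pos])                      # safe: exponent is < 0
    out[~pos] = ex / (1.0 + ex)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # [0. , 0.5, 1. ], no warning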

First, a couple of syntax notes, to save you a long debugging session:

  • Note on numpy's ones(): the shape must be passed as a single int, tuple, or list. A second positional argument is interpreted as the dtype, which is why ones(10,1) raises a TypeError:
>>> ones(10)
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
>>> ones((10,1))
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])
>>> ones(10,1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program_software\Anaconda\lib\site-packages\numpy\core\numeric.py", line 190, in ones
    a = empty(shape, dtype, order)
TypeError: data type not understood
>>> ones([10,1])
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])
  • Note on numpy.matrix.getA():

matrix.getA()

Return self as an ndarray object.
Equivalent to np.asarray(self).

Parameters: None
Returns: ret : ndarray

weights
Out[16]:
matrix([[ 4.12414349],
        [ 0.48007329],
        [-0.6168482 ]])

weights.getA()
Out[17]:
array([[ 4.12414349],
       [ 0.48007329],
       [-0.6168482 ]])

The matrix must be converted to an array with weights = weights.getA(); without the conversion the regression line is never drawn, because indexing an np.matrix yields 1×1 matrices rather than scalars, and the arithmetic in plotBestFit then produces a 1×60 matrix that matplotlib cannot plot against x.
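To see this concretely, here is a minimal sketch (the all-ones weights are placeholders): the same arithmetic yields a 1×60 matrix in one case and a plain length-60 array in the other.

import numpy as np

w = np.mat(np.ones((3, 1)))          # a 3x1 np.matrix, like the one gradAscent returns
x = np.arange(-3.0, 3.0, 0.1)        # 60 points
y_mat = (-w[0] - w[1] * x) / w[2]    # matrix indexing gives 1x1 matrices -> 1x60 matrix
wa = w.getA()                        # convert to a plain ndarray first
y_arr = (-wa[0] - wa[1] * x) / wa[2] # ordinary broadcasting -> shape (60,)
print(type(y_mat), y_mat.shape)      # <class 'numpy.matrix'> (1, 60)
print(type(y_arr), y_arr.shape)      # <class 'numpy.ndarray'> (60,)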

A summary of issues encountered during the implementation:

  • In binomial logistic regression, it is convenient to augment the weight vector and the input vector. The general form is:

$$y = W^T X + b$$

For convenience this is expanded to

$$W = (w^{(1)}, w^{(2)}, w^{(3)}, \dots, w^{(n)}, b)^T, \qquad X = (x^{(1)}, x^{(2)}, x^{(3)}, \dots, x^{(n)}, 1)^T$$

so that $y = W^T X$.

In the code above, $w_0$ plays the role of $b$:

dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])]) 

The first column of the input data is all 1s, and the second and third columns of dataMat hold the features $x_1$ and $x_2$, so each row has the form [1, x1, x2].
Therefore, the line

h = sigmoid(dataMatrix*weights)

computes, for every row at once, the product

$$[b,\ w_1,\ w_2] \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}$$

followed by the elementwise sigmoid.
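As a quick numeric illustration (a minimal sketch; the weight values are made up), for a single sample this is just a dot product pushed through the sigmoid:

import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

w = np.array([4.12, 0.48, -0.62])   # [b, w1, w2] -- hypothetical fitted weights
row = np.array([1.0, 0.3, 2.5])     # [1, x1, x2] -- one made-up sample row
print(sigmoid(np.dot(w, row)))      # P(class 1): here about 0.94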

  • The algorithm is logistic regression:

$$h_w(x) = g(w^T x) = \frac{1}{1 + e^{-w^T x}}, \qquad g(z) = \frac{1}{1 + e^{-z}}$$

The function $g(z)$ is the sigmoid function; its derivative is:

$$g'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right) = g(z)\,(1 - g(z))$$
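This identity is easy to check numerically; a minimal sketch using a central finite difference:

import numpy as np

def g(z):
    return 1.0 / (1 + np.exp(-z))

z, eps = 0.7, 1e-6
numeric = (g(z + eps) - g(z - eps)) / (2 * eps)  # finite-difference slope at z
analytic = g(z) * (1 - g(z))                     # the closed form derived above
print(numeric, analytic)                         # both approximately 0.2217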

When learning the logistic model, the parameters for a given training set $T$ can be estimated by maximum likelihood, which yields the regression model. Let:

$$P(Y = 1 \mid x) = h_w(x), \qquad P(Y = 0 \mid x) = 1 - h_w(x)$$

The likelihood function is:

$$L(w) = \prod_i \left( h_w(x_i) \right)^{y_i} \left( 1 - h_w(x_i) \right)^{1 - y_i}$$

Here $x_i$ denotes the $i$-th sample and $y_i$ its class label.

The log-likelihood is:

$$\delta(w) = \log L(w) = \sum_i \left( y_i \log h(x_i) + (1 - y_i) \log (1 - h(x_i)) \right)$$

To maximize it, apply gradient ascent and climb to the maximum:

$$w := w + \alpha \nabla_w \delta(w), \qquad \left[ \nabla_w \delta(w) \right]_j = \frac{\partial}{\partial w_j} \delta(w), \qquad z = w^T x$$

$$\begin{aligned}
\frac{\partial}{\partial w_j} \delta(w)
&= \sum_i \left( y_i \frac{1}{g(w^T x_i)} - (1 - y_i) \frac{1}{1 - g(w^T x_i)} \right) \frac{\partial}{\partial w_j} g(w^T x_i) \\
&= \sum_i \left( y_i \frac{1}{g(w^T x_i)} - (1 - y_i) \frac{1}{1 - g(w^T x_i)} \right) g(w^T x_i) \left( 1 - g(w^T x_i) \right) \frac{\partial}{\partial w_j} w^T x_i \\
&= \sum_i \left( y_i \left( 1 - g(w^T x_i) \right) - (1 - y_i)\, g(w^T x_i) \right) x_i^j \\
&= \sum_i \left( y_i - h_w(x_i) \right) x_i^j
\end{aligned}$$

The resulting update rule is:

$$w_j := w_j + \alpha \sum_i \left( y_i - h_w(x_i) \right) x_i^j$$

where in $x_i^j$, $i$ indexes the $i$-th sample and $j$ the $j$-th feature.
This is exactly the update implemented in the code:

weights = weights + alpha * dataMatrix.transpose()* error

dataMatrix.transpose() is a 3×100 matrix and error is a 100×1 vector, so dataMatrix.transpose() * error is a 3×1 vector: the gradient that updates $W$.
Its components for $w_1, w_2, w_3$ are the dot products of the corresponding feature column over all training samples with the error vector, i.e.
$$w_1 = \begin{bmatrix} x_1^1 & x_2^1 & \cdots & x_{100}^1 \end{bmatrix} \begin{bmatrix} e_0 \\ e_1 \\ \vdots \\ e_{99} \end{bmatrix}, \qquad
w_2 = \begin{bmatrix} x_1^2 & x_2^2 & \cdots & x_{100}^2 \end{bmatrix} \begin{bmatrix} e_0 \\ e_1 \\ \vdots \\ e_{99} \end{bmatrix}, \qquad
w_3 = \begin{bmatrix} x_1^3 & x_2^3 & \cdots & x_{100}^3 \end{bmatrix} \begin{bmatrix} e_0 \\ e_1 \\ \vdots \\ e_{99} \end{bmatrix}$$
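A few lines of numpy confirm that the vectorized update and the per-column dot products agree (a minimal sketch on random stand-in data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # stand-in for dataMatrix (100 samples, 3 columns)
e = rng.normal(size=(100, 1))   # stand-in for the error vector (100x1)

vectorized = X.T @ e            # 3x1, matching dataMatrix.transpose() * error
per_column = np.array([[X[:, j] @ e[:, 0]] for j in range(3)])  # column by column
print(np.allclose(vectorized, per_column))  # True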

Stochastic gradient ascent

# -*- coding: utf-8 -*-
"""
Created on Fri Aug 18 01:05:58 2017
@author: LiLong
"""
from numpy import *

# loadDataSet, sigmoid, gradAscent and plotBestFit are identical to the first
# listing, with one exception: plotBestFit no longer calls weights.getA(),
# because the stochastic versions below work with plain numpy arrays throughout,
# so there is no matrix to convert.

# stochastic gradient ascent
def stocGradAscent0(dataMatrix, classLabels):
    m, n = shape(dataMatrix)   # 100 samples x 3 columns
    alpha = 0.01
    weights = ones(n)
    for i in range(m):         # a single pass over the m samples
        h = sigmoid(sum(dataMatrix[i] * weights))  # one sample at a time
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights

# improved stochastic gradient ascent
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = shape(dataMatrix)
    weights = ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))  # indices of samples not yet used this pass
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.0001  # step size adjusted at every update
            randIndex = int(random.uniform(0, len(dataIndex)))  # random choice
            sampleIdx = dataIndex[randIndex]  # index through dataIndex so each
                                              # sample is used once per pass
            h = sigmoid(sum(dataMatrix[sampleIdx] * weights))
            error = classLabels[sampleIdx] - h
            weights = weights + alpha * error * dataMatrix[sampleIdx]
            del dataIndex[randIndex]
    return weights

# entry point
dataArr, labelMat = loadDataSet()                # dataArr is a list of rows
# weights = gradAscent(dataArr, labelMat)        # batch version: 100x3 data, label list
# plotBestFit(weights)
weights = stocGradAscent0(array(dataArr), labelMat)       # array() gives a numpy ndarray
plotBestFit(weights)
weights = stocGradAscent1(array(dataArr), labelMat)       # improved SGA, default 150 passes
plotBestFit(weights)
weights = stocGradAscent1(array(dataArr), labelMat, 500)  # improved SGA, 500 passes
plotBestFit(weights)

Result:

[Figure: decision boundary found by stocGradAscent0 after a single pass over the data]

Plain stochastic gradient ascent misclassifies about a third of the samples here, but it traverses the data set only once, whereas the batch method above ran 500 iterations.

[Figure: decision boundary from the improved stochastic gradient ascent, default 150 passes]

[Figure: decision boundary from the improved stochastic gradient ascent, 500 passes]

The improved stochastic gradient ascent algorithm is a clear improvement. The default number of passes over the data is 150, and it can also be run for 500; the results are shown in the two figures above. The improvements are:

  • The step size is no longer fixed; it follows a roughly decreasing trend, and it controls how much influence each new sample has (see the sketch after this list)

  • Samples are picked at random, which reduces the periodic oscillations in the weight coefficients

  • The improved algorithm converges faster, at lower computational cost
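For the step-size point, here is the schedule from stocGradAscent1 in isolation (a minimal sketch): alpha shrinks as the pass number j and the within-pass counter i grow, but the 0.0001 term keeps it from ever reaching zero.

# the alpha schedule used in stocGradAscent1
for j in [0, 1, 10, 100]:     # outer pass number
    for i in [0, 50, 99]:     # update counter within a pass of 100 samples
        alpha = 4 / (1.0 + j + i) + 0.0001
        print('j=%3d  i=%2d  alpha=%.4f' % (j, i, alpha))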

Case study: predicting horse mortality from colic symptoms

# -*- coding: utf-8 -*-
"""
Created on Fri Aug 18 01:05:58 2017
@author: LiLong
"""
from numpy import *

def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))  # elementwise; maps every entry into (0, 1)

# improved stochastic gradient ascent, identical to the listing above
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = shape(dataMatrix)
    weights = ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))  # indices of samples not yet used this pass
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.0001  # step size adjusted at every update
            randIndex = int(random.uniform(0, len(dataIndex)))  # random choice
            sampleIdx = dataIndex[randIndex]  # each sample is used once per pass
            h = sigmoid(sum(dataMatrix[sampleIdx] * weights))
            error = classLabels[sampleIdx] - h
            weights = weights + alpha * error * dataMatrix[sampleIdx]
            del dataIndex[randIndex]
    return weights

# classify one input vector with the learned weights
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

# train on the training file, then measure the error rate on the test file
def colicTest():
    frTrain = open('horseColicTraining.txt'); frTest = open('horseColicTest.txt')
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')  # fields are tab-separated
        lineArr = []
        for i in range(21):                  # 21 features per sample
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))  # column 22 is the label
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000)  # 1000 passes
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1
    errorRate = float(errorCount) / numTestVec
    print("the error rate of this test is: %f" % errorRate)
    return errorRate

def multiTest():
    numTests = 10; errorSum = 0.0
    for k in range(numTests):  # average over 10 independent runs
        errorSum += colicTest()
    print("after %d iterations the average error rate is: %f"
          % (numTests, errorSum / float(numTests)))

# entry point
multiTest()

Result:

runfile('C:/Users/LiLong/Desktop/logistic_regression/logistics_re.py', wdir='C:/Users/LiLong/Desktop/logistic_regression')
the error rate of this test is: 0.343284
the error rate of this test is: 0.328358
the error rate of this test is: 0.298507
the error rate of this test is: 0.283582
the error rate of this test is: 0.388060
the error rate of this test is: 0.343284
the error rate of this test is: 0.328358
the error rate of this test is: 0.432836
the error rate of this test is: 0.268657
the error rate of this test is: 0.358209
after 10 iterations the average error rate is: 0.337313

The error rate comes out around 34%. It can be lowered by tuning the number of passes and the step size; for the data preprocessing, see Machine Learning in Action.
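For reference, the preprocessing described in Machine Learning in Action replaces a missing feature value with 0, since a zero feature leaves the corresponding weight untouched (alpha * error * 0 is a no-op for that weight). A minimal sketch of that idea; fill_missing and the '?' marker are illustrative assumptions, not the book's exact code:

def fill_missing(rows, missing_marker='?'):
    # replace missing feature values with 0.0 so they do not move any weight
    cleaned = []
    for row in rows:
        cleaned.append([0.0 if v == missing_marker else float(v) for v in row])
    return cleaned

print(fill_missing([['1.2', '?', '3'], ['?', '0.5', '2']]))
# [[1.2, 0.0, 3.0], [0.0, 0.5, 2.0]]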
