Logistic Regression


Let's skip the algorithmic background for a moment and go straight to the code.

Finding the best-fit parameters with gradient ascent

# -*- coding: utf-8 -*-
"""
Created on Fri Aug 18 01:05:58 2017
@author: LiLong
"""
from numpy import *

def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        # strip() removes the given characters (whitespace by default) from both ends
        lineArr = line.strip().split()
        # prepend a constant 1, so each row has the form [1, x1, x2]
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat  # returned as plain Python lists

def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))  # elementwise; maps every entry into (0, 1)

# gradient ascent optimization
def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)              # convert to a numpy matrix, 100x3
    labelMat = mat(classLabels).transpose()  # transpose to a 100x1 column
    m, n = shape(dataMatrix)                 # rows = samples, columns = features
    print('shape(dataMatrix):', shape(dataMatrix))
    alpha = 0.001     # step size
    maxCycles = 500   # maximum number of iterations
    weights = ones((n, 1))  # 3x1: one weight per input column
    print('weights:', shape(weights))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)  # logistic model: z = w0 + w1*x1 + w2*x2, w0 plays the role of b
        error = (labelMat - h)             # 100x1
        # gradient step from the maximum-likelihood derivation;
        # note that every update sweeps the whole data set
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights

# plot the data set and the fitted logistic decision line
def plotBestFit(weights):
    import matplotlib.pyplot as plt
    weights = weights.getA()           # must be converted to an ndarray
    dataMat, labelMat = loadDataSet()  # reload training data and labels
    dataArr = array(dataMat)           # convert to a numpy array for indexing
    n = shape(dataArr)[0]              # number of samples
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])  # one class
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])  # the other class
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')  # scatter plot
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    # the decision boundary is 0 = w0 + w1*x + w2*y, solved here for y
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()

# entry point: learn the model parameters of y = w^T x
dataArr, labelMat = loadDataSet()
weights = gradAscent(dataArr, labelMat)  # dataArr: 100x3; labelMat is a list
plotBestFit(weights)

Result:

[Figure: training data scatter plot with the decision boundary found by gradient ascent]
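One practical aside, not in the original code: for strongly negative inputs, exp(-inX) overflows float64 and numpy emits a RuntimeWarning (this can show up later with the unnormalized horse-colic features). A numerically stable variant, sketched here as an optional replacement, splits on the sign so it never exponentiates a large positive number:

import numpy as np

def stable_sigmoid(x):
    # sketch of a numerically stable sigmoid
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))  # safe: exponent is <= 0
    ex = np.exp(x[~pos])                      # safe: exponent is < 0
    out[~pos] = ex / (1.0 + ex)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # [0. , 0.5, 1. ], no warning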

First, a couple of syntax notes, to save you a long debugging session:

  • Note on numpy's ones(): the shape must be passed as a single int, tuple, or list. A second positional argument is interpreted as the dtype, which is why ones(10,1) raises a TypeError:
>>> ones(10)
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
>>> ones((10,1))
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])
>>> ones(10,1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program_software\Anaconda\lib\site-packages\numpy\core\numeric.py", line 190, in ones
    a = empty(shape, dtype, order)
TypeError: data type not understood
>>> ones([10,1])
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])
  • Note on numpy.matrix.getA():

matrix.getA()

Return self as an ndarray object.
Equivalent to np.asarray(self).

Parameters: None
Returns: ret : ndarray

weights
Out[16]:
matrix([[ 4.12414349],
        [ 0.48007329],
        [-0.6168482 ]])

weights.getA()
Out[17]:
array([[ 4.12414349],
       [ 0.48007329],
       [-0.6168482 ]])

The matrix must be converted to an array with weights = weights.getA(); without the conversion the regression line is never drawn, because indexing an np.matrix yields 1×1 matrices rather than scalars, and the arithmetic in plotBestFit then produces a 1×60 matrix that matplotlib cannot plot against x.
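To see this concretely, here is a minimal sketch (the all-ones weights are placeholders): the same arithmetic yields a 1×60 matrix in one case and a plain length-60 array in the other.

import numpy as np

w = np.mat(np.ones((3, 1)))          # a 3x1 np.matrix, like the one gradAscent returns
x = np.arange(-3.0, 3.0, 0.1)        # 60 points
y_mat = (-w[0] - w[1] * x) / w[2]    # matrix indexing gives 1x1 matrices -> 1x60 matrix
wa = w.getA()                        # convert to a plain ndarray first
y_arr = (-wa[0] - wa[1] * x) / wa[2] # ordinary broadcasting -> shape (60,)
print(type(y_mat), y_mat.shape)      # <class 'numpy.matrix'> (1, 60)
print(type(y_arr), y_arr.shape)      # <class 'numpy.ndarray'> (60,)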

A summary of issues encountered during the implementation:

  • In binomial logistic regression, it is convenient to augment the weight vector and the input vector. The general form is:

$$y = W^T X + b$$

For convenience this is expanded to

$$W = (w^{(1)}, w^{(2)}, w^{(3)}, \dots, w^{(n)}, b)^T, \qquad X = (x^{(1)}, x^{(2)}, x^{(3)}, \dots, x^{(n)}, 1)^T$$

so that $y = W^T X$.

In the code above, $w_0$ plays the role of $b$:

dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])]) 

The first column of the input data is all 1s, and the second and third columns of dataMat hold the features $x_1$ and $x_2$, so each row has the form [1, x1, x2].
Therefore, the line

h = sigmoid(dataMatrix*weights)

computes, for every row at once, the product

$$[b,\ w_1,\ w_2] \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}$$

followed by the elementwise sigmoid.
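As a quick numeric illustration (a minimal sketch; the weight values are made up), for a single sample this is just a dot product pushed through the sigmoid:

import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

w = np.array([4.12, 0.48, -0.62])   # [b, w1, w2] -- hypothetical fitted weights
row = np.array([1.0, 0.3, 2.5])     # [1, x1, x2] -- one made-up sample row
print(sigmoid(np.dot(w, row)))      # P(class 1): here about 0.94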

  • The algorithm is logistic regression:

$$h_w(x) = g(w^T x) = \frac{1}{1 + e^{-w^T x}}, \qquad g(z) = \frac{1}{1 + e^{-z}}$$

The function $g(z)$ is the sigmoid function; its derivative is:

$$g'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right) = g(z)\,(1 - g(z))$$
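This identity is easy to check numerically; a minimal sketch using a central finite difference:

import numpy as np

def g(z):
    return 1.0 / (1 + np.exp(-z))

z, eps = 0.7, 1e-6
numeric = (g(z + eps) - g(z - eps)) / (2 * eps)  # finite-difference slope at z
analytic = g(z) * (1 - g(z))                     # the closed form derived above
print(numeric, analytic)                         # both approximately 0.2217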

When learning the logistic model, the parameters for a given training set $T$ can be estimated by maximum likelihood, which yields the regression model. Let:

$$P(Y = 1 \mid x) = h_w(x), \qquad P(Y = 0 \mid x) = 1 - h_w(x)$$

The likelihood function is:

$$L(w) = \prod_i \left( h_w(x_i) \right)^{y_i} \left( 1 - h_w(x_i) \right)^{1 - y_i}$$

Here $x_i$ denotes the $i$-th sample and $y_i$ its class label.

The log-likelihood is:

$$\delta(w) = \log L(w) = \sum_i \left( y_i \log h(x_i) + (1 - y_i) \log (1 - h(x_i)) \right)$$

To maximize it, apply gradient ascent and climb to the maximum:

$$w := w + \alpha \nabla_w \delta(w), \qquad \left[ \nabla_w \delta(w) \right]_j = \frac{\partial}{\partial w_j} \delta(w), \qquad z = w^T x$$

$$\begin{aligned}
\frac{\partial}{\partial w_j} \delta(w)
&= \sum_i \left( y_i \frac{1}{g(w^T x_i)} - (1 - y_i) \frac{1}{1 - g(w^T x_i)} \right) \frac{\partial}{\partial w_j} g(w^T x_i) \\
&= \sum_i \left( y_i \frac{1}{g(w^T x_i)} - (1 - y_i) \frac{1}{1 - g(w^T x_i)} \right) g(w^T x_i) \left( 1 - g(w^T x_i) \right) \frac{\partial}{\partial w_j} w^T x_i \\
&= \sum_i \left( y_i \left( 1 - g(w^T x_i) \right) - (1 - y_i)\, g(w^T x_i) \right) x_i^j \\
&= \sum_i \left( y_i - h_w(x_i) \right) x_i^j
\end{aligned}$$

The resulting update rule is:

$$w_j := w_j + \alpha \sum_i \left( y_i - h_w(x_i) \right) x_i^j$$

where in $x_i^j$, $i$ indexes the $i$-th sample and $j$ the $j$-th feature.
This is exactly the update implemented in the code:

weights = weights + alpha * dataMatrix.transpose()* error

dataMatrix.transpose() is a 3×100 matrix and error is a 100×1 vector, so dataMatrix.transpose() * error is a 3×1 vector: the gradient that updates $W$.
Its components for $w_1, w_2, w_3$ are the dot products of the corresponding feature column over all training samples with the error vector, i.e.
$$w_1 = \begin{bmatrix} x_1^1 & x_2^1 & \cdots & x_{100}^1 \end{bmatrix} \begin{bmatrix} e_0 \\ e_1 \\ \vdots \\ e_{99} \end{bmatrix}, \qquad
w_2 = \begin{bmatrix} x_1^2 & x_2^2 & \cdots & x_{100}^2 \end{bmatrix} \begin{bmatrix} e_0 \\ e_1 \\ \vdots \\ e_{99} \end{bmatrix}, \qquad
w_3 = \begin{bmatrix} x_1^3 & x_2^3 & \cdots & x_{100}^3 \end{bmatrix} \begin{bmatrix} e_0 \\ e_1 \\ \vdots \\ e_{99} \end{bmatrix}$$
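A few lines of numpy confirm that the vectorized update and the per-column dot products agree (a minimal sketch on random stand-in data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # stand-in for dataMatrix (100 samples, 3 columns)
e = rng.normal(size=(100, 1))   # stand-in for the error vector (100x1)

vectorized = X.T @ e            # 3x1, matching dataMatrix.transpose() * error
per_column = np.array([[X[:, j] @ e[:, 0]] for j in range(3)])  # column by column
print(np.allclose(vectorized, per_column))  # True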

Stochastic gradient ascent

# -*- coding: utf-8 -*-
"""
Created on Fri Aug 18 01:05:58 2017
@author: LiLong
"""
from numpy import *

# loadDataSet, sigmoid, gradAscent and plotBestFit are identical to the first
# listing, with one exception: plotBestFit no longer calls weights.getA(),
# because the stochastic versions below work with plain numpy arrays throughout,
# so there is no matrix to convert.

# stochastic gradient ascent
def stocGradAscent0(dataMatrix, classLabels):
    m, n = shape(dataMatrix)   # 100 samples x 3 columns
    alpha = 0.01
    weights = ones(n)
    for i in range(m):         # a single pass over the m samples
        h = sigmoid(sum(dataMatrix[i] * weights))  # one sample at a time
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights

# improved stochastic gradient ascent
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = shape(dataMatrix)
    weights = ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))  # indices of samples not yet used this pass
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.0001  # step size adjusted at every update
            randIndex = int(random.uniform(0, len(dataIndex)))  # random choice
            sampleIdx = dataIndex[randIndex]  # index through dataIndex so each
                                              # sample is used once per pass
            h = sigmoid(sum(dataMatrix[sampleIdx] * weights))
            error = classLabels[sampleIdx] - h
            weights = weights + alpha * error * dataMatrix[sampleIdx]
            del dataIndex[randIndex]
    return weights

# entry point
dataArr, labelMat = loadDataSet()                # dataArr is a list of rows
# weights = gradAscent(dataArr, labelMat)        # batch version: 100x3 data, label list
# plotBestFit(weights)
weights = stocGradAscent0(array(dataArr), labelMat)       # array() gives a numpy ndarray
plotBestFit(weights)
weights = stocGradAscent1(array(dataArr), labelMat)       # improved SGA, default 150 passes
plotBestFit(weights)
weights = stocGradAscent1(array(dataArr), labelMat, 500)  # improved SGA, 500 passes
plotBestFit(weights)

Result:

[Figure: decision boundary found by stocGradAscent0 after a single pass over the data]

Plain stochastic gradient ascent misclassifies about a third of the samples here, but it traverses the data set only once, whereas the batch method above ran 500 iterations.

[Figure: decision boundary from the improved stochastic gradient ascent, default 150 passes]

[Figure: decision boundary from the improved stochastic gradient ascent, 500 passes]

The improved stochastic gradient ascent algorithm is a clear improvement. The default number of passes over the data is 150, and it can also be run for 500; the results are shown in the two figures above. The improvements are:

  • The step size is no longer fixed; it follows a roughly decreasing trend, and it controls how much influence each new sample has (see the sketch after this list)

  • Samples are picked at random, which reduces the periodic oscillations in the weight coefficients

  • The improved algorithm converges faster, at lower computational cost
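For the step-size point, here is the schedule from stocGradAscent1 in isolation (a minimal sketch): alpha shrinks as the pass number j and the within-pass counter i grow, but the 0.0001 term keeps it from ever reaching zero.

# the alpha schedule used in stocGradAscent1
for j in [0, 1, 10, 100]:     # outer pass number
    for i in [0, 50, 99]:     # update counter within a pass of 100 samples
        alpha = 4 / (1.0 + j + i) + 0.0001
        print('j=%3d  i=%2d  alpha=%.4f' % (j, i, alpha))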

Case study: predicting horse mortality from colic symptoms

# -*- coding: utf-8 -*-
"""
Created on Fri Aug 18 01:05:58 2017
@author: LiLong
"""
from numpy import *

def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))  # elementwise; maps every entry into (0, 1)

# improved stochastic gradient ascent, identical to the listing above
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = shape(dataMatrix)
    weights = ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))  # indices of samples not yet used this pass
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.0001  # step size adjusted at every update
            randIndex = int(random.uniform(0, len(dataIndex)))  # random choice
            sampleIdx = dataIndex[randIndex]  # each sample is used once per pass
            h = sigmoid(sum(dataMatrix[sampleIdx] * weights))
            error = classLabels[sampleIdx] - h
            weights = weights + alpha * error * dataMatrix[sampleIdx]
            del dataIndex[randIndex]
    return weights

# classify one input vector with the learned weights
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

# train on the training file, then measure the error rate on the test file
def colicTest():
    frTrain = open('horseColicTraining.txt'); frTest = open('horseColicTest.txt')
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')  # fields are tab-separated
        lineArr = []
        for i in range(21):                  # 21 features per sample
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))  # column 22 is the label
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000)  # 1000 passes
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1
    errorRate = float(errorCount) / numTestVec
    print("the error rate of this test is: %f" % errorRate)
    return errorRate

def multiTest():
    numTests = 10; errorSum = 0.0
    for k in range(numTests):  # average over 10 independent runs
        errorSum += colicTest()
    print("after %d iterations the average error rate is: %f"
          % (numTests, errorSum / float(numTests)))

# entry point
multiTest()

Result:

runfile('C:/Users/LiLong/Desktop/logistic_regression/logistics_re.py', wdir='C:/Users/LiLong/Desktop/logistic_regression')
the error rate of this test is: 0.343284
the error rate of this test is: 0.328358
the error rate of this test is: 0.298507
the error rate of this test is: 0.283582
the error rate of this test is: 0.388060
the error rate of this test is: 0.343284
the error rate of this test is: 0.328358
the error rate of this test is: 0.432836
the error rate of this test is: 0.268657
the error rate of this test is: 0.358209
after 10 iterations the average error rate is: 0.337313

The error rate comes out around 34%. It can be lowered by tuning the number of passes and the step size; for the data preprocessing, see Machine Learning in Action.
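For reference, the preprocessing described in Machine Learning in Action replaces a missing feature value with 0, since a zero feature leaves the corresponding weight untouched (alpha * error * 0 is a no-op for that weight). A minimal sketch of that idea; fill_missing and the '?' marker are illustrative assumptions, not the book's exact code:

def fill_missing(rows, missing_marker='?'):
    # replace missing feature values with 0.0 so they do not move any weight
    cleaned = []
    for row in rows:
        cleaned.append([0.0 if v == missing_marker else float(v) for v in row])
    return cleaned

print(fill_missing([['1.2', '?', '3'], ['?', '0.5', '2']]))
# [[1.2, 0.0, 3.0], [0.0, 0.5, 2.0]]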
