Deep Learning Algorithm Practice 6: Applying the Logistic Regression Algorithm
In the previous post we walked through the implementation of the algorithm and verified its effectiveness on MNIST handwritten digit recognition.
But our goal in learning logistic regression is to solve real problems, not to study the algorithm for its own sake. Logistic regression is widely used in practice. In medicine, for example, we can list a series of disease-related factors and then, given a particular patient's situation, apply logistic regression to judge whether that patient has a certain disease. The algorithm does have limitations: it is well suited to linearly separable classification problems, but its value drops sharply when the classes are not linearly separable. However, logistic regression can be viewed as a feedforward network without hidden layers; by adding hidden layers we can handle all kinds of non-linearly-separable problems. With the help of the Theano framework, later posts will cover the BP network and the multi-layer convolutional network (LeNet), and you will see that implementing these models in Theano is quite simple.
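To make the linear-separability point concrete: a softmax (logistic regression) classifier computes class scores W·x + b and picks the class with the highest score, so the boundary between any two classes is a straight line (or hyperplane) in the input space. Here is a minimal NumPy sketch with made-up weights, purely for illustration and not the parameters learned later in this post:

import numpy

# Hypothetical parameters of a 2-feature, 2-class softmax classifier.
W = numpy.array([[-1.0, 0.0],
                 [ 1.0, 0.0]])   # shape (n_in=2, n_out=2)
b = numpy.array([-0.25, 0.0])

def predict(samples):
    scores = samples.dot(W) + b        # linear class scores
    return scores.argmax(axis=1)       # argmax of softmax == argmax of the scores

# score_0 - score_1 = (y - x) - 0.25, so the decision boundary is the straight line y = x + 0.25.
print(predict(numpy.array([[2.0, 2.0], [2.0, 5.0]])))   # -> [1 0]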
Back to the topic. To apply logistic regression to a problem of our own, the main thing we need to change is the load_data function, so that it reads data from the source we define. We first design a utility class SegLoader for reading the training data, in the file seg_loader.py, as shown below:
from __future__ import print_function

__docformat__ = 'restructedtext en'

import six.moves.cPickle as pickle
import gzip
import os
import sys
import timeit

import numpy

import theano
import theano.tensor as T


class SegLoader(object):
    def load_data(self, dataset):
        # Six two-feature samples are enough for this toy linear-separation task.
        samplesNumber = 6
        features = 2
        train_set = (
            numpy.ndarray(shape=(samplesNumber, features), dtype=numpy.float32),
            numpy.ndarray(shape=(samplesNumber,), dtype=int)
        )
        self.prepare_dataset(train_set)
        # Reuse the training samples as validation and test sets for this toy example.
        valid_set = (train_set[0].copy(), train_set[1].copy())
        test_set = (train_set[0].copy(), train_set[1].copy())
        test_set_x, test_set_y = self.shared_dataset(test_set)
        valid_set_x, valid_set_y = self.shared_dataset(valid_set)
        train_set_x, train_set_y = self.shared_dataset(train_set)
        rval = [(train_set_x, train_set_y), (valid_set_x, valid_set_y),
                (test_set_x, test_set_y)]
        return rval

    def shared_dataset(self, data_xy, borrow=True):
        # Wrap the NumPy arrays in Theano shared variables so they can live in GPU memory.
        data_x, data_y = data_xy
        shared_x = theano.shared(numpy.asarray(data_x,
                                               dtype=theano.config.floatX),
                                 borrow=borrow)
        shared_y = theano.shared(numpy.asarray(data_y,
                                               dtype=theano.config.floatX),
                                 borrow=borrow)
        # Labels are stored as floatX but used as integers in the computation graph.
        return shared_x, T.cast(shared_y, 'int32')

    def prepare_dataset(self, dataset):
        # Three points on the line y = x, labelled 1.
        dataset[0][0][0] = 1.0
        dataset[0][0][1] = 1.0
        dataset[1][0] = 1

        dataset[0][1][0] = 2.0
        dataset[0][1][1] = 2.0
        dataset[1][1] = 1

        dataset[0][2][0] = 3.0
        dataset[0][2][1] = 3.0
        dataset[1][2] = 1

        # Three points off the line, labelled 0.
        dataset[0][3][0] = 1.5
        dataset[0][3][1] = 2.0
        dataset[1][3] = 0

        dataset[0][4][0] = 2.5
        dataset[0][4][1] = 4.0
        dataset[1][4] = 0

        dataset[0][5][0] = 3.5
        dataset[0][5][1] = 7.0
        dataset[1][5] = 0
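Before moving on, a quick sanity check can confirm what load_data returns. This is a minimal sketch, not one of the files in this post; it assumes Theano is installed and seg_loader.py is on the path:

from seg_loader import SegLoader

loader = SegLoader()
datasets = loader.load_data(None)        # the dataset argument is ignored by SegLoader
train_set_x, train_set_y = datasets[0]
print(train_set_x.get_value(borrow=True).shape)   # (6, 2): six samples, two features
print(train_set_y.eval())                          # [1 1 1 0 0 0]: the class labels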
The SegLoader code above is very simple. It builds a tuple train_set with two elements: the first is a two-dimensional float32 array in which each row is a sample, the first column being the X coordinate and the second the Y coordinate; the second element is a one-dimensional integer array in which each entry is the sample's class. There are two classes: 1 means the point lies on the line Y = X and 0 means it does not. prepare_dataset fills in six training samples. Because the problem is so simple, six samples are basically enough, but a real problem would obviously need a much larger sample size. Next we define the execution engine for this linear segmentation, LrSegEngine, in the file lr_seg_engine.py, as shown below:
from __future__ import print_function

__docformat__ = 'restructedtext en'

import six.moves.cPickle as pickle
import gzip
import os
import sys
import timeit

import numpy

import theano
import theano.tensor as T

from logistic_regression import LogisticRegression
from seg_loader import SegLoader


class LrSegEngine(object):
    def __init__(self):
        print("Logistic Regression MNIST Engine")
        self.learning_rate = 0.13
        self.n_epochs = 1000
        # With only six samples, each mini-batch holds a single sample.
        self.batch_size = 1
        self.dataset = 'mnist.pkl.gz'  # ignored by SegLoader.load_data

    def train(self):
        print("Yantao:train the model")
        loader = SegLoader()
        datasets = loader.load_data(self.dataset)
        train_set_x, train_set_y = datasets[0]
        valid_set_x, valid_set_y = datasets[1]
        test_set_x, test_set_y = datasets[2]
        n_train_batches = train_set_x.get_value(borrow=True).shape[0] // self.batch_size
        n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] // self.batch_size
        n_test_batches = test_set_x.get_value(borrow=True).shape[0] // self.batch_size
        index = T.lscalar()    # mini-batch index
        x = T.matrix('x')      # sample features: the (x, y) coordinates
        y = T.ivector('y')     # labels: 1 = on the line y = x, 0 = off the line

        # Two input features, two output classes.
        classifier = LogisticRegression(input=x, n_in=2, n_out=2)
        cost = classifier.negative_log_likelihood(y)
        test_model = theano.function(
            inputs=[index],
            outputs=classifier.errors(y),
            givens={
                x: test_set_x[index * self.batch_size: (index + 1) * self.batch_size],
                y: test_set_y[index * self.batch_size: (index + 1) * self.batch_size]
            }
        )
        validate_model = theano.function(
            inputs=[index],
            outputs=classifier.errors(y),
            givens={
                x: valid_set_x[index * self.batch_size: (index + 1) * self.batch_size],
                y: valid_set_y[index * self.batch_size: (index + 1) * self.batch_size]
            }
        )
        # Plain gradient descent on W and b.
        g_W = T.grad(cost=cost, wrt=classifier.W)
        g_b = T.grad(cost=cost, wrt=classifier.b)
        updates = [(classifier.W, classifier.W - self.learning_rate * g_W),
                   (classifier.b, classifier.b - self.learning_rate * g_b)]
        train_model = theano.function(
            inputs=[index],
            outputs=cost,
            updates=updates,
            givens={
                x: train_set_x[index * self.batch_size: (index + 1) * self.batch_size],
                y: train_set_y[index * self.batch_size: (index + 1) * self.batch_size]
            }
        )
        # Early stopping: training stops once the validation error has not
        # improved for `patience` iterations.
        patience = 5000
        patience_increase = 2
        improvement_threshold = 0.995
        validation_frequency = min(n_train_batches, patience // 2)
        best_validation_loss = numpy.inf
        test_score = 0.
        start_time = timeit.default_timer()
        done_looping = False
        epoch = 0
        while (epoch < self.n_epochs) and (not done_looping):
            epoch = epoch + 1
            for minibatch_index in range(n_train_batches):
                minibatch_avg_cost = train_model(minibatch_index)

                iter = (epoch - 1) * n_train_batches + minibatch_index
                if (iter + 1) % validation_frequency == 0:
                    # Evaluate the current model on the validation set.
                    validation_losses = [validate_model(i)
                                         for i in range(n_valid_batches)]
                    this_validation_loss = numpy.mean(validation_losses)
                    print(
                        'epoch %i, minibatch %i/%i, validation error %f %%' %
                        (
                            epoch,
                            minibatch_index + 1,
                            n_train_batches,
                            this_validation_loss * 100.
                        )
                    )
                    if this_validation_loss < best_validation_loss:
                        # A significant improvement extends the patience budget.
                        if this_validation_loss < best_validation_loss * improvement_threshold:
                            patience = max(patience, iter * patience_increase)
                        best_validation_loss = this_validation_loss

                        test_losses = [test_model(i)
                                       for i in range(n_test_batches)]
                        test_score = numpy.mean(test_losses)
                        print(
                            (
                                '    epoch %i, minibatch %i/%i, test error of'
                                ' best model %f %%'
                            ) %
                            (
                                epoch,
                                minibatch_index + 1,
                                n_train_batches,
                                test_score * 100.
                            )
                        )
                        # Persist the best model seen so far.
                        with open('best_model.pkl', 'wb') as f:
                            pickle.dump(classifier, f)
                if patience <= iter:
                    done_looping = True
                    break
        end_time = timeit.default_timer()
        print(
            (
                'Optimization complete with best validation score of %f %%,'
                ' with test performance %f %%'
            )
            % (best_validation_loss * 100., test_score * 100.)
        )
        print('The code run for %d epochs, with %f epochs/sec' % (
            epoch, 1. * epoch / (end_time - start_time)))
        print(('The code for file ' +
               os.path.split(__file__)[1] +
               ' ran for %.1fs' % ((end_time - start_time))), file=sys.stderr)

    def run(self, data):
        print("run the model")
        # Load the best model saved during training and build a prediction function.
        classifier = pickle.load(open('best_model.pkl', 'rb'))
        predict_model = theano.function(
            inputs=[classifier.input],
            outputs=classifier.y_pred
        )
        rst = predict_model(data)
        print(rst)
The train method here is essentially the same as the MNIST handwritten-digit code in the previous post; only a few points need attention. First, since we have only six samples, the mini-batch size is set to 1 (in the MNIST example there were 60,000 training samples, so the batch size was 600). Second, when the logistic regression model is initialized, the input dimension n_in is set to 2, meaning each sample has only two features, its x and y coordinates; the output dimension is also 2, meaning there are two classes: 1 for points on the line y = x and 0 for points off it. Next we define the logistic regression model class LogisticRegression, in the file logistic_regression.py, as shown below:
from __future__ import print_function

__docformat__ = 'restructedtext en'

import six.moves.cPickle as pickle
import gzip
import os
import sys
import timeit

import numpy

import theano
import theano.tensor as T


class LogisticRegression(object):
    def __init__(self, input, n_in, n_out):
        # Weight matrix W (n_in x n_out) and bias vector b, both initialized to zero.
        self.W = theano.shared(
            value=numpy.zeros(
                (n_in, n_out),
                dtype=theano.config.floatX
            ),
            name='W',
            borrow=True
        )
        self.b = theano.shared(
            value=numpy.zeros(
                (n_out,),
                dtype=theano.config.floatX
            ),
            name='b',
            borrow=True
        )
        # Class probabilities and the predicted class (argmax over classes).
        self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)
        self.y_pred = T.argmax(self.p_y_given_x, axis=1)
        self.params = [self.W, self.b]
        self.input = input
        print("Yantao: ***********************************")

    def negative_log_likelihood(self, y):
        # Mean negative log-probability of the correct class over the mini-batch.
        return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])

    def errors(self, y):
        if y.ndim != self.y_pred.ndim:
            raise TypeError(
                'y should have the same shape as self.y_pred',
                ('y', y.type, 'y_pred', self.y_pred.type)
            )
        if y.dtype.startswith('int'):
            # Fraction of mis-classified samples in the mini-batch.
            return T.mean(T.neq(self.y_pred, y))
        else:
            raise NotImplementedError()
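One detail worth spelling out is the indexing in negative_log_likelihood: T.log(self.p_y_given_x)[T.arange(y.shape[0]), y] selects, for every sample in the mini-batch, the log-probability assigned to its true class. A small NumPy illustration with made-up probabilities (the values are hypothetical, only the indexing matters):

import numpy

# Predicted class probabilities for two samples over two classes (hypothetical values).
p_y_given_x = numpy.array([[0.7, 0.3],
                           [0.2, 0.8]])
y = numpy.array([0, 1])                        # true class of each sample

log_p = numpy.log(p_y_given_x)
picked = log_p[numpy.arange(y.shape[0]), y]    # log-probability of the correct class per sample
nll = -picked.mean()                           # mean negative log-likelihood
print(picked)   # [log 0.7, log 0.8]
print(nll)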
The LogisticRegression code above is almost unchanged from the previous post; it has simply been moved into its own file. Next comes the model training script lr_train.py, as shown below:
from __future__ import print_function

__docformat__ = 'restructedtext en'

import six.moves.cPickle as pickle
import gzip
import os
import sys
import timeit

import numpy

import theano
import theano.tensor as T

from logistic_regression import LogisticRegression
from seg_loader import SegLoader
from lr_seg_engine import LrSegEngine

if __name__ == '__main__':
    engine = LrSegEngine()
    engine.train()
The code above simply calls the train method of the logistic regression segmentation engine to train the model; the best result is saved to the file best_model.pkl. Once the model has been trained, we can use it for classification. The code of lr_run.py is shown below:
from seg_loader import SegLoader
from lr_seg_engine import LrSegEngine

if __name__ == '__main__':
    print("test program v1.0")
    engine = LrSegEngine()
    data = [[2.0, 2.0]]
    print(data)
    engine.run(data)
The code above first creates a two-dimensional array holding a single sample with coordinates (2.0, 2.0), then calls the run method of the logistic regression segmentation engine, which prints the classification result. Running the program produces output similar to the following:
test program v1.0
Logistic Regression MNIST Engine
[[2.0, 2.0]]
run the model
[1]
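The final line, [1], is the predicted class: the point (2.0, 2.0) is judged to lie on the line y = x. As a further check (a sketch, not one of the files above; the exact result depends on how well training on the six samples converged), feeding a point that lies clearly off the line should produce class 0:

from lr_seg_engine import LrSegEngine

if __name__ == '__main__':
    engine = LrSegEngine()
    # (2.5, 4.0) is well above the line y = x; expected output: [0]
    engine.run([[2.5, 4.0]])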