Deep Learning Algorithm Practice 6: Applying the Logistic Regression Algorithm
In the previous post we walked through the implementation of the algorithm and verified its effectiveness on MNIST handwritten digit recognition.
But our goal in learning logistic regression is to solve real problems, not to study the algorithm for its own sake. Logistic regression is widely used in practice. In medicine, for example, we can list a series of disease-related factors and then, given a particular patient's situation, apply logistic regression to judge whether that patient has a certain disease. The algorithm does have limitations: it is well suited to linearly separable classification problems, but its value drops sharply when the classes are not linearly separable. However, logistic regression can be viewed as a feedforward network without hidden layers; by adding hidden layers we can handle all kinds of non-linearly-separable problems. With the help of the Theano framework, later posts will cover the BP network and the multi-layer convolutional network (LeNet), and you will see that implementing these models in Theano is quite simple.
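To make the linear-separability point concrete: a softmax (logistic regression) classifier computes class scores W·x + b and picks the class with the highest score, so the boundary between any two classes is a straight line (or hyperplane) in the input space. Here is a minimal NumPy sketch with made-up weights, purely for illustration and not the parameters learned later in this post:

import numpy

# Hypothetical parameters of a 2-feature, 2-class softmax classifier.
W = numpy.array([[-1.0, 0.0],
                 [ 1.0, 0.0]])   # shape (n_in=2, n_out=2)
b = numpy.array([-0.25, 0.0])

def predict(samples):
    scores = samples.dot(W) + b        # linear class scores
    return scores.argmax(axis=1)       # argmax of softmax == argmax of the scores

# score_0 - score_1 = (y - x) - 0.25, so the decision boundary is the straight line y = x + 0.25.
print(predict(numpy.array([[2.0, 2.0], [2.0, 5.0]])))   # -> [1 0]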
Back to the topic. To apply logistic regression to a problem of our own, the main thing we need to change is the load_data function, so that it reads data from the source we define. We first design a utility class SegLoader for reading the training data, in the file seg_loader.py, as shown below:
from __future__ import print_function

__docformat__ = 'restructedtext en'

import six.moves.cPickle as pickle
import gzip
import os
import sys
import timeit

import numpy

import theano
import theano.tensor as T


class SegLoader(object):
    def load_data(self, dataset):
        # Six two-feature samples are enough for this toy linear-separation task.
        samplesNumber = 6
        features = 2
        train_set = (
            numpy.ndarray(shape=(samplesNumber, features), dtype=numpy.float32),
            numpy.ndarray(shape=(samplesNumber,), dtype=int)
        )
        self.prepare_dataset(train_set)
        # Reuse the training samples as validation and test sets for this toy example.
        valid_set = (train_set[0].copy(), train_set[1].copy())
        test_set = (train_set[0].copy(), train_set[1].copy())
        test_set_x, test_set_y = self.shared_dataset(test_set)
        valid_set_x, valid_set_y = self.shared_dataset(valid_set)
        train_set_x, train_set_y = self.shared_dataset(train_set)
        rval = [(train_set_x, train_set_y), (valid_set_x, valid_set_y),
                (test_set_x, test_set_y)]
        return rval

    def shared_dataset(self, data_xy, borrow=True):
        # Wrap the NumPy arrays in Theano shared variables so they can live in GPU memory.
        data_x, data_y = data_xy
        shared_x = theano.shared(numpy.asarray(data_x,
                                               dtype=theano.config.floatX),
                                 borrow=borrow)
        shared_y = theano.shared(numpy.asarray(data_y,
                                               dtype=theano.config.floatX),
                                 borrow=borrow)
        # Labels are stored as floatX but used as integers in the computation graph.
        return shared_x, T.cast(shared_y, 'int32')

    def prepare_dataset(self, dataset):
        # Three points on the line y = x, labelled 1.
        dataset[0][0][0] = 1.0
        dataset[0][0][1] = 1.0
        dataset[1][0] = 1

        dataset[0][1][0] = 2.0
        dataset[0][1][1] = 2.0
        dataset[1][1] = 1

        dataset[0][2][0] = 3.0
        dataset[0][2][1] = 3.0
        dataset[1][2] = 1

        # Three points off the line, labelled 0.
        dataset[0][3][0] = 1.5
        dataset[0][3][1] = 2.0
        dataset[1][3] = 0

        dataset[0][4][0] = 2.5
        dataset[0][4][1] = 4.0
        dataset[1][4] = 0

        dataset[0][5][0] = 3.5
        dataset[0][5][1] = 7.0
        dataset[1][5] = 0
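Before moving on, a quick sanity check can confirm what load_data returns. This is a minimal sketch, not one of the files in this post; it assumes Theano is installed and seg_loader.py is on the path:

from seg_loader import SegLoader

loader = SegLoader()
datasets = loader.load_data(None)        # the dataset argument is ignored by SegLoader
train_set_x, train_set_y = datasets[0]
print(train_set_x.get_value(borrow=True).shape)   # (6, 2): six samples, two features
print(train_set_y.eval())                          # [1 1 1 0 0 0]: the class labels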
The SegLoader code above is very simple. It builds a tuple train_set with two elements: the first is a two-dimensional float32 array in which each row is a sample, the first column being the X coordinate and the second the Y coordinate; the second element is a one-dimensional integer array in which each entry is the sample's class. There are two classes: 1 means the point lies on the line Y = X and 0 means it does not. prepare_dataset fills in six training samples. Because the problem is so simple, six samples are basically enough, but a real problem would obviously need a much larger sample size. Next we define the execution engine for this linear segmentation, LrSegEngine, in the file lr_seg_engine.py, as shown below:
from __future__ import print_function

__docformat__ = 'restructedtext en'

import six.moves.cPickle as pickle
import gzip
import os
import sys
import timeit

import numpy

import theano
import theano.tensor as T

from logistic_regression import LogisticRegression
from seg_loader import SegLoader


class LrSegEngine(object):
    def __init__(self):
        print("Logistic Regression MNIST Engine")
        self.learning_rate = 0.13
        self.n_epochs = 1000
        # With only six samples, each mini-batch holds a single sample.
        self.batch_size = 1
        self.dataset = 'mnist.pkl.gz'  # ignored by SegLoader.load_data

    def train(self):
        print("Yantao:train the model")
        loader = SegLoader()
        datasets = loader.load_data(self.dataset)
        train_set_x, train_set_y = datasets[0]
        valid_set_x, valid_set_y = datasets[1]
        test_set_x, test_set_y = datasets[2]
        n_train_batches = train_set_x.get_value(borrow=True).shape[0] // self.batch_size
        n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] // self.batch_size
        n_test_batches = test_set_x.get_value(borrow=True).shape[0] // self.batch_size
        index = T.lscalar()    # mini-batch index
        x = T.matrix('x')      # sample features: the (x, y) coordinates
        y = T.ivector('y')     # labels: 1 = on the line y = x, 0 = off the line

        # Two input features, two output classes.
        classifier = LogisticRegression(input=x, n_in=2, n_out=2)
        cost = classifier.negative_log_likelihood(y)
        test_model = theano.function(
            inputs=[index],
            outputs=classifier.errors(y),
            givens={
                x: test_set_x[index * self.batch_size: (index + 1) * self.batch_size],
                y: test_set_y[index * self.batch_size: (index + 1) * self.batch_size]
            }
        )
        validate_model = theano.function(
            inputs=[index],
            outputs=classifier.errors(y),
            givens={
                x: valid_set_x[index * self.batch_size: (index + 1) * self.batch_size],
                y: valid_set_y[index * self.batch_size: (index + 1) * self.batch_size]
            }
        )
        # Plain gradient descent on W and b.
        g_W = T.grad(cost=cost, wrt=classifier.W)
        g_b = T.grad(cost=cost, wrt=classifier.b)
        updates = [(classifier.W, classifier.W - self.learning_rate * g_W),
                   (classifier.b, classifier.b - self.learning_rate * g_b)]
        train_model = theano.function(
            inputs=[index],
            outputs=cost,
            updates=updates,
            givens={
                x: train_set_x[index * self.batch_size: (index + 1) * self.batch_size],
                y: train_set_y[index * self.batch_size: (index + 1) * self.batch_size]
            }
        )
        # Early stopping: training stops once the validation error has not
        # improved for `patience` iterations.
        patience = 5000
        patience_increase = 2
        improvement_threshold = 0.995
        validation_frequency = min(n_train_batches, patience // 2)
        best_validation_loss = numpy.inf
        test_score = 0.
        start_time = timeit.default_timer()
        done_looping = False
        epoch = 0
        while (epoch < self.n_epochs) and (not done_looping):
            epoch = epoch + 1
            for minibatch_index in range(n_train_batches):
                minibatch_avg_cost = train_model(minibatch_index)

                iter = (epoch - 1) * n_train_batches + minibatch_index
                if (iter + 1) % validation_frequency == 0:
                    # Evaluate the current model on the validation set.
                    validation_losses = [validate_model(i)
                                         for i in range(n_valid_batches)]
                    this_validation_loss = numpy.mean(validation_losses)
                    print(
                        'epoch %i, minibatch %i/%i, validation error %f %%' %
                        (
                            epoch,
                            minibatch_index + 1,
                            n_train_batches,
                            this_validation_loss * 100.
                        )
                    )
                    if this_validation_loss < best_validation_loss:
                        # A significant improvement extends the patience budget.
                        if this_validation_loss < best_validation_loss * improvement_threshold:
                            patience = max(patience, iter * patience_increase)
                        best_validation_loss = this_validation_loss

                        test_losses = [test_model(i)
                                       for i in range(n_test_batches)]
                        test_score = numpy.mean(test_losses)
                        print(
                            (
                                '    epoch %i, minibatch %i/%i, test error of'
                                ' best model %f %%'
                            ) %
                            (
                                epoch,
                                minibatch_index + 1,
                                n_train_batches,
                                test_score * 100.
                            )
                        )
                        # Persist the best model seen so far.
                        with open('best_model.pkl', 'wb') as f:
                            pickle.dump(classifier, f)
                if patience <= iter:
                    done_looping = True
                    break
        end_time = timeit.default_timer()
        print(
            (
                'Optimization complete with best validation score of %f %%,'
                ' with test performance %f %%'
            )
            % (best_validation_loss * 100., test_score * 100.)
        )
        print('The code run for %d epochs, with %f epochs/sec' % (
            epoch, 1. * epoch / (end_time - start_time)))
        print(('The code for file ' +
               os.path.split(__file__)[1] +
               ' ran for %.1fs' % ((end_time - start_time))), file=sys.stderr)

    def run(self, data):
        print("run the model")
        # Load the best model saved during training and build a prediction function.
        classifier = pickle.load(open('best_model.pkl', 'rb'))
        predict_model = theano.function(
            inputs=[classifier.input],
            outputs=classifier.y_pred
        )
        rst = predict_model(data)
        print(rst)
The train method here is essentially the same as the MNIST handwritten-digit code in the previous post; only a few points need attention. First, since we have only six samples, the mini-batch size is set to 1 (in the MNIST example there were 60,000 training samples, so the batch size was 600). Second, when the logistic regression model is initialized, the input dimension n_in is set to 2, meaning each sample has only two features, its x and y coordinates; the output dimension is also 2, meaning there are two classes: 1 for points on the line y = x and 0 for points off it. Next we define the logistic regression model class LogisticRegression, in the file logistic_regression.py, as shown below:
from __future__ import print_function

__docformat__ = 'restructedtext en'

import six.moves.cPickle as pickle
import gzip
import os
import sys
import timeit

import numpy

import theano
import theano.tensor as T


class LogisticRegression(object):
    def __init__(self, input, n_in, n_out):
        # Weight matrix W (n_in x n_out) and bias vector b, both initialized to zero.
        self.W = theano.shared(
            value=numpy.zeros(
                (n_in, n_out),
                dtype=theano.config.floatX
            ),
            name='W',
            borrow=True
        )
        self.b = theano.shared(
            value=numpy.zeros(
                (n_out,),
                dtype=theano.config.floatX
            ),
            name='b',
            borrow=True
        )
        # Class probabilities and the predicted class (argmax over classes).
        self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)
        self.y_pred = T.argmax(self.p_y_given_x, axis=1)
        self.params = [self.W, self.b]
        self.input = input
        print("Yantao: ***********************************")

    def negative_log_likelihood(self, y):
        # Mean negative log-probability of the correct class over the mini-batch.
        return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])

    def errors(self, y):
        if y.ndim != self.y_pred.ndim:
            raise TypeError(
                'y should have the same shape as self.y_pred',
                ('y', y.type, 'y_pred', self.y_pred.type)
            )
        if y.dtype.startswith('int'):
            # Fraction of mis-classified samples in the mini-batch.
            return T.mean(T.neq(self.y_pred, y))
        else:
            raise NotImplementedError()
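One detail worth spelling out is the indexing in negative_log_likelihood: T.log(self.p_y_given_x)[T.arange(y.shape[0]), y] selects, for every sample in the mini-batch, the log-probability assigned to its true class. A small NumPy illustration with made-up probabilities (the values are hypothetical, only the indexing matters):

import numpy

# Predicted class probabilities for two samples over two classes (hypothetical values).
p_y_given_x = numpy.array([[0.7, 0.3],
                           [0.2, 0.8]])
y = numpy.array([0, 1])                        # true class of each sample

log_p = numpy.log(p_y_given_x)
picked = log_p[numpy.arange(y.shape[0]), y]    # log-probability of the correct class per sample
nll = -picked.mean()                           # mean negative log-likelihood
print(picked)   # [log 0.7, log 0.8]
print(nll)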
The LogisticRegression code above is almost unchanged from the previous post; it has simply been moved into its own file. Next comes the model training script lr_train.py, as shown below:
from __future__ import print_function

__docformat__ = 'restructedtext en'

import six.moves.cPickle as pickle
import gzip
import os
import sys
import timeit

import numpy

import theano
import theano.tensor as T

from logistic_regression import LogisticRegression
from seg_loader import SegLoader
from lr_seg_engine import LrSegEngine

if __name__ == '__main__':
    engine = LrSegEngine()
    engine.train()
The code above simply calls the train method of the logistic regression segmentation engine to train the model; the best result is saved to the file best_model.pkl. Once the model has been trained, we can use it for classification. The code of lr_run.py is shown below:
from seg_loader import SegLoader
from lr_seg_engine import LrSegEngine

if __name__ == '__main__':
    print("test program v1.0")
    engine = LrSegEngine()
    data = [[2.0, 2.0]]
    print(data)
    engine.run(data)
The code above first creates a two-dimensional array holding a single sample with coordinates (2.0, 2.0), then calls the run method of the logistic regression segmentation engine, which prints the classification result. Running the program produces output similar to the following:
test program v1.0
Logistic Regression MNIST Engine
[[2.0, 2.0]]
run the model
[1]
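The final line, [1], is the predicted class: the point (2.0, 2.0) is judged to lie on the line y = x. As a further check (a sketch, not one of the files above; the exact result depends on how well training on the six samples converged), feeding a point that lies clearly off the line should produce class 0:

from lr_seg_engine import LrSegEngine

if __name__ == '__main__':
    engine = LrSegEngine()
    # (2.5, 4.0) is well above the line y = x; expected output: [0]
    engine.run([[2.5, 4.0]])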