Deep Learning for Beginners — TensorFlow (4): The CIFAR-10 Example


I recently worked through the example at https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10. It uses a large number of TensorFlow library functions, and it took me a long time to roughly understand it, so I'm writing this study note to record what the various functions do and how the whole CNN is put together.


1. Reading the Data

I covered this in an earlier post (see http://blog.csdn.net/margretwg/article/details/70168256), so I won't repeat it here.


2. Building the Model


Global parameters

import os
import re
import sys
import tarfile
import tensorflow as tf
import CIFAR10.CIFAR_input as input

FLAGS = tf.app.flags.FLAGS

# Model parameters
tf.app.flags.DEFINE_integer('batch_size', 128,
                            """Number of images to process in a batch.""")
tf.app.flags.DEFINE_string('data_dir', 'E:/Python/tensorflow/CIFAR10',
                           """Path to the CIFAR-10 data directory.""")
tf.app.flags.DEFINE_boolean('use_fp16', False,
                            """Train the model using fp16.""")

# Global constants
IMAGE_SIZE = input.IMAGE_SIZE
NUM_CLASSES = input.NUM_CLASSES
NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = input.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN
NUM_EXAMPLES_PER_EPOCH_FOR_EVAL = input.NUM_EXAMPLES_PER_EPOCH_FOR_EVAL

# Constants describing the training process
MOVING_AVERAGE_DECAY = 0.9999
NUM_EPOCH_PER_DECAY = 350.0        # epochs after which learning rate decays
LEARNING_RATE_DECAY_FACTOR = 0.1   # learning rate decay factor
INITIAL_LEARNING_RATE = 0.1



2.1 Model prediction: inference()

The pipeline is: conv1 --> pool1 --> norm1 --> conv2 --> norm2 --> pool2 --> local3 --> local4 --> softmax_linear

This function returns a tensor of shape (128, 10), i.e. batch_size × NUM_CLASSES.

def inference(images):
    """
    Build the CIFAR-10 model.
    :param images: images from distorted_inputs() or inputs()
    :return: logits
    """
    # conv1
    with tf.variable_scope('conv1') as scope:
        kernel = _variable_with_weight_decay('weights', shape=[5, 5, 3, 64], stddev=5e-2, wd=0.0)
        conv = tf.nn.conv2d(images, kernel, [1, 1, 1, 1], padding='SAME')  # convolution
        biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
        pre_activation = tf.nn.bias_add(conv, biases)  # WX + b
        conv1 = tf.nn.relu(pre_activation, name=scope.name)
        _activation_summary(conv1)

    # pool1
    pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='SAME', name='pool1')

    # norm1
    norm1 = tf.nn.lrn(pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75, name='norm1')

    # conv2
    with tf.variable_scope('conv2') as scope:
        kernel = _variable_with_weight_decay('weights', shape=[5, 5, 64, 64], stddev=5e-2, wd=0.0)
        conv = tf.nn.conv2d(norm1, kernel, [1, 1, 1, 1], padding='SAME')
        biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.1))
        pre_activation = tf.nn.bias_add(conv, biases)
        conv2 = tf.nn.relu(pre_activation, name=scope.name)
        _activation_summary(conv2)

    # norm2
    norm2 = tf.nn.lrn(conv2, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75, name='norm2')

    # pool2
    pool2 = tf.nn.max_pool(norm2, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='SAME', name='pool2')

    # local3
    with tf.variable_scope('local3') as scope:
        # Move everything into depth so we can perform a single matrix multiply
        reshape = tf.reshape(pool2, [FLAGS.batch_size, -1])
        dim = reshape.get_shape()[1].value
        weights = _variable_with_weight_decay('weights', shape=[dim, 384], stddev=0.04, wd=0.004)
        biases = _variable_on_cpu('biases', [384], tf.constant_initializer(0.1))
        local3 = tf.nn.relu(tf.matmul(reshape, weights) + biases, name=scope.name)
        _activation_summary(local3)

    # local4
    with tf.variable_scope('local4') as scope:
        weights = _variable_with_weight_decay('weights', shape=[384, 192],
                                              stddev=0.04, wd=0.004)
        biases = _variable_on_cpu('biases', [192], tf.constant_initializer(0.1))
        local4 = tf.nn.relu(tf.matmul(local3, weights) + biases, name=scope.name)
        _activation_summary(local4)

    # softmax_linear
    with tf.variable_scope('softmax_linear') as scope:
        weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES], stddev=1 / 192.0, wd=0.0)
        biases = _variable_on_cpu('biases', [NUM_CLASSES], tf.constant_initializer(0.0))
        softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
        _activation_summary(softmax_linear)

    return softmax_linear


Here, _variable_with_weight_decay() initializes the weights; it also takes a decay coefficient wd, used to compute a weight-decay loss that is added to a collection so the total loss (total_loss) can be assembled later.

def _variable_with_weight_decay(name, shape, stddev, wd):
    """
    Helper to create an initialized Variable with weight decay.
    The variable is initialized with a truncated normal distribution.
    :param stddev: standard deviation of the truncated normal
    :param wd: add L2 loss weight decay multiplied by this float. If None, weight decay is not added for this Variable.
    :return: Variable tensor
    """
    dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
    var = _variable_on_cpu(name, shape, tf.truncated_normal_initializer(stddev=stddev, dtype=dtype))
    if wd is not None:
        weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', weight_decay)
    return var



_variable_on_cpu() creates an initialized variable with the given name and shape on the CPU.

def _variable_on_cpu(name, shape, initializer):
    """
    Helper to create a Variable stored on CPU memory.
    :param name: name of the variable
    :param shape: list of ints
    :param initializer: initializer for the variable
    :return: Variable tensor
    """
    with tf.device('/cpu:0'):
        dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
        var = tf.get_variable(name, shape, initializer=initializer, dtype=dtype)
        return var


[Note 1 — collections]:

TensorFlow collections provide a global storage mechanism that is not affected by variable name scopes: store something once and it can be fetched from anywhere.

(1) tf.Graph.add_to_collection(name, value) stores a value into a collection.

A collection is not a set, so many values can be stored under a single 'name'. tf.add_to_collection(name, value) is the variant that operates on the default graph.

(2) tf.Graph.get_collection(name, scope=None)

Returns the list of values in the collection with the given name. When scope is not None, the resulting list is filtered to include only items whose name attribute matches the scope using re.match; items without a name attribute are never returned. This example doesn't use that parameter, so I'm not entirely sure what scope is for in practice; I'll add more once I run into it.
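As a small illustration, here is a minimal sketch of the collection mechanism (TensorFlow 1.x; the variables and values are made up for the example):

import tensorflow as tf

w1 = tf.Variable(tf.ones([2, 2]), name='w1')
w2 = tf.Variable(tf.ones([3]), name='w2')

# Store several values under the same key; a collection is a list, not a set
tf.add_to_collection('losses', tf.nn.l2_loss(w1))
tf.add_to_collection('losses', tf.nn.l2_loss(w2))

# Fetch everything stored under 'losses' from the default graph, anywhere in the code
losses = tf.get_collection('losses')
total = tf.add_n(losses)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(total))  # l2_loss sums x**2 / 2, so 2.0 + 1.5 = 3.5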


2.2 Computing the loss

A weight decay loss is applied to all learnable variables. The model's objective function is the sum of the cross-entropy loss and all of the weight decay terms.

def loss(logits, labels):
    """
    Add L2 loss to all the trainable variables.
    Add summary for "loss" and "loss/avg".
    :param logits: logits from inference()
    :param labels: labels from distorted_inputs() or inputs(), 1-D tensor of shape [batch_size]
    :return: loss tensor of type float
    """
    # Calculate the average cross-entropy loss across the batch
    labels = tf.cast(labels, tf.int64)
    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits, name='cross_entropy_per_example')
    cross_entropy_mean = tf.reduce_mean(cross_entropy, name='cross_entropy')
    tf.add_to_collection('losses', cross_entropy_mean)

    # The total loss is the cross-entropy loss plus all of the weight decay terms (L2 loss).
    # The L2 norms of the weights were already added to the 'losses' collection;
    # tf.add_n() below sums the cross-entropy loss and those weight decay terms.
    return tf.add_n(tf.get_collection('losses'), name='total_loss')


[Note 2] tf.nn.sparse_softmax_cross_entropy_with_logits(_sentinel=None, labels=None, logits=None, name=None)
Computes sparse softmax cross entropy between labels and logits. The function is meant for tasks in which each sample belongs to exactly one discrete class, as in CIFAR-10; in other words, soft classes are not allowed, and the label vector must supply a single specific class index for each row (sample) of logits. For soft softmax classification, use tf.nn.softmax_cross_entropy_with_logits().

It returns a tensor of the same shape as 'labels', containing the loss for each sample.

==================================================================================================

[Note 3]

tf.add_n(inputs, name=None)

Adds all input tensors element-wise.

It returns a tensor of the same shape as the elements of inputs.

Here it sums the whole list of values stored in the 'losses' collection, i.e. it adds the average cross-entropy loss just computed to the L2 norms of the weights from every layer, giving total_loss.
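To illustrate Notes 2 and 3 together, here is a toy sketch (TensorFlow 1.x; the logits, labels and weight-decay constant are made up) showing the per-sample losses from tf.nn.sparse_softmax_cross_entropy_with_logits() and how tf.add_n() sums the 'losses' collection into total_loss:

import tensorflow as tf

# Toy batch: 2 samples, 3 classes
logits = tf.constant([[2.0, 0.5, 0.1],
                      [0.2, 0.3, 3.0]])
labels = tf.constant([0, 2], dtype=tf.int64)

# One loss value per sample, shape [2]
per_example = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
cross_entropy_mean = tf.reduce_mean(per_example)
tf.add_to_collection('losses', cross_entropy_mean)

# Stand-in for a weight-decay term contributed by some layer
tf.add_to_collection('losses', tf.constant(0.01))

# total_loss = mean cross entropy + all weight-decay terms
total_loss = tf.add_n(tf.get_collection('losses'), name='total_loss')

with tf.Session() as sess:
    print(sess.run(per_example))  # array of shape [2], one loss per sample
    print(sess.run(total_loss))   # scalar: mean cross entropy + 0.01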



2.3 Updating the parameters / train_op

This adds the operations needed to minimize the objective function, including computing gradients and updating the learnable variables. The function ultimately returns a single op (train_op) that performs all of the computation required to train on one batch of images and update the model.


def train(total_loss, global_step):
    """
    Train the CIFAR-10 model.
    Create an optimizer and apply a moving average to all trainable variables.
    :param total_loss: total loss from loss()
    :param global_step: integer Variable counting the number of training steps processed
    :return: train_op: op for training
    """
    # Variables that affect the learning rate
    num_batches_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / FLAGS.batch_size
    decay_steps = int(num_batches_per_epoch * NUM_EPOCH_PER_DECAY)

    # Decay the learning rate exponentially based on the number of steps
    lr = tf.train.exponential_decay(INITIAL_LEARNING_RATE, global_step, decay_steps,
                                    LEARNING_RATE_DECAY_FACTOR, staircase=True)
    tf.summary.scalar('learning_rate', lr)

    # Generate moving averages of all losses and associated summaries
    loss_averages_op = _add_loss_summaries(total_loss)

    # Compute gradients
    with tf.control_dependencies([loss_averages_op]):
        opt = tf.train.GradientDescentOptimizer(lr)
        grads = opt.compute_gradients(total_loss)

    # Apply gradients
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)
    # This is the second part of `minimize()`. It returns an `Operation` that applies gradients.

    # Add histograms for gradients
    for grad, var in grads:
        if grad is not None:
            tf.summary.histogram(var.op.name + '/gradients', grad)

    # Track the moving averages of all trainable variables
    variable_averages = tf.train.ExponentialMovingAverage(
        MOVING_AVERAGE_DECAY, global_step)
    variables_averages_op = variable_averages.apply(tf.trainable_variables())

    with tf.control_dependencies([apply_gradient_op, variables_averages_op]):
        train_op = tf.no_op(name='train')

    return train_op


First the learning rate is set up; here it decays over the course of training.

[Note 4] tf.train.exponential_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False, name=None)

Applies exponential decay to the learning rate: decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)

Arguments:

learning_rate: the initial learning rate (float)

global_step: must not be negative; used in the decay computation. Here it is an integer Variable counting the number of training steps processed.

decay_steps: must be positive; here it is the number of batches per epoch multiplied by the number of epochs between decays.

staircase: if True, global_step / decay_steps is an integer division, so the learning rate decays in discrete jumps (a staircase function).

It returns the decayed learning rate, which is then recorded via tf.summary.scalar() as a scalar summary named 'learning_rate' so it can be monitored.
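A quick sketch of the staircase behaviour (TensorFlow 1.x; the numbers are toy values chosen for the example):

import tensorflow as tf

global_step = tf.placeholder(tf.int32)
lr = tf.train.exponential_decay(learning_rate=0.1,
                                global_step=global_step,
                                decay_steps=100,
                                decay_rate=0.1,
                                staircase=True)

with tf.Session() as sess:
    for step in [0, 50, 99, 100, 250]:
        # With staircase=True the rate stays at 0.1 until step 100, then drops to 0.01, and so on
        print(step, sess.run(lr, feed_dict={global_step: step}))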

============================================================================================

_add_loss_summaries()

Maintains a moving average of every loss stored in the 'losses' collection (plus the total loss), and records both the raw and the averaged values as scalar summaries.

It returns an op that updates the moving averages of the losses.

def _add_loss_summaries(total_loss):
    """
    Add summaries for losses in the CIFAR-10 model.
    Generates a moving average for all losses and associated summaries for
    visualizing the performance of the network.
    :param total_loss: total loss from loss()
    :return: loss_averages_op: op for generating moving averages of losses
    """
    # Compute the moving average of all individual losses and the total loss.
    # A moving average over a window of fixed size k is the average of items 1..k,
    # then items 2..k+1, then items 3..k+2, and so on.
    loss_averages = tf.train.ExponentialMovingAverage(0.9, name='avg')
    losses = tf.get_collection('losses')
    loss_averages_op = loss_averages.apply(losses + [total_loss])

    # Attach a scalar summary to each individual loss and to the total loss;
    # do the same for the averaged version of the losses.
    for l in losses + [total_loss]:
        tf.summary.scalar(l.op.name + '(raw)', l)
        tf.summary.scalar(l.op.name, loss_averages.average(l))

    return loss_averages_op


[Note 6] loss_averages = tf.train.ExponentialMovingAverage()

This creates an ExponentialMovingAverage object.

When training a model, it pays to maintain moving averages of the trained parameters; evaluating with the averaged values can give better results. Here mainly the apply() method is used, so that is the one described:

  • __init__(self, decay, num_updates=None, zero_debias=False, name='ExponentialMovingAverage')

  • apply(self, var_list=None)

Maintains moving averages of variables. The method adds a shadow copy of each trained variable, plus an op that keeps the moving averages of the variables up to date in those shadow copies; this op is typically run after every training step.

It returns that op. Note that apply() can be called multiple times, each time with a different list of variables.
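A minimal sketch of ExponentialMovingAverage.apply() (TensorFlow 1.x; the variable and the decay value are just for illustration):

import tensorflow as tf

v = tf.Variable(0.0)
ema = tf.train.ExponentialMovingAverage(decay=0.9)

# apply() creates a shadow copy of v and returns the op that updates it
maintain_avg_op = ema.apply([v])

update_v = tf.assign(v, 10.0)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(update_v)
    sess.run(maintain_avg_op)
    # shadow = decay * shadow + (1 - decay) * v = 0.9 * 0 + 0.1 * 10 = 1.0
    print(sess.run(ema.average(v)))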


[Note 7] with tf.control_dependencies(control_inputs):

control_inputs: a list of ops or tensor objects that must be executed or computed before the operations defined inside the context, establishing a dependency between them.

In this example, once the moving-average op (loss_averages_op) is obtained, a dependency is set up with the gradient computation: the moving averages of the losses are updated first, and only then is gradient descent run with that loss as the objective.
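A small sketch of this dependency pattern (TensorFlow 1.x; the ops here are stand-ins for loss_averages_op and the gradient computation):

import tensorflow as tf

counter = tf.Variable(0)
first_op = tf.assign_add(counter, 1)   # stands in for loss_averages_op

with tf.control_dependencies([first_op]):
    # Any op created inside this block only runs after first_op has run
    read_after = tf.identity(counter)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Running read_after forces first_op to run first, so the result is 1, not 0
    print(sess.run(read_after))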


[Note 8] tf.train.GradientDescentOptimizer()

Gradient descent is also done through a class object, a GradientDescentOptimizer; the methods used are:

  • __init__(self, learning_rate, use_locking=False, name='GradientDescent')
Creates a new gradient descent optimizer.
  • compute_gradients(self, loss, var_list=None, gate_gradients=1, aggregation_method=None, colocate_gradients_with_ops=False, grad_loss=None)
Computes the gradients of loss with respect to the variables in 'var_list'; by default these are all trainable variables in the graph. Note that a 'gradient' can be a tensor, or None if there is no gradient for a given variable.
Returns a list of (gradient, variable) pairs; the variable is always present, while the gradient can be None.

  • apply_gradients(self, grads_and_vars, global_step=None, name=None)
Applies the gradients to the variables, i.e. performs the parameter update.
Returns an op that applies the gradient update.
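A minimal sketch of the compute_gradients()/apply_gradients() pair, which together are equivalent to minimize() (TensorFlow 1.x; the variable and loss are toy values):

import tensorflow as tf

w = tf.Variable(3.0)
loss = tf.square(w)              # d(loss)/dw = 2w

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
grads_and_vars = opt.compute_gradients(loss)       # list of (gradient, variable) pairs
train_step = opt.apply_gradients(grads_and_vars)   # op performing w <- w - lr * grad

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_step)
    print(sess.run(w))  # 3.0 - 0.1 * 6.0 = 2.4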


Finally, for every gradient/variable pair in grads, a histogram summary is created, and then a moving-average op over the trainable variables is created.
Lastly, an overall train_op is created with dependencies on the gradient-update op and the variable moving-average op, so both must run before train_op does; the session therefore only needs to run train_op to trigger the other two ops. With that, the model definition is complete.

3. Training

Global parameters

from datetime import datetime
import time
import tensorflow as tf
from CIFAR10 import model_build

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('train_dir', 'E:/Python/tensorflow/CIFAR10',
                           """Directory where to write event logs and checkpoints.""")
tf.app.flags.DEFINE_integer('max_steps', 100000,
                            """Number of batches to run.""")
tf.app.flags.DEFINE_boolean('log_device_placement', False,
                            """Whether to log device placement.""")
tf.app.flags.DEFINE_integer('log_frequency', 10,
                            """How often to log results to the console.""")

The train function

def train1():
    with tf.Graph().as_default():
        global_step = tf.contrib.framework.get_or_create_global_step()
        # use the default graph of the process within this context
        # global_step = tf.Variable(0, name='global_step', trainable=False)

        # Get images and labels
        images, labels = model_build.distorted_inputs()

        # Build a graph that computes the logits predictions (forward pass)
        logits = model_build.inference(images)

        # Calculate loss
        loss = model_build.loss(logits, labels)

        # Build a graph that trains the model with one batch of examples and updates the parameters
        train_op = model_build.train(loss, global_step)

        # Define a _LoggerHook class and register it with mon_sess below
        class _LoggerHook(tf.train.SessionRunHook):
            """Logs loss and runtime."""

            def begin(self):
                self._step = -1
                self._start_time = time.time()

            def before_run(self, run_context):
                # Called before each call to run().
                # Returning a 'SessionRunArgs' object adds ops or tensors to the upcoming run();
                # these ops/tensors are run together with the ones already passed to run().
                # The run() arguments can also include things you want to feed.
                # The 'run_context' argument carries information about the upcoming run():
                # the original ops and tensors.
                # Once this function has run, the graph is finalized and no more ops can be added.
                self._step += 1
                return tf.train.SessionRunArgs(loss)  # Asks for loss value

            def after_run(self, run_context, run_values):
                # Called after each call to run().
                # The 'run_values' argument contains results of the ops/tensors requested in before_run().
                # The 'run_context' argument is the same one passed to before_run().
                # 'run_context.request_stop()' can be called to stop the iteration.
                if self._step % FLAGS.log_frequency == 0:  # every FLAGS.log_frequency batches
                    current_time = time.time()
                    duration = current_time - self._start_time
                    self._start_time = current_time

                    loss_value = run_values.results
                    examples_per_sec = FLAGS.log_frequency * FLAGS.batch_size / duration
                    sec_per_batch = float(duration / FLAGS.log_frequency)

                    format_str = ('%s:step %d,loss=%.2f (%.1f examples/sec; %.3f' 'sec/batch')
                    print(format_str % (datetime.now(), self._step, loss_value,
                                        examples_per_sec, sec_per_batch))

        with tf.train.MonitoredTrainingSession(
            # Sets up the proper session initializer/restorer; it also creates hooks
            # related to checkpoint and summary saving
            checkpoint_dir=FLAGS.train_dir,
            hooks=[tf.train.StopAtStepHook(last_step=FLAGS.max_steps),
                   tf.train.NanTensorHook(loss),
                   _LoggerHook()],
            config=tf.ConfigProto(
                log_device_placement=FLAGS.log_device_placement)) as mon_sess:
            while not mon_sess.should_stop():
                mon_sess.run(train_op)
                # Keep running train_op, updating the model parameters, until the stop condition is reached


def main(argv=None):
    train1()


if __name__ == '__main__':
    tf.app.run(main=main)


First, with tf.Graph().as_default() puts all operations under the default graph; wrapping everything in this with block means all of the following ops are added to that graph.
If you create a new thread and want it to add ops to the same graph, you must also wrap its code in "with g.as_default()".
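For example (a small sketch with made-up constants; the same pattern applies when ops are created in another thread):

import tensorflow as tf

g = tf.Graph()
with g.as_default():
    a = tf.constant(1.0)    # added to graph g

b = tf.constant(2.0)        # added to the process-wide default graph instead

print(a.graph is g)          # True
print(b.graph is g)          # False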

Then the input function from model_build is called, followed by the inference function, the loss function, and the train function, which finally yields train_op.
A _LoggerHook class is then defined, which inherits from tf.train.SessionRunHook.

[Note 9] The tf.train.SessionRunHook class
My rough understanding is that this is a hook object that waits for MonitoredSession.run() to invoke it.
The class has the following methods:
  • after_create_session(self, session, coord)
Called when a new TensorFlow session is created. By the time it is called the graph is fixed and ops can no longer be added to it.
This method is also called when a wrapped session is restored.
Args:
       session: A TensorFlow Session that has been created.
        coord: A Coordinator object which keeps track of all threads.
  • after_run(self, run_context, run_values)
Called after each call to run(). The 'run_values' argument contains the results of the ops/tensors requested in before_run().
The 'run_context' argument is the same one passed to before_run().

Args:
         run_context: A `SessionRunContext` object.
         run_values: A SessionRunValues object.

   Here, the if condition fires when the step count is a multiple of FLAGS.log_frequency (10), i.e. every 10 batches: the current time is recorded and the elapsed time computed. Because before_run() added loss to session.run(), run_values.results here is the loss value, which is then printed.
  • before_run(self, run_context)
Called before each call to run(). You can return a 'SessionRunArgs' object from this function to indicate the ops and tensors to be added to the upcoming run() call.
The 'run_context' argument provides information about that upcoming run() call.

Here the step count is incremented by 1, and a SessionRunArgs object is returned so that loss is added to session.run().
  • begin(self)
Called before the session is used. A hook can modify the graph here by adding new ops; after begin() the graph is finalized and other callbacks can no longer change it.
    Here begin() sets the step count to -1 and records a start time.
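A stripped-down hook, to show the life cycle (a sketch assuming TensorFlow 1.x; the CountHook class and its stop condition are made up for illustration):

import tensorflow as tf

class CountHook(tf.train.SessionRunHook):
    def begin(self):
        # The graph can still be modified here; after begin() it is finalized
        self._step = 0

    def before_run(self, run_context):
        # Nothing extra to fetch; just count how many times run() is called
        self._step += 1
        return None

    def after_run(self, run_context, run_values):
        if self._step >= 3:
            run_context.request_stop()   # ask the monitored session to stop

x = tf.constant(1)
with tf.train.MonitoredSession(hooks=[CountHook()]) as sess:
    while not sess.should_stop():
        sess.run(x)   # runs three times, then the hook requests a stop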

=======================================================================================
Finally, a tf.train.MonitoredTrainingSession() is created.

[Note 10]
tf.train.MonitoredTrainingSession(master='', is_chief=True, checkpoint_dir=None, scaffold=None, hooks=None, chief_only_hooks=None, save_checkpoint_secs=600, save_summaries_steps=100, config=None)

Creates a MonitoredSession for training.
For a chief, it sets up the proper session initializer/restorer, and it also creates hooks related to checkpoint and summary saving.
For a worker, it waits for the chief to initialize/restore the session.
Args:
master: `String` the TensorFlow master to use.

is_chief: If `True`, it takes care of initializing and restoring the underlying session. If `False`, it waits for a chief to initialize or restore the session.

checkpoint_dir: A string. The directory in which checkpoints are stored.

scaffold: A `Scaffold` used for gathering or building supportive ops. If not specified, a default one is created. It's used to finalize the graph.

hooks: Optional list of `SessionRunHook` objects.
Here this is set to [tf.train.StopAtStepHook(last_step=FLAGS.max_steps), tf.train.NanTensorHook(loss), _LoggerHook()]
  • tf.train.StopAtStepHook() is a hook that monitors the step count and requests a stop at a particular step: either after a given number of steps or when a specified last step is reached. Its constructor is __init__(self, num_steps=None, last_step=None), so setting last_step to the maximum number of steps means training stops once that step is reached.
  • tf.train.NanTensorHook() is a hook that monitors the loss and requests a stop when the loss becomes NaN.
  • _LoggerHook() is the hook we defined ourselves, which fetches the loss, records timings, prints progress, and so on.

chief_only_hooks: list of `SessionRunHook` objects. Activate these hooks if `is_chief==True`, ignore otherwise.

save_checkpoint_secs: how often checkpoints are saved, in seconds; if set to None, checkpoints are not saved.

save_summaries_steps: how often summaries are written, in steps.

config: an instance of `tf.ConfigProto` used to configure the session. It's the `config` argument of the constructor of `tf.Session`.

    Returns:
      A `MonitoredSession` object.
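Putting it together, typical usage is only a few lines (a sketch; dummy_train_op stands in for the real train_op, the checkpoint path and step limit are placeholders, and summary saving is turned off because this toy graph has no summaries):

import tensorflow as tf

global_step = tf.contrib.framework.get_or_create_global_step()
dummy_train_op = tf.assign_add(global_step, 1)   # placeholder for the real train_op

with tf.train.MonitoredTrainingSession(
        checkpoint_dir='/tmp/cifar10_demo',                 # where checkpoints go
        hooks=[tf.train.StopAtStepHook(last_step=100)],     # stop after 100 steps
        save_checkpoint_secs=60,
        save_summaries_steps=None) as mon_sess:
    while not mon_sess.should_stop():
        mon_sess.run(dummy_train_op)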



Finally, the output:

Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2017-04-16 20:04:10.826531:step 0,loss=6.39 (25.3 examples/sec; 5.056sec/batch
2017-04-16 20:04:36.614833:step 10,loss=6.22 (49.6 examples/sec; 2.579sec/batch
2017-04-16 20:05:01.745663:step 20,loss=6.10 (50.9 examples/sec; 2.513sec/batch
2017-04-16 20:05:27.068144:step 30,loss=6.01 (50.5 examples/sec; 2.532sec/batch
Because I trained on a CPU, the speed is far from the GPU numbers given in the official tutorial.


