CNTK API Documentation Translation (22): Sampled Softmax


In classification and prediction problems, a typical training criterion is the cross-entropy cost with softmax. If the number of output classes is very large, evaluating this criterion and the corresponding gradients can become very expensive. Sampled softmax is one way to speed up training in this situation.

Selecting the runtime environment

Before we get into the problem itself, we import the required libraries and do some environment setup.

# Use a function definition from future version (say 3.x from 2.7 interpreter)
from __future__ import print_function
from __future__ import division

import os
import cntk as C
import cntk.tests.test_utils

# (only needed for our build system)
cntk.tests.test_utils.set_device_from_pytest_env()

# fix a random seed for CNTK components
C.cntk_py.set_fixed_random_seed(1)

Basics

The softmax function is used in neural networks when we want to interpret the network output as a probability distribution over a set of classes C of size N_C.

Softmax maps an N_C-dimensional vector z with arbitrary values to an N_C-dimensional vector p whose entries are non-negative and sum to 1, so they can be interpreted as probabilities. In formulas:

p_i = softmax(z, i) = exp(z_i) / Σ_{k ∈ C} exp(z_k)

In this formula we assume that the input z of the softmax is computed from some hidden vector h of dimension N_h, for example:

z = Wh + b

where W is a learnable weight matrix of shape (N_C, N_h) and b is a learnable bias vector. We fix this specific form of z here because it allows a very efficient implementation of sampled softmax.

In a typical real-world case, e.g. a recurrent language model, the hidden vector h would be the output of the recurrent layers and C would be the set of words to predict.

As the training criterion we use the cross-entropy cost, which is a function of the expected (true) class t ∈ C and the probability predicted for it:

cross_entropy := -log(p_t)
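As a quick illustration of these formulas, here is a minimal numpy sketch of our own (with made-up dimensions, not part of the original tutorial code):

import numpy as np

N_C, N_h = 5, 3                        # made-up sizes: 5 classes, hidden dimension 3
W = np.random.randn(N_C, N_h)          # weight matrix of shape (N_C, N_h)
b = np.random.randn(N_C)               # bias vector
h = np.random.randn(N_h)               # hidden vector

z = W.dot(h) + b                       # z = Wh + b
p = np.exp(z - np.max(z))              # subtract the max for numerical stability
p = p / p.sum()                        # softmax probabilities: non-negative, sum to 1

t = 2                                  # index of the true class
cross_entropy = -np.log(p[t])          # cross_entropy := -log(p_t)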

Sampled softmax from the outside

For the ordinary (full) softmax, the CNTK Python API provides the function cross_entropy_with_softmax. It takes z as input, which, as mentioned above, is computed as z = Wh + b. In our sampled-softmax implementation, this computation of z is part of the function itself.

Below we show the code for cross_entropy_with_sampled_softmax_and_embedding. Let us first look at how it is declared.

The most fundamental difference from the function in the Python API is that in the API function the input vector z must have the same dimension as the target vector, while in our function the input is the hidden vector h, which may have any dimension. In practice it will usually be much smaller than the target vector.

We also have the parameters num_samples, sampling_weights and allow_duplicates that control the random sampling. Another difference from the API function is that we return a triple (z, cross_entropy_on_samples, error_on_samples).

After the code we will go through the implementation details.

# Creates a subgraph computing cross-entropy with sampled softmax.
def cross_entropy_with_sampled_softmax_and_embedding(
    hidden_vector,
    target_vector,
    num_classes,
    hidden_dim,
    num_samples,
    sampling_weights,
    allow_duplicates = True
    ):

    # define the learnable parameters
    b = C.Parameter(shape = (num_classes, 1), init = 0)
    W = C.Parameter(shape = (num_classes, hidden_dim), init = C.glorot_uniform())

    # Define the node that generates a set of random samples per minibatch
    # Sparse matrix (num_samples * num_classes)
    sample_selector = C.random_sample(sampling_weights, num_samples, allow_duplicates)

    # For each of the samples we also need the probability that it is in the sampled set.
    # dense row [1 * vocab_size]
    inclusion_probs = C.random_sample_inclusion_frequency(sampling_weights, num_samples, allow_duplicates)
    # dense row [1 * num_classes]
    log_prior = C.log(inclusion_probs)

    # Create a submatrix wS of 'weights'
    # [num_samples * hidden_dim]
    W_sampled = C.times(sample_selector, W)
    z_sampled = C.times_transpose(W_sampled, hidden_vector) + C.times(sample_selector, b) - C.times_transpose(sample_selector, log_prior) # [num_samples]

    # Getting the weight vector for the true label. Dimension hidden_dim
    # [1 * hidden_dim]
    W_target = C.times(target_vector, W)
    z_target = C.times_transpose(W_target, hidden_vector) + C.times(target_vector, b) - C.times_transpose(target_vector, log_prior) # [1]

    z_reduced = C.reduce_log_sum_exp(z_sampled)

    # Compute the cross entropy that is used for training.
    # We don't check whether any of the classes in the random samples coincides with the true label, so it might
    # happen that the true class is counted twice in the normalising denominator of sampled softmax.
    cross_entropy_on_samples = C.log_add_exp(z_target, z_reduced) - z_target

    # For applying the model we also output a node providing the input for the full softmax
    z = C.times_transpose(W, hidden_vector) + b
    z = C.reshape(z, shape = (num_classes))

    zSMax = C.reduce_max(z_sampled)
    error_on_samples = C.less(z_target, zSMax)

    return (z, cross_entropy_on_samples, error_on_samples)
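As a reading aid for the key lines above (this reformulation is ours, derived from the code rather than taken from the original comments): with z_reduced = log Σ_{s ∈ samples} exp(z_s), the training criterion computed above is

cross_entropy_on_samples = log(exp(z_target) + exp(z_reduced)) - z_target
                         = -log( exp(z_target) / (exp(z_target) + Σ_{s ∈ samples} exp(z_s)) )

i.e. the cross-entropy of a softmax that is normalised only over the true class and the drawn samples instead of over all N_C classes. Subtracting log_prior (the log of the inclusion probabilities) from both z_target and z_sampled compensates for classes with a high sampling weight showing up more often in this reduced denominator.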

To give the reader an intuition for how the inputs and outputs of our function differ from those of the ordinary softmax, we also implement the ordinary (full) softmax in the same style as the code above.

# Creates subgraph computing cross-entropy with (full) softmax.
def cross_entropy_with_softmax_and_embedding(
    hidden_vector,  # Node providing hidden input
    target_vector,  # Node providing the expected labels (as sparse vectors)
    num_classes,    # Number of classes
    hidden_dim      # Dimension of the hidden vector
    ):

    # Setup bias and weights
    b = C.Parameter(shape = (num_classes, 1), init = 0)
    W = C.Parameter(shape = (num_classes, hidden_dim), init = C.glorot_uniform())

    z = C.reshape(C.times_transpose(W, hidden_vector) + b, (1, num_classes))

    # Use cross_entropy_with_softmax
    cross_entropy = C.cross_entropy_with_softmax(z, target_vector)

    zMax = C.reduce_max(z)
    zT = C.times_transpose(z, target_vector)
    error_on_samples = C.less(zT, zMax)

    return (z, cross_entropy, error_on_samples)

You can see the differences between our implementation and the function provided by the API:

  • Our function includes the computation z = Wh + b.
  • We return a triple (z, cross_entropy, error_on_samples) instead of only the cross-entropy.

A toy example

To show a complete end-to-end use of sampled softmax, let us look at a toy example. The input one-hot vectors are first mapped to a low-dimensional vector h by a fixed random projection. The task is then to recover this mapping using (sampled) softmax. Of course, this is just an illustrative example with no practical relevance.

import numpy as np
from math import log, exp, sqrt
from cntk.logging import ProgressPrinter
import timeit

# A class with all parameters
class Param:
    # Learning parameters
    learning_rate = 0.03
    minibatch_size = 100
    num_minbatches = 100
    test_set_size = 1000
    momentum_time_constant = 5 * minibatch_size
    reporting_interval = 10
    allow_duplicates = False

    # Parameters for sampled softmax
    use_sampled_softmax = True
    use_sparse = True
    softmax_sample_size = 10

    # Details of data and model
    num_classes = 50
    hidden_dim = 10

data_sampling_distribution = lambda: np.repeat(1.0 / Param.num_classes, Param.num_classes)
softmax_sampling_weights = lambda: np.repeat(1.0 / Param.num_classes, Param.num_classes)

# Creates random one-hot vectors of dimension 'num_classes'.
# Returns a tuple with a list of one-hot vectors, and list with the indices they encode.
def get_random_one_hot_data(num_vectors):
    indices = np.random.choice(
        range(Param.num_classes),
        size=num_vectors,
        p = data_sampling_distribution()).reshape((num_vectors, 1))
    list_of_vectors = C.Value.one_hot(indices, Param.num_classes)
    return (list_of_vectors, indices.flatten())

# Create a network that:
# * Transforms the input one hot-vectors with a constant random embedding
# * Applies a linear decoding with parameters we want to learn
def create_model(labels):
    # random projection matrix
    random_data = np.random.normal(scale = sqrt(1.0/Param.hidden_dim), size=(Param.num_classes, Param.hidden_dim)).astype(np.float32)
    random_matrix = C.constant(shape = (Param.num_classes, Param.hidden_dim), value = random_data)

    h = C.times(labels, random_matrix)

    # Connect the latent output to (sampled/full) softmax.
    if Param.use_sampled_softmax:
        sampling_weights = np.asarray(softmax_sampling_weights(), dtype=np.float32)
        sampling_weights.reshape((1, Param.num_classes))
        softmax_input, ce, errs = cross_entropy_with_sampled_softmax_and_embedding(
            h,
            labels,
            Param.num_classes,
            Param.hidden_dim,
            Param.softmax_sample_size,
            softmax_sampling_weights(),
            Param.allow_duplicates)
    else:
        softmax_input, ce, errs = cross_entropy_with_softmax_and_embedding(
            h,
            labels,
            Param.num_classes,
            Param.hidden_dim)

    return softmax_input, ce, errs

def train(do_print_progress):
    labels = C.input_variable(shape = Param.num_classes, is_sparse = Param.use_sparse)
    z, cross_entropy, errs = create_model(labels)

    # Setup the trainer
    learning_rate_schedule = C.learning_rate_schedule(Param.learning_rate, C.UnitType.sample)
    momentum_schedule = C.momentum_as_time_constant_schedule(Param.momentum_time_constant)
    learner = C.momentum_sgd(z.parameters, learning_rate_schedule, momentum_schedule, True)
    progress_writers = None
    if do_print_progress:
        progress_writers = [ProgressPrinter(freq=Param.reporting_interval, tag='Training')]
    trainer = C.Trainer(z, (cross_entropy, errs), learner, progress_writers)

    minbatch = 0
    average_cross_entropy = compute_average_cross_entropy(z)
    # store minibatch values
    minbatch_data = [0]
    # store cross_entropy values
    cross_entropy_data = [average_cross_entropy]

    # Run training
    t_total = 0
    for minbatch in range(1, Param.num_minbatches):

        # Specify the mapping of input variables in the model to actual minibatch data to be trained with
        label_data, indices = get_random_one_hot_data(Param.minibatch_size)
        arguments = ({labels : label_data})

        # If do_print_progress is True, this will automatically print the progress using ProgressPrinter
        # The printed loss numbers are computed using the sampled softmax criterion
        t_start = timeit.default_timer()
        trainer.train_minibatch(arguments)
        t_end = timeit.default_timer()

        t_delta = t_end - t_start
        samples_per_second = Param.minibatch_size / t_delta

        # We ignore the time measurements of the first two minibatches
        if minbatch > 2:
            t_total += t_delta

        # For comparison also print result using the full criterion
        if minbatch % Param.reporting_interval == int(Param.reporting_interval/2):
            # memorize the progress data for plotting
            average_cross_entropy = compute_average_cross_entropy(z)
            minbatch_data.append(minbatch)
            cross_entropy_data.append(average_cross_entropy)

            if do_print_progress:
                print("\nMinbatch=%d Cross-entropy from full softmax = %.3f perplexity = %.3f samples/s = %.1f"
                    % (minbatch, average_cross_entropy, exp(average_cross_entropy), samples_per_second))

    # Number of samples we measured. First two minbatches were ignored
    samples_measured = Param.minibatch_size * (Param.num_minbatches - 2)
    overall_samples_per_second = samples_measured / t_total

    return (minbatch_data, cross_entropy_data, overall_samples_per_second)

def compute_average_cross_entropy(softmax_input):
    vectors, indices = get_random_one_hot_data(Param.test_set_size)
    total_cross_entropy = 0.0
    arguments = (vectors)
    z = softmax_input.eval(arguments).reshape(Param.test_set_size, Param.num_classes)

    for i in range(len(indices)):
        log_p = log_softmax(z[i], indices[i])
        total_cross_entropy -= log_p

    return total_cross_entropy / len(indices)

# Computes log(softmax(z,index)) for a one-dimensional numpy array z in a numerically stable way.
def log_softmax(z,    # numpy array
                index # index into the array
            ):
    max_z = np.max(z)
    return z[index] - max_z - log(np.sum(np.exp(z - max_z)))

np.random.seed(1)

print("start...")
train(do_print_progress = True)
print("done.")

In the code above we use two different ways to report training progress:

  1. A function that computes the average cross-entropy on the full softmax.
  2. The built-in CNTK ProgressPrinter.

ProgressPrinter shows how the value of the training criterion evolves during training. In this example the criterion is the cross-entropy cost from sampled softmax.

However, since the numbers reported by ProgressPrinter are computed only on the sampled classes, we cannot rely on them when we want to compare different sampling strategies; that is why we also report the cross-entropy computed on the full softmax.

Importance sampling

Typically the data is not uniformly distributed across the classes. A classic example is when the output classes are words: the word 'the' occurs far more frequently than most other words.

In such cases, instead of a uniform distribution, one typically draws the softmax samples from a distribution that gives frequent classes a higher weight; this is also called importance sampling. In the code below, the sampling distribution is controlled by the array softmax_sampling_weights.

As an example we choose a Zipf-like distribution:

p[i] ∝ 1 / (i + 5)
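For illustration (these numbers are computed here, not taken from the tutorial), the first and last unnormalized weights for our 50 classes are:

weight of class 0:  1 / (0 + 5)  = 0.200
weight of class 1:  1 / (1 + 5)  ≈ 0.167
weight of class 49: 1 / (49 + 5) ≈ 0.019

so low-index classes are generated and sampled far more often than high-index ones; in the code below the weights are normalized to sum to 1 only where a proper probability distribution is required (data_sampling_distribution).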

The following code switches the data from the uniform distribution to the Zipf distribution and compares uniform sampling against Zipf (importance) sampling:

# We want to plot the data
import matplotlib.pyplot as plt

# Define weights of zipfian distribution
def zipf(index):
    return 1.0 / (index + 5)

# Use zipfian distribution for the classes
def zipf_sampling_weights():
    return np.asarray([ zipf(i) for i in range(Param.num_classes)], dtype=np.float32)

data_sampling_distribution = lambda: zipf_sampling_weights() / np.sum(zipf_sampling_weights())

print("start...")

# Train using uniform sampling (like before)
np.random.seed(1)
softmax_sampling_weights = lambda: np.repeat(1.0/Param.num_classes, Param.num_classes)
minibatch_data, cross_entropy_data, _ = train(do_print_progress = False)

# Train using importance sampling
np.random.seed(1)
softmax_sampling_weights = zipf_sampling_weights
minibatch_data2, cross_entropy_data2, _ = train(do_print_progress = False)

plt.plot(minibatch_data, cross_entropy_data, 'r--',
         minibatch_data, cross_entropy_data2, 'b--')
plt.xlabel('number of mini-batches')
plt.ylabel('cross entropy')
plt.show()

[Figure: cross-entropy over the number of mini-batches, uniform sampling (red) vs. Zipf sampling (blue)]

In the example above we compare uniform sampling (red) with Zipf sampling (blue). You will need to experiment to find the sampled-softmax parameters that work best for your own setup.

What speed-ups to expect

How much faster training with sampled softmax is compared to the full softmax depends on the concrete settings, for example:

  • The number of classes. The speed-up typically grows as the number of output classes increases.
  • The number of samples used in sampled softmax.
  • The dimension of the hidden input h.
  • The minibatch size.
  • The hardware.

In addition, you need to test how far you can reduce the sample size without degrading the training result.

print("start...")# Reset parametersclass Param:    # Learning parameters    learning_rate = 0.03    minibatch_size = 8    num_minbatches = 100    test_set_size = 1 # we are only interrested in speed    momentum_time_constant = 5 * minibatch_size    # Switch off reporting to speed up    reporting_interval = 1000000     allow_duplicates = False    # Parameters for sampled softmax    use_sampled_softmax = True    use_sparse = True    softmax_sample_size = 10    # Details of data and model    num_classes = 50000    hidden_dim = 10data_sampling_distribution = lambda: np.repeat(1.0 / Param.num_classes, Param.num_classes)softmax_sampling_weights = lambda: np.repeat(1.0 / Param.num_classes, Param.num_classes)sample_sizes = [5, 10, 100, 1000]speed_with_sampled_softmax = []# Get the speed with sampled softmax for different sizesfor sample_size in sample_sizes:     print("Measuring speed of sampled softmax for sample size %d ..." % (sample_size))    Param.use_sampled_softmax = True    Param.softmax_sample_size = sample_size    _, _,  samples_per_second = train(do_print_progress = False)    speed_with_sampled_softmax.append(samples_per_second)# Get the speed with full softmaxParam.use_sampled_softmax = Falseprint("Measuring speed of full softmax ...")_, _,  samples_per_second = train(do_print_progress = False)speed_without_sampled_softmax = np.repeat(samples_per_second, len(sample_sizes))# Plot the speed of sampled softmax (blue) as a function of sample sizes# and compare it to the speed with full softmax (red).    plt.plot(sample_sizes, speed_without_sampled_softmax, 'r--',sample_sizes, speed_with_sampled_softmax, 'b--')plt.xlabel('softmax sample size')plt.ylabel('speed: instances / second')plt.title("Speed 'sampled softmax' (blue) vs. 'full softmax' (red)")plt.ylim(ymin=0)plt.show()

[Figure: training speed in samples per second as a function of the softmax sample size, sampled softmax (blue) vs. full softmax (red)]

