CNTK API Documentation Translation (22): Sampled Softmax
Source: Internet. Editor: 程序博客网. Time: 2024/06/06 11:50
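For classification and prediction problems, a typical criterion function is cross entropy with softmax. If the number of output classes is large, computing the criterion and the corresponding parameter updates can become very expensive. Sampled softmax is one way to speed up training in such cases.
Selecting the runtime environment
Before getting into the problem, we import the required libraries and do some environment setup.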
# Use a function definition from future version (say 3.x from 2.7 interpreter)
from __future__ import print_function
from __future__ import division

import os

import cntk as C
import cntk.tests.test_utils

# (only needed for our build system)
cntk.tests.test_utils.set_device_from_pytest_env()
# fix a random seed for CNTK components
C.cntk_py.set_fixed_random_seed(1)
Basics
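The softmax function is used in neural networks when we want to interpret the network output as a probability distribution over a set of classes $C$ of size $|C| = N_C$.
Softmax maps an $N_C$-dimensional vector $z$ with unrestricted values to an $N_C$-dimensional vector $p$ of non-negative values that sum to 1, so that they can be interpreted as probabilities:
$$p_i = \mathrm{softmax}(z, i) = \frac{\exp(z_i)}{\sum_{k \in C} \exp(z_k)}$$
In this formula we assume that the softmax input $z$ is computed from a hidden vector $h$ of dimension $N_h$ as follows:
$$z = W h + b$$
where $W$ is a learnable weight matrix of size $(N_C, N_h)$ and $b$ is a learnable bias vector.
In a typical use case such as a recurrent language model, the hidden vector $h$ would be the output of the recurrent layers and $C$ would be the set of words to predict.
As the training criterion we use cross entropy, which is a function of the expected (true) class $t \in C$ and the probability predicted for it:
$$\mathrm{cross\_entropy} := -\log(p_t)$$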
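As a minimal sketch of the definitions above, the following plain-Python snippet computes z = W h + b, applies softmax, and evaluates the cross entropy for a true class t. The helper names (linear, softmax, cross_entropy) and the toy values of W, b and h are made up for illustration; no CNTK is involved.

```python
import math

def linear(W, h, b):
    """Compute z = W h + b for a weight matrix W (list of rows), vector h and bias b."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i
            for row, b_i in zip(W, b)]

def softmax(z):
    """Map unrestricted scores z to non-negative values that sum to 1."""
    z_max = max(z)                       # shift by the maximum for numerical stability
    exps = [math.exp(z_i - z_max) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, t):
    """Cross entropy -log(p_t) for the true class index t."""
    return -math.log(softmax(z)[t])

# Hypothetical toy values: 3 classes, hidden dimension 2.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, -1.0]
h = [2.0, 1.0]

z = linear(W, h, b)
p = softmax(z)
loss = cross_entropy(z, t=0)
print(z, p, loss)
```

Note that every class contributes to the denominator of the softmax, which is why the cost of the criterion grows with the number of classes.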
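Sampled softmax from the outside
For the ordinary softmax, the CNTK Python API provides the function cross_entropy_with_softmax. It takes the $N_C$-dimensional vector $z$ as input. In our sampled softmax implementation, the mapping $z = W h + b$ is part of the criterion function itself.
Below we show the code of cross_entropy_with_sampled_softmax_and_embedding. Let us first look at its signature.
The most basic difference from the function provided by the Python API is that in the API function the input vector corresponds to $z$ and must have the same dimension as the target vector, whereas in our function the input corresponds to the hidden vector $h$ and can have any dimension (hidden_dim).
We also have the additional parameters num_samples, sampling_weights and allow_duplicates, which control the random sampling. Another difference from the API function is that we return a tuple (z, cross_entropy_on_samples, error_on_samples).
The implementation details are discussed after the code.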
# Creates a subgraph computing cross-entropy with sampled softmax.
def cross_entropy_with_sampled_softmax_and_embedding(
    hidden_vector,
    target_vector,
    num_classes,
    hidden_dim,
    num_samples,
    sampling_weights,
    allow_duplicates = True
    ):
    # Define the learnable parameters
    b = C.Parameter(shape = (num_classes, 1), init = 0)
    W = C.Parameter(shape = (num_classes, hidden_dim), init = C.glorot_uniform())

    # Define the node that generates a set of random samples per minibatch
    # Sparse matrix (num_samples * num_classes)
    sample_selector = C.random_sample(sampling_weights, num_samples, allow_duplicates)

    # For each of the samples we also need the probability that it is in the sampled set.
    # dense row [1 * num_classes]
    inclusion_probs = C.random_sample_inclusion_frequency(sampling_weights, num_samples, allow_duplicates)
    # dense row [1 * num_classes]
    log_prior = C.log(inclusion_probs)

    # Create a submatrix wS of 'weights'
    # [num_samples * hidden_dim]
    W_sampled = C.times(sample_selector, W)
    # [num_samples]
    z_sampled = C.times_transpose(W_sampled, hidden_vector) + C.times(sample_selector, b) - C.times_transpose(sample_selector, log_prior)

    # Getting the weight vector for the true label. Dimension hidden_dim
    # [1 * hidden_dim]
    W_target = C.times(target_vector, W)
    # [1]
    z_target = C.times_transpose(W_target, hidden_vector) + C.times(target_vector, b) - C.times_transpose(target_vector, log_prior)

    z_reduced = C.reduce_log_sum_exp(z_sampled)

    # Compute the cross entropy that is used for training.
    # We don't check whether any of the classes in the random samples coincides with the true label,
    # so it might happen that the true class is counted twice in the normalizing denominator
    # of sampled softmax.
    cross_entropy_on_samples = C.log_add_exp(z_target, z_reduced) - z_target

    # For applying the model we also output a node providing the input for the full softmax
    z = C.times_transpose(W, hidden_vector) + b
    z = C.reshape(z, shape = (num_classes))

    zSMax = C.reduce_max(z_sampled)
    error_on_samples = C.less(z_target, zSMax)
    return (z, cross_entropy_on_samples, error_on_samples)
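To make the arithmetic of the graph above easier to follow, here is a rough plain-Python sketch of the same criterion for a single example. It is not CNTK's implementation: the function names are hypothetical, sampling is done with replacement via random.choices (loosely corresponding to allow_duplicates = True), and the inclusion frequency is assumed to be the expected number of times a class is drawn.

```python
import math
import random

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sampled_softmax_cross_entropy(z, target, sampling_weights, num_samples, rng):
    """Cross entropy of the target against a random sample of classes.

    z holds the scores W h + b for all classes; only the entries of the
    sampled classes and of the target are actually read.
    """
    total = sum(sampling_weights)
    # Assumption: expected number of draws per class when sampling with replacement.
    expected_count = [num_samples * w / total for w in sampling_weights]
    samples = rng.choices(range(len(z)), weights=sampling_weights, k=num_samples)
    # Subtract the log prior from the sampled scores and from the target score.
    z_sampled = [z[k] - math.log(expected_count[k]) for k in samples]
    z_target = z[target] - math.log(expected_count[target])
    # As in the graph above, the target may also occur among the samples; we do not check.
    return log_sum_exp([z_target] + z_sampled) - z_target

rng = random.Random(0)
num_classes = 1000
scores = [rng.gauss(0.0, 1.0) for _ in range(num_classes)]
uniform_weights = [1.0] * num_classes
print(sampled_softmax_cross_entropy(scores, 3, uniform_weights, 50, rng))
```

With identical scores and uniform weights the result is log(num_samples + 1) regardless of the number of classes, which illustrates that this value is a criterion over the sampled candidates rather than the full-softmax cross entropy.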
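To give readers an intuitive idea of how the inputs and outputs of our function differ from those of the ordinary softmax, we implement the ordinary softmax in the same style as the code above.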
# Creates subgraph computing cross-entropy with (full) softmax.
def cross_entropy_with_softmax_and_embedding(
    hidden_vector,  # Node providing hidden input
    target_vector,  # Node providing the expected labels (as sparse vectors)
    num_classes,    # Number of classes
    hidden_dim      # Dimension of the hidden vector
    ):
    # Setup bias and weights
    b = C.Parameter(shape = (num_classes, 1), init = 0)
    W = C.Parameter(shape = (num_classes, hidden_dim), init = C.glorot_uniform())

    z = C.reshape( C.times_transpose(W, hidden_vector) + b, (1, num_classes))

    # Use cross_entropy_with_softmax
    cross_entropy = C.cross_entropy_with_softmax(z, target_vector)

    zMax = C.reduce_max(z)
    zT = C.times_transpose(z, target_vector)
    error_on_samples = C.less(zT, zMax)

    return (z, cross_entropy, error_on_samples)
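As you can see, our function differs from the API function in the following ways:
- Our function includes the mapping $z = W h + b$.
- We return a tuple (z, cross_entropy, error_on_samples) instead of only the cross entropy.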
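A toy example
To explain how to use sampled softmax end to end, let us look at a small example. In this example we first transform the one-hot input vectors via a random projection into a lower-dimensional vector $h$. The modeling task is then to reverse this mapping using (sampled) softmax.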
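The random projection in this example can be pictured without CNTK: multiplying a one-hot row vector by a matrix simply selects one row of that matrix. The sketch below uses plain Python with illustrative sizes and hypothetical helper names (one_hot, project); the scale of the random entries mirrors sqrt(1 / hidden_dim) as used in the example code.

```python
import math
import random

def one_hot(index, num_classes):
    """Return a one-hot list with a 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def project(vector, matrix):
    """Row vector times matrix; the result has one entry per matrix column."""
    hidden_dim = len(matrix[0])
    return [sum(x * row[j] for x, row in zip(vector, matrix))
            for j in range(hidden_dim)]

num_classes, hidden_dim = 5, 3          # illustrative sizes only
rng = random.Random(1)
# Constant random projection matrix with standard deviation sqrt(1 / hidden_dim)
random_matrix = [[rng.gauss(0.0, math.sqrt(1.0 / hidden_dim)) for _ in range(hidden_dim)]
                 for _ in range(num_classes)]

h = project(one_hot(2, num_classes), random_matrix)
print(h)
print(h == random_matrix[2])
```

Each class is thus represented by a fixed random vector h, and the trainable layer has to learn which class a given h came from.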
import numpy as np
from math import log, exp, sqrt
from cntk.logging import ProgressPrinter
import timeit

# A class with all parameters
class Param:
    # Learning parameters
    learning_rate = 0.03
    minibatch_size = 100
    num_minbatches = 100
    test_set_size = 1000
    momentum_time_constant = 5 * minibatch_size
    reporting_interval = 10
    allow_duplicates = False

    # Parameters for sampled softmax
    use_sampled_softmax = True
    use_sparse = True
    softmax_sample_size = 10

    # Details of data and model
    num_classes = 50
    hidden_dim = 10

data_sampling_distribution = lambda: np.repeat(1.0 / Param.num_classes, Param.num_classes)

softmax_sampling_weights = lambda: np.repeat(1.0 / Param.num_classes, Param.num_classes)

# Creates random one-hot vectors of dimension 'num_classes'.
# Returns a tuple with a list of one-hot vectors, and list with the indices they encode.
def get_random_one_hot_data(num_vectors):
    indices = np.random.choice(
        range(Param.num_classes),
        size=num_vectors,
        p = data_sampling_distribution()).reshape((num_vectors, 1))
    list_of_vectors = C.Value.one_hot(indices, Param.num_classes)
    return (list_of_vectors, indices.flatten())

# Create a network that:
# * Transforms the input one hot-vectors with a constant random embedding
# * Applies a linear decoding with parameters we want to learn
def create_model(labels):
    # random projection matrix
    random_data = np.random.normal(scale = sqrt(1.0/Param.hidden_dim), size=(Param.num_classes, Param.hidden_dim)).astype(np.float32)
    random_matrix = C.constant(shape = (Param.num_classes, Param.hidden_dim), value = random_data)
    h = C.times(labels, random_matrix)

    # Connect the latent output to (sampled/full) softmax.
    if Param.use_sampled_softmax:
        # Pass the float32 weights that were prepared here; the original computed
        # them (plus a reshape whose result was discarded) and then ignored them.
        sampling_weights = np.asarray(softmax_sampling_weights(), dtype=np.float32)
        softmax_input, ce, errs = cross_entropy_with_sampled_softmax_and_embedding(
            h,
            labels,
            Param.num_classes,
            Param.hidden_dim,
            Param.softmax_sample_size,
            sampling_weights,
            Param.allow_duplicates)
    else:
        softmax_input, ce, errs = cross_entropy_with_softmax_and_embedding(
            h,
            labels,
            Param.num_classes,
            Param.hidden_dim)

    return softmax_input, ce, errs

def train(do_print_progress):
    labels = C.input_variable(shape = Param.num_classes, is_sparse = Param.use_sparse)
    z, cross_entropy, errs = create_model(labels)

    # Setup the trainer
    learning_rate_schedule = C.learning_rate_schedule(Param.learning_rate, C.UnitType.sample)
    momentum_schedule = C.momentum_as_time_constant_schedule(Param.momentum_time_constant)
    learner = C.momentum_sgd(z.parameters, learning_rate_schedule, momentum_schedule, True)
    progress_writers = None
    if do_print_progress:
        progress_writers = [ProgressPrinter(freq=Param.reporting_interval, tag='Training')]
    trainer = C.Trainer(z, (cross_entropy, errs), learner, progress_writers)

    minbatch = 0
    average_cross_entropy = compute_average_cross_entropy(z)
    minbatch_data = [0]                           # store minibatch values
    cross_entropy_data = [average_cross_entropy]  # store cross_entropy values

    # Run training
    t_total = 0
    minbatches_measured = 0
    for minbatch in range(1, Param.num_minbatches):
        # Specify the mapping of input variables in the model to actual minibatch data to be trained with
        label_data, indices = get_random_one_hot_data(Param.minibatch_size)
        arguments = ({labels : label_data})

        # If do_print_progress is True, this will automatically print the progress using ProgressPrinter
        # The printed loss numbers are computed using the sampled softmax criterion
        t_start = timeit.default_timer()
        trainer.train_minibatch(arguments)
        t_end = timeit.default_timer()

        t_delta = t_end - t_start
        samples_per_second = Param.minibatch_size / t_delta

        # We ignore the time measurements of the first two minibatches
        if minbatch > 2:
            t_total += t_delta
            minbatches_measured += 1

        # For comparison also print result using the full criterion
        if minbatch % Param.reporting_interval == int(Param.reporting_interval/2):
            # memorize the progress data for plotting
            average_cross_entropy = compute_average_cross_entropy(z)
            minbatch_data.append(minbatch)
            cross_entropy_data.append(average_cross_entropy)

            if do_print_progress:
                print("\nMinbatch=%d Cross-entropy from full softmax = %.3f perplexity = %.3f samples/s = %.1f"
                      % (minbatch, average_cross_entropy, exp(average_cross_entropy), samples_per_second))

    # Number of samples we measured. Count only the minibatches that were actually timed.
    samples_measured = Param.minibatch_size * minbatches_measured
    overall_samples_per_second = samples_measured / t_total
    return (minbatch_data, cross_entropy_data, overall_samples_per_second)

def compute_average_cross_entropy(softmax_input):
    vectors, indices = get_random_one_hot_data(Param.test_set_size)
    total_cross_entropy = 0.0
    arguments = (vectors)
    z = softmax_input.eval(arguments).reshape(Param.test_set_size, Param.num_classes)

    for i in range(len(indices)):
        log_p = log_softmax(z[i], indices[i])
        total_cross_entropy -= log_p

    return total_cross_entropy / len(indices)

# Computes log(softmax(z,index)) for a one-dimensional numpy array z in a numerically stable way.
def log_softmax(z,     # numpy array
                index  # index into the array
                ):
    max_z = np.max(z)
    return z[index] - max_z - log(np.sum(np.exp(z - max_z)))

np.random.seed(1)

print("start...")
train(do_print_progress = True)
print("done.")
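In the code above we use two different methods to report training progress:
- A function that computes the average cross entropy with the full softmax
- CNTK's built-in ProgressPrinter
ProgressPrinter reports how the value of the training criterion changes during training. In this example the training criterion is the cross entropy based on sampled softmax.
So although ProgressPrinter already gives us an idea of how training is going, if we want to compare different sampling schemes we should not rely on numbers that are computed only on the sampled subset of classes.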
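One way to see why figures computed on the sampled subset are not comparable with full-softmax figures: error_on_samples in the code above only signals an error when one of the sampled classes outscores the target. The following plain-Python sketch, with made-up random scores, hypothetical helper names and uniform sampling with replacement (so the log-prior correction is the same for every class and is left out), counts how often that check fires compared with a check against all classes.

```python
import random

def true_error(z, target):
    """1 if any class scores higher than the target, else 0."""
    return int(z[target] < max(z))

def sampled_error(z, target, num_samples, rng):
    """1 if any of num_samples uniformly drawn classes scores higher than the target."""
    samples = [rng.randrange(len(z)) for _ in range(num_samples)]
    return int(z[target] < max(z[k] for k in samples))

rng = random.Random(0)
num_classes, num_samples, trials = 50, 10, 2000
true_errors = 0
sampled_errors = 0
for _ in range(trials):
    z = [rng.gauss(0.0, 1.0) for _ in range(num_classes)]
    target = rng.randrange(num_classes)
    true_errors += true_error(z, target)
    sampled_errors += sampled_error(z, target, num_samples, rng)

print("error rate against all classes:    ", true_errors / trials)
print("error rate against sampled classes:", sampled_errors / trials)
```

Because the maximum over a subset can never exceed the maximum over all classes, the sampled error count can only be lower than or equal to the true one, and how much lower depends on the sample size.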
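Importance sampling
In general, our data is not uniformly distributed over the classes. A typical example is when the output classes are words: the word "the" occurs far more frequently than most other words.
In such cases we usually do not sample uniformly in sampled softmax. Instead, frequently occurring classes get a higher sampling weight, which is also called importance sampling. In the code below, the sampling distribution is controlled by the weight array returned by softmax_sampling_weights.
As our example we choose a Zipf distribution:
$$p[i] \propto \frac{1}{i + 5}$$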
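As a small illustration, the Zipf-like weights 1 / (i + 5) used in the code of this section can be normalized into a probability distribution as follows. This is plain Python; the function name zipf_weight and its offset parameter are chosen for this sketch only.

```python
def zipf_weight(index, offset=5):
    """Unnormalized Zipf-like weight of the class with the given index."""
    return 1.0 / (index + offset)

num_classes = 50    # same value as Param.num_classes in the toy example
weights = [zipf_weight(i) for i in range(num_classes)]
total = sum(weights)
probabilities = [w / total for w in weights]

# Ratio between the most frequent and the least frequent class
ratio = probabilities[0] / probabilities[-1]
print(probabilities[:3], ratio)
```

With 50 classes the most frequent class is roughly 10.8 times as likely as the least frequent one (54 / 5), so a uniform sampler would spend many of its samples on classes that rarely occur in the data.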
The code below trains once with uniform sampling and once with Zipf sampling so that the two can be compared:
# We want to plot the data
import matplotlib.pyplot as plt

# Define weights of the Zipfian distribution
def zipf(index):
    return 1.0 / (index + 5)

# Use the Zipfian distribution for the classes
def zipf_sampling_weights():
    return np.asarray([ zipf(i) for i in range(Param.num_classes)], dtype=np.float32)

data_sampling_distribution = lambda: zipf_sampling_weights() / np.sum(zipf_sampling_weights())

print("start...")

# Train using uniform sampling (like before)
np.random.seed(1)
softmax_sampling_weights = lambda: np.repeat(1.0/Param.num_classes, Param.num_classes)
minibatch_data, cross_entropy_data, _ = train(do_print_progress = False)

# Train using importance sampling
np.random.seed(1)
softmax_sampling_weights = zipf_sampling_weights
minibatch_data2, cross_entropy_data2, _ = train(do_print_progress = False)

plt.plot(minibatch_data, cross_entropy_data, 'r--', minibatch_data2, cross_entropy_data2, 'b--')
plt.xlabel('number of mini-batches')
plt.ylabel('cross entropy')
plt.show()
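[Figure: cross entropy against the number of mini-batches, uniform sampling in red and Zipf sampling in blue]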
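In the example above we compare uniform sampling (red) with Zipf sampling (blue). You will need to experiment yourself to find the best settings for the softmax parameters.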
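What speed-up to expect
The difference in training speed between full softmax and sampled softmax depends on the concrete settings, for example:
- The number of classes. The speed-up typically grows with the number of output classes.
- The number of samples used in sampled softmax
- The dimension of the hidden-layer input
- The minibatch size
- The hardware
In addition, you need to test how far you can reduce the sample size without degrading the training result.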
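For a rough feeling of how the first three factors interact, one can count the multiply-adds needed to score the classes for one training example. This is only a back-of-the-envelope sketch in plain Python with hypothetical function names. It assumes the cost is dominated by the products of weight rows with the hidden vector, and it ignores the cost of drawing samples, gradient updates, minibatching and hardware effects, so measured speed-ups can differ considerably.

```python
def multiply_adds_full(num_classes, hidden_dim):
    """Multiply-adds to compute W h for all classes, per training example."""
    return num_classes * hidden_dim

def multiply_adds_sampled(num_samples, hidden_dim):
    """Multiply-adds for the sampled classes plus the target class."""
    return (num_samples + 1) * hidden_dim

num_classes, hidden_dim = 50000, 10     # values taken from the speed test settings
ratios = []
for num_samples in [5, 10, 100, 1000]:
    ratio = multiply_adds_full(num_classes, hidden_dim) / multiply_adds_sampled(num_samples, hidden_dim)
    ratios.append(ratio)
    print("sample size %4d: about %.1fx fewer multiply-adds" % (num_samples, ratio))
```

In this simple count the hidden dimension cancels out of the ratio, leaving the number of classes divided by the sample size plus one.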
print("start...")

# Reset parameters
class Param:
    # Learning parameters
    learning_rate = 0.03
    minibatch_size = 8
    num_minbatches = 100
    test_set_size = 1  # we are only interested in speed
    momentum_time_constant = 5 * minibatch_size
    reporting_interval = 1000000  # Switch off reporting to speed up
    allow_duplicates = False

    # Parameters for sampled softmax
    use_sampled_softmax = True
    use_sparse = True
    softmax_sample_size = 10

    # Details of data and model
    num_classes = 50000
    hidden_dim = 10

data_sampling_distribution = lambda: np.repeat(1.0 / Param.num_classes, Param.num_classes)
softmax_sampling_weights = lambda: np.repeat(1.0 / Param.num_classes, Param.num_classes)

sample_sizes = [5, 10, 100, 1000]
speed_with_sampled_softmax = []

# Get the speed with sampled softmax for different sizes
for sample_size in sample_sizes:
    print("Measuring speed of sampled softmax for sample size %d ..." % (sample_size))
    Param.use_sampled_softmax = True
    Param.softmax_sample_size = sample_size
    _, _, samples_per_second = train(do_print_progress = False)
    speed_with_sampled_softmax.append(samples_per_second)

# Get the speed with full softmax
Param.use_sampled_softmax = False
print("Measuring speed of full softmax ...")
_, _, samples_per_second = train(do_print_progress = False)
speed_without_sampled_softmax = np.repeat(samples_per_second, len(sample_sizes))

# Plot the speed of sampled softmax (blue) as a function of sample sizes
# and compare it to the speed with full softmax (red).
plt.plot(sample_sizes, speed_without_sampled_softmax, 'r--', sample_sizes, speed_with_sampled_softmax, 'b--')
plt.xlabel('softmax sample size')
plt.ylabel('speed: instances / second')
plt.title("Speed 'sampled softmax' (blue) vs. 'full softmax' (red)")
plt.ylim(bottom=0)  # 'ymin' is no longer accepted by recent matplotlib versions
plt.show()
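[Figure: speed in instances per second against softmax sample size, sampled softmax in blue and full softmax in red]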