DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense: Walkthrough and Experiments

Source: Internet. Editor: 程序博客网. Time: 2024/05/21 14:44

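I recently wanted to compress a model, but to my surprise I could not find a walkthrough of this paper anywhere online. So here I write up my understanding of it, along with my experiments on compressing neural networks.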

Translation of the key passages

[Original] The downside of such large models is that they are prone to capturing the noise, rather than the intended pattern, in the training dataset. This noise does not generalize to new datasets, leading to over-fitting and a high variance.

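[Translation] The drawback of large models is that they tend to pick up the noise in the training dataset instead of the intended pattern. That noise does not carry over to new datasets, which leads to overfitting and high variance.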

[Original] The first D step learns the connectivity via normal network training on the dense network. Unlike conventional training, however, the goal of this D step is not to learn the final values of the weights; rather, we are learning which connections are important.

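[Translation] The first D step: train a dense network the usual way. Unlike conventional training, however, the goal of this D step is not to learn the final values of the weights but to learn which connections are important.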

[Original] The S step prunes the low-weight connections and retrains the sparse network. All connections with weights below a threshold are removed from the network, converting a dense network into a sparse network. This truncation-based procedure has provable advantage in statistical accuracy in comparison with their non-truncated counterparts.

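[Translation] The S step prunes the low-weight connections and retrains the sparse network. Every connection whose weight is below a threshold is removed from the network (note: each layer has its own threshold), which turns the dense network into a sparse one. Compared with non-truncated methods (for example, L2 regularization), this truncation-based method has a provable advantage, and that advantage is stated in terms of statistical accuracy.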

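(Put plainly, as I read it: judged by the results measured on the test set, the truncation-based method does better than the non-truncated ones. To me this pruning technique is a bit like the dropout placed after fully connected layers. It is simply a harsher measure than L2 regularization: L2 shrinks the weights, whereas pruning removes the small-weight connections outright, and that is how it guards against overfitting.)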
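To make the "one threshold per layer" point concrete, here is a minimal NumPy sketch of magnitude pruning. It is my own illustration, not code from the paper: the function name `magnitude_mask`, the layer names `fc1` and `fc2`, and the 50% sparsity are all assumptions. The threshold is derived from a target sparsity, so two layers with different weight scales end up with different thresholds.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Return (mask, threshold) pruning the `sparsity` fraction of smallest-|w| entries.

    Hypothetical helper for illustration; ties at the threshold are pruned too.
    """
    flat = np.sort(np.abs(weights).ravel())
    k = int(sparsity * flat.size)          # number of connections to remove
    if k == 0:
        return np.ones_like(weights, dtype=bool), 0.0
    threshold = flat[k - 1]                # largest magnitude that still gets pruned
    mask = np.abs(weights) > threshold     # True = connection is kept
    return mask, threshold

rng = np.random.default_rng(0)
# Two made-up layers with very different weight scales.
layers = {
    "fc1": rng.normal(0.0, 0.05, size=(8, 16)),
    "fc2": rng.normal(0.0, 0.50, size=(16, 4)),
}
for name, w in layers.items():
    mask, thr = magnitude_mask(w, sparsity=0.5)
    layers[name] = w * mask                # pruned connections become exactly 0
    print(name, "threshold=%.4f" % thr, "kept=%d/%d" % (mask.sum(), mask.size))
```

Unlike an L2 penalty, which only shrinks these weights, the mask sets them to exactly zero.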

[Original] The final D step recovers the pruned connections, making the network dense again. These previously-pruned connections are initialized to zero and retrained with 1/10 the original learning rate (since the sparse network is already at a good local minima). Dropout ratios and weight decay remained unchanged. Restoring the pruned connections increases the dimensionality of the network, and more parameters make it easier for the network to slide down the saddle point to arrive at a better local minima. This step adds model capacity and lets the model have less bias.

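[Translation] The final D step: reconnect the weights that were pruned in the previous step, so the network becomes dense again. Training setup: the weight initialization has two parts. The weights of the sparse network trained in the S step are kept as they are, while the newly restored connections are initialized to 0 (this amounts to fine-tuning the network). The learning rate is set to 1/10 of the original learning rate. The dropout ratio and weight_decay stay unchanged. Restoring the pruned connections increases the dimensionality of the network, and the extra parameters make it easier for the network to slide down off the saddle point and reach a better local minimum. This step adds model capacity and gives the model less bias.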
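A small sketch of what "restore the pruned connections" means in practice, under my own assumptions: the 4x4 matrix, the 0.5 cutoff, `base_lr = 0.1`, and the random stand-in gradient are all made up for illustration. The point is only that the restored entries start from exactly 0, the surviving entries keep their S-step values, and the update no longer applies a mask.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical state at the end of the S step: pruned entries are exactly 0.
w_sparse = rng.normal(size=(4, 4))
mask = np.abs(w_sparse) > 0.5
w_sparse = w_sparse * mask

base_lr = 0.1                 # assumed learning rate of the first D step
redense_lr = base_lr / 10     # 1/10 of the original learning rate

def redense_step(w, grad, lr):
    """Plain SGD update with no mask, so every connection may move again."""
    return w - lr * grad

# Restored connections start from 0; surviving ones keep their S-step values.
w_dense = w_sparse.copy()
grad = rng.normal(size=w_dense.shape)   # stand-in for a real gradient
w_dense = redense_step(w_dense, grad, redense_lr)

print("zeros before the update:", int((w_sparse == 0).sum()))
print("zeros after the update :", int((w_dense == 0).sum()))
```

In a real framework the same effect would come from simply dropping the mask and lowering the optimizer's learning rate, with dropout and weight decay left as they were.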

A picture is worth a thousand words

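This figure shows how the weight distribution changes over the course of DSD:
(a) The weight distribution after the first D step.
(b) The weight distribution after pruning. Pruning is by absolute value: every weight whose absolute value is below the threshold is set to 0, so the cut is symmetric about 0.
(c) The distribution after the pruned network has been retrained. The edges on the side facing 0 become soft instead of being a vertical cliff.
(d) The pruned weights are restored and set to 0. A very thin spike appears at 0.
(e) The distribution after training the model from (d). The two peaks of the bimodal distribution barely change, but the middle part near 0 becomes wider.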

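This figure shows the main flow of the algorithm. A few observations:
#1: The dense training in steps 1 and 3 follows the same procedure (apart from the lower learning rate in step 3).
#2: In step 2, lambda is the pruning threshold set for each layer. Weights whose absolute value is below that threshold are set to 0, and afterwards the parameter matrix is multiplied element-wise by Mask.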
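The three steps can be sketched end to end on a toy problem. This is only my own illustration under loud assumptions: a linear model with a mean squared error loss stands in for the network, and the data, the step counts, `lr = 0.05`, and the 50% sparsity are all made up. What it does follow from the description above is the order of operations: dense training, then lambda and Mask with the weights multiplied by Mask after every update, then dense training again with 1/10 of the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data standing in for a real training set.
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]
y = X @ true_w + 0.1 * rng.normal(size=200)

def grad_mse(w):
    """Gradient of the mean squared error of the linear model X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def train(w, lr, steps, mask=None):
    """Gradient descent; if a mask is given, pruned weights are held at 0."""
    for _ in range(steps):
        w = w - lr * grad_mse(w)
        if mask is not None:
            w = w * mask                  # parameter matrix * Mask
    return w

lr, sparsity = 0.05, 0.5

# Step 1 (Dense): ordinary training to find out which connections matter.
w = train(rng.normal(scale=0.1, size=20), lr, steps=200)

# Step 2 (Sparse): lambda is the k-th smallest |w|; keep only weights above it.
k = int(sparsity * w.size)
lam = np.sort(np.abs(w))[k - 1]
mask = np.abs(w) > lam
w_sparse = train(w * mask, lr, steps=200, mask=mask)

# Step 3 (re-Dense): no mask, pruned weights restart from 0, learning rate / 10.
w_final = train(w_sparse, lr / 10, steps=200)

print("nonzero after S:", int(np.count_nonzero(w_sparse)), "of", w_sparse.size)
print("nonzero after D:", int(np.count_nonzero(w_final)), "of", w_final.size)
```

Passing `mask=None` in step 3 is what makes the network dense again: the same `train` function is reused, which matches the observation that steps 1 and 3 share one procedure.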

This post will keep being updated.
