Neural Network Compression: Mimic (1) Do Deep Nets Really Need to be Deep


Do Deep Nets Really Need to be Deep?

Paper link: http://arxiv.org/abs/1312.6184

  • Main idea
    Shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models.
    The paper uses a model compression approach: a shallow network is trained to mimic a deep network, and the resulting shallow (mimic) model reaches nearly the same accuracy as the deep model. (A shallow network trained directly on the original data and labels, however, still cannot match the deep network.)
  • Train Shallow Nets to Mimic Deep Nets
    The shallow network is trained in two steps:

    • train a state-of-the-art deep model
    • train a shallow model to mimic the deep model
      The shallow network is trained from the trained deep network together with unlabeled data. This shallow (mimic) model is not trained directly on the original data and labels; instead, the data is fed through the deep network, and the mimic model learns the function the deep network has already learned (a minimal sketch of this labeling step follows this list). In the paper's experiments the unlabeled data is obtained by discarding the labels of the original data, but two caveats matter: first, the unlabeled samples should preferably not be just the deep model's training set with its labels removed, because the deep model tends to overfit those points; second, the number of unlabeled samples should be much larger than the deep model's training set, so that the mimic model can better approximate the deep model. Mimic learning works best when the unlabeled set is larger than the training set.
      Reference [3] (Model Compression, 2006) used a model compression method to turn a complex ensemble into a single network (a single neural net of modest size was trained to mimic a much larger ensemble of models), and also proposed three ways of obtaining unlabeled data.
      When a complex model can be mimicked by a shallow model, the function the complex model has learned is not truly complex. A model's complexity and the complexity of what it can represent are two different things.
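A minimal sketch of the second step, assuming `teacher` is an already-trained deep model and `unlabeled_loader` yields batches of unlabeled inputs (both names are placeholders): the unlabeled data is pushed through the teacher, and the teacher's outputs, not the original 0/1 labels, become the training targets for the shallow student.

```python
import torch

@torch.no_grad()
def collect_teacher_targets(teacher, unlabeled_loader):
    """Run unlabeled inputs through the trained deep net and keep its
    outputs; these supervise the shallow mimic model."""
    teacher.eval()
    xs, zs = [], []
    for x in unlabeled_loader:        # x: a batch of unlabeled inputs
        zs.append(teacher(x))         # teacher outputs for this batch
        xs.append(x)
    return torch.cat(xs), torch.cat(zs)
```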
  • Mimic Learning
    A normal network is trained with cross-entropy on the n probability values p_k = e^{z_k} / Σ_j e^{z_j}, i.e. the softmax outputs; the mimic model is instead trained directly on the n log probability values z. The paper calls these pre-softmax values the logits:

    • By training the student model on the logits directly, the student is better able to learn the internal model learned by the teacher, without suffering from the information loss that occurs after passing through logits to probability space;
    • In other words, the supervision connects the layer just before the softmax of the teacher and the student: the student learns from that layer, and the loss is computed there;
    • Formula: the student's logits are regressed onto the teacher's logits with an L2 loss, L(W, β) = 1/(2T) Σ_t ‖g(x⁽ᵗ⁾; W, β) − z⁽ᵗ⁾‖²₂, where z⁽ᵗ⁾ are the teacher's logits and g(x⁽ᵗ⁾; W, β) = β f(W x⁽ᵗ⁾) is the student model (see the sketch after this list);
    • Note: normalization is not crucial, but it still helps;
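A minimal sketch of this logit-regression objective, assuming `student` is the shallow net and `z_teacher` holds the teacher's logits for the batch (names are placeholders):

```python
def mimic_loss(student, x, z_teacher):
    """L(W, beta) = 1/(2T) * sum_t ||g(x_t; W, beta) - z_t||^2:
    regress the student's pre-softmax logits onto the teacher's logits."""
    z_student = student(x)                                  # student logits
    return 0.5 * ((z_student - z_teacher) ** 2).sum(dim=1).mean()
```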
  • Speed-up
    By introducing a linear layer

    • The mimic model has few layers but many hidden units, so there are O(HD) parameters between the input and hidden layers (D is the input feature dimension and H is the number of hidden units); this makes computation expensive and convergence slow. The paper therefore inserts a linear layer with k linear hidden units between the input and the hidden layer. Because a linear layer can be absorbed into the adjacent weight matrices, the new model has exactly the same representational power as before, while the parameter count drops to O(k(H + D)), which is much smaller and speeds up convergence (see the sketch after this list).
    • Re-parameterizing the weight matrix this way not only speeds up convergence but also greatly reduces memory, which in turn makes it possible to train larger shallow networks.
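A sketch of this factorization in PyTorch, with illustrative (assumed) sizes: D input features, H non-linear hidden units, a k-unit linear bottleneck, and C output classes. The bottleneck version has D·k + k·H input-side parameters instead of D·H, and since the extra layer has no non-linearity it could be folded back into a single weight matrix, so expressiveness is unchanged.

```python
import torch.nn as nn

D, H, k, C = 2000, 400_000, 250, 183     # illustrative sizes, not from the post

# Direct shallow mimic net: the input-to-hidden matrix alone has D*H parameters.
direct = nn.Sequential(nn.Linear(D, H), nn.Sigmoid(), nn.Linear(H, C))

# Same net with a k-unit linear bottleneck between input and hidden layer:
# D*k + k*H parameters on the input side; W = W2 @ W1 can be recovered exactly,
# so representational power is identical to the direct net.
factored = nn.Sequential(nn.Linear(D, k, bias=False),   # W1: D -> k (linear)
                         nn.Linear(k, H),               # W2: k -> H
                         nn.Sigmoid(),
                         nn.Linear(H, C))
```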
  • Results-Discussion

    • For the detailed results, see the paper itself;
    • Discussion: training the student model from the teacher model filters out some of the influence of mislabeled examples; the student model learns soft labels: the mimic model sees non-zero targets for most outputs on most training cases, and learning this uncertainty carries more information than being forced to fit hard 0/1 labels;
    • The paper also discusses the learning and representational capacity of shallow models: a better teacher generally yields a better shallow student. In short, shallow models with a number of parameters comparable to deep models are likely capable of learning even more accurate functions if a more accurate teacher and/or more unlabeled data became available;
    • The results suggest that it may be possible to devise better learning algorithms for training more accurate shallow feed-forward nets than those currently in use.