SMALL-FOOTPRINT KEYWORD SPOTTING USING DEEP NEURAL NETWORKS


Study notes on SMALL-FOOTPRINT KEYWORD SPOTTING USING DEEP NEURAL NETWORKS


Since my ability is limited, my understanding may contain mistakes; if you notice any problem, please tell me directly in the comments and I will fix it promptly. Thanks. This is not a full translation: I only translated the parts I found most important and added my own understanding.

Paper source: G. Chen, C. Parada, and G. Heigold, "Small-footprint keyword spotting using deep neural networks," Proc. ICASSP 2014.
My thanks to the authors.

The Deep KWS system is a DNN-based wake-word method that produces a detection decision every 10 ms. Compared with the original HMM approach it reduces response time, since it does not have to take in the whole utterance before reacting, and it needs no decoding pass, which cuts both the amount of computation and the memory footprint.

ABSTRACT 
A deep neural network is trained to directly predict the keyword(s) or subword units of the keyword(s) followed by a posterior handling method producing a final confidence score.

  1. INTRODUCTION 
    Keyword Spotting (KWS) aims at detecting predefined keywords in an audio stream. In the classical generative approach, an HMM is trained for each keyword, and a filler (non-keyword) model HMM is trained from the non-keyword segments of the speech signal (fillers). We propose a simple discriminative KWS approach based on deep neural networks that is appropriate for mobile devices. We refer to it as Deep KWS. A deep neural network is trained to directly predict the keyword(s) or subword units of the keyword(s) followed by a posterior handling method producing a final confidence score.
  2. DEEP KWS SYSTEM 
    The framework consists of three major components: (i) a feature extraction module, (ii) a deep neural network, and (iii) a posterior handling module. The feature extraction module (i) performs voice-activity detection and generates a vector of features every frame (10 ms). 
    In short: a feature extraction module + a deep neural network (DNN) + a posterior handling module.
    We train a DNN (ii) to predict posterior probabilities for each output label from the stacked features.
    The large vector produced by the feature extraction module is the DNN's input; the DNN predicts the probability that these features correspond to each type of word, i.e. whether they belong to a keyword we want or a filler (a word we do not care about).
    The posterior handling module combines these scores to provide a final confidence score for that window. 
    The posterior handling module combines these per-frame probabilities into a final confidence score that decides between keyword and filler.
    2.1. Feature Extraction
    To reduce computation, we use a voice-activity detection system and only run the KWS algorithm in voice regions. The voice-activity detector, described in [14], uses 13-dimensional PLP features and their deltas and double-deltas as input to a trained 30-component diagonal covariance GMM, which generates speech and non-speech posteriors at every frame. This is followed by a hand-tuned state machine (SM), which performs temporal smoothing by identifying regions where many frame speech posteriors exceed a threshold.
    The above describes the voice-activity detector of reference [14].
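    For intuition, here is a minimal numpy sketch of how a diagonal-covariance GMM can produce a per-frame speech posterior. The component parameters and the speech/non-speech tagging of components are hypothetical placeholders, not the trained 30-component model of [14], and the hand-tuned state machine is not shown.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def speech_posterior(x, means, variances, weights, is_speech):
    """P(speech | x) from mixture components tagged speech/non-speech.

    means, variances: (n_components, dim); weights: (n_components,);
    is_speech: boolean mask over components. All placeholders here.
    """
    log_post = np.log(weights) + np.array(
        [log_gauss_diag(x, m, v) for m, v in zip(means, variances)])
    post = np.exp(log_post - log_post.max())  # stabilize before normalizing
    post /= post.sum()
    return post[is_speech].sum()
```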
    For the speech regions, we generate acoustic features based on 40-dimensional log-filterbank energies computed every 10 ms over a window of 25 ms.
    For our Deep KWS system, we use 10 future frames and 30 frames in the past. For the HMM baseline system we use 5 future frames and 10 frames in the past, as this provided the best trade-off between accuracy, latency, and computation [15]. These context frames are stacked with the current frame to form the DNN input, as sketched below.
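    A minimal numpy sketch of this frame stacking; the window sizes (30 past, 10 future) are from the paper, while the edge-padding at utterance boundaries is my assumption.

```python
import numpy as np

def stack_frames(feats, n_past=30, n_future=10):
    """Stack context frames: (T, 40) log-filterbank features ->
    (T, (n_past + 1 + n_future) * 40) DNN input vectors."""
    T = feats.shape[0]
    padded = np.pad(feats, ((n_past, n_future), (0, 0)), mode="edge")
    width = n_past + 1 + n_future
    return np.stack([padded[t:t + width].ravel() for t in range(T)])
```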
    2.2. Deep Neural Network 
    The DNN is a standard feed-forward, fully connected network with k hidden layers of n nodes each. The last layer is a softmax that outputs the posterior of each output label, and the hidden layers use the rectified linear unit (ReLU) activation. The network size has to be chosen to match the set of output labels.
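    A minimal numpy sketch of this network's forward pass (shapes are illustrative; the paper's actual k and n vary by experiment):

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Fully connected net: ReLU hidden layers, softmax output.

    x: (batch, input_dim); weights/biases: one entry per layer,
    the last pair being the output layer over the labels.
    Returns per-frame label posteriors p_ij of shape (batch, n_labels).
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)              # ReLU hidden layer
    logits = h @ weights[-1] + biases[-1]
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)     # softmax posteriors
```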
    Labeling.
    For our baseline HMM system, as in previous work [8, 9, 17] the labels in the output layer of the neural network are context dependent HMM states. More specifically the baseline system uses 2002 context dependent states selected as described in [15]. 
    For the proposed Deep KWS, we report results with full word labels. These labels are generated at training time via forced alignment using our 50M-parameter LVCSR system. (The paper also discusses the advantages of whole-word labels, which I won't translate here.)
    Training
    Here p_ij denotes the posterior probability that the i-th label (0 ≤ i ≤ n−1, i an integer, where n is the number of labels and label 0 stands for non-keyword/filler) matches the j-th frame x_j.
    The weights and biases of the deep neural network, θ, are estimated by maximizing the cross-entropy training criterion over the labeled training data {xj, ij}j (previous paragraph). 
    The cross-entropy criterion being maximized is:

$$\theta^{*} = \arg\max_{\theta} \sum_{j} \log p_{i_j j}(x_j;\, \theta)$$
    We use asynchronous stochastic gradient descent with an exponential decay for the learning rate. 
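    A minimal numpy sketch of evaluating this criterion (the optimizer itself, asynchronous SGD with exponential learning-rate decay, is not shown):

```python
import numpy as np

def cross_entropy_criterion(posteriors, labels):
    """Sum of log-posteriors of the correct label per frame.

    posteriors: (T, n_labels) DNN outputs p_ij; labels: (T,) int i_j.
    Training maximizes this value (equivalently, minimizes its negative).
    """
    T = posteriors.shape[0]
    return float(np.sum(np.log(posteriors[np.arange(T), labels])))
```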
    Transfer learning
    Transfer learning refers to the situation where (some of) the network parameters are initialized with the corresponding parameters of an existing network, and are not trained from scratch. 
    Here, we use a deep neural network for speech recognition with suitable topology to initialize the hidden layers of the network. 
    This tends to give better, more reliable training results, and it helps avoid getting stuck in a poor local optimum.
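    A minimal sketch of this initialization, assuming the two networks share hidden-layer shapes; the output layer is re-initialized from scratch because the KWS labels differ from the speech recognizer's. Function and variable names are illustrative.

```python
import numpy as np

def init_from_pretrained(src_weights, src_biases, n_labels, seed=0):
    """Copy hidden layers from an existing net; fresh output layer."""
    rng = np.random.default_rng(seed)
    weights = [W.copy() for W in src_weights[:-1]]   # reuse hidden layers
    biases = [b.copy() for b in src_biases[:-1]]
    hidden_dim = weights[-1].shape[1]
    weights.append(rng.normal(0.0, 0.01, size=(hidden_dim, n_labels)))
    biases.append(np.zeros(n_labels))
    return weights, biases
```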
    2.3. Posterior Handling
    This section describes how the DNN posteriors from the previous step are combined into a keyword confidence score (an overall score for whether the keyword is present). When this score exceeds a threshold, a keyword detection is declared. The computation below is for a single keyword, but it can easily be extended to check multiple keywords at the same time.
    Posterior smoothing
    The raw posteriors are first smoothed by averaging over a window:

$$p'_{ij} = \frac{1}{j - h_{\mathrm{smooth}} + 1} \sum_{k = h_{\mathrm{smooth}}}^{j} p_{ik}, \qquad h_{\mathrm{smooth}} = \max(1,\; j - w_{\mathrm{smooth}} + 1)$$

    Here p'_ij is the smoothed version of p_ij, w_smooth is the size of the window, and h_smooth is the index of the first frame in that window. After all the notation, this simply amounts to taking the mean of p_ij over the window; that is the smoothing.
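    A direct numpy implementation of the smoothing equation (0-indexed frames):

```python
import numpy as np

def smooth_posteriors(posteriors, w_smooth=30):
    """Average each p_ij over the previous w_smooth frames -> p'_ij."""
    T = posteriors.shape[0]
    smoothed = np.empty_like(posteriors)
    for j in range(T):
        h = max(0, j - w_smooth + 1)        # h_smooth, 0-indexed
        smoothed[j] = posteriors[h:j + 1].mean(axis=0)
    return smoothed
```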
    Confidence
    The confidence at frame j is computed within a sliding window of size w_max:

$$\mathrm{confidence}(j) = \left[\, \prod_{i=1}^{n-1} \max_{h_{\mathrm{max}} \le k \le j} p'_{ik} \,\right]^{1/(n-1)}, \qquad h_{\mathrm{max}} = \max(1,\; j - w_{\mathrm{max}} + 1)$$

    We use w_smooth = 30 and w_max = 100, which gives the best results.
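    And a direct implementation of the confidence equation, assuming label 0 is the filler class so the product runs over the keyword labels 1..n−1:

```python
import numpy as np

def confidence_at(smoothed, j, w_max=100):
    """Geometric mean over keyword labels of their peak smoothed
    posterior within the last w_max frames (label 0 = filler)."""
    h = max(0, j - w_max + 1)                  # h_max, 0-indexed
    peaks = smoothed[h:j + 1, 1:].max(axis=0)  # per-label window maxima
    return float(np.prod(peaks) ** (1.0 / peaks.shape[0]))
```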
  3. BASELINE HMM KWS SYSTEM
    We implement a standard Keyword-Filler Hidden Markov Model as our baseline. The basic idea is to create a HMM for the keyword and a HMM to represent all non-keyword segments of the speech signal (filler model). 
    We implemented a triphone-based HMM model as filler. 
    [Figure: the Keyword-Filler HMM topology, combining the keyword HMM with the triphone filler HMM.]
    The keyword-filler HMM model is shown in the figure above. The paper lists many advantages of this model, which I won't translate here; it also discusses the benefits of transfer learning for this system.
  4. EXPERIMENTAL RESULTS
    We train a separate Deep KWS and build a separate Keyword-Filler HMM KWS system for each key-phrase. Results are presented in the form of modified receiver operating characteristic (ROC) curves, where we replace the true positive rate with the false reject rate on the Y-axis. Lower curves are better.
    In these plots the X-axis is the false alarm rate and the Y-axis is the false reject (miss) rate, so the closer a curve sits to the origin, the better. The paper spends a good while on how well their method does, i.e. how much closer to the origin its curves are. A sketch of how such a curve can be computed follows.
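    For intuition, a minimal numpy sketch of computing such a modified ROC curve from per-utterance confidence scores (hypothetical inputs; this is not the paper's evaluation code):

```python
import numpy as np

def modified_roc(pos_scores, neg_scores):
    """False-alarm rate vs. false-reject rate over all thresholds.

    pos_scores: confidences on utterances containing the keyword;
    neg_scores: confidences on utterances without it.
    """
    thresholds = np.unique(np.concatenate([pos_scores, neg_scores]))
    far = np.array([(neg_scores >= t).mean() for t in thresholds])
    frr = np.array([(pos_scores < t).mean() for t in thresholds])
    return far, frr
```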
    4.1. Data
    Omitted.
    4.2. Results
    The Deep KWS models outperform the baseline.
    4.3. Model Size
    In this case the number of parameters for the baseline increases to 2.6M while the Deep models reach 2.1M. 
    4.4. Noise Robustness
    Omitted. (In short: their model holds up well in noise and improves on the original.)
  5. CONCLUSION
    Summary and outlook; omitted.
  6. ACKNOWLEDGEMENTS