Tandem Features or Bottleneck Features


When I first saw these two terms, I did not immediately understand what they meant. In the paper Deep Neural Network based Text-Dependent Speaker Recognition: Preliminary Results, the original text reads:
Another approach that makes use of a phonetic discriminant DNN for speaker verification is the so-called bottleneck or tandem features approach [8]. A DNN is trained in supervised mode using the triphone state labels as targets. Once the network is trained, deep features can be extracted for every speech frame of a recording. The dimension of the deep feature is usually kept the same as the spectral feature by means of a bottleneck layer. These features in turn are used to train a backend classifier like a GMM-UBM or PLDA. Tandem features are formed by combining the deep feature and the spectral feature corresponding to a given speech frame.
Tandem features have been successfully applied to both text-independent and text-dependent speaker verification [9, 10]. In the text-dependent case, the approach was tested on the multiple pass-phrase task. To our knowledge this work represents the smallest dataset used for training neural net models.

[8] Tianfan Fu, Yanmin Qian, Yuan Liu, and Kai Yu, "Tandem deep features for text-dependent speaker verification," in INTERSPEECH, 2014, pp. 1327–1331.
Bottleneck features, also called tandem features: the former name is descriptive. The input dimension is generally larger than the hidden-layer dimension, and the last hidden layer typically reduces it to the dimension we want.
[Figure: tandem-features]
The figure is from [8]. Below is my understanding of tandem features, explained with reference to this figure:
input layer: right context = 5 + left context = 5 + current frame = 1 = 11 frames
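The 11-frame input window can be sketched as a simple frame-splicing step. This is a minimal sketch under my own assumptions: 39-dimensional PLP frames, and edge frames handled by repeating the first/last frame (the paper does not specify the padding scheme).

```python
import numpy as np

def splice_frames(feats, left=5, right=5):
    """Stack each frame with its left/right context frames.
    Edge frames are padded by repeating the first/last frame
    (an assumption; the paper does not specify padding)."""
    T, D = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)], axis=0)
    # Each output row is the flattened (left + 1 + right)-frame window
    return np.stack([padded[t:t + left + 1 + right].reshape(-1)
                     for t in range(T)])

# e.g. 100 frames of 39-dim PLP -> 100 x (11 * 39) spliced DNN input
x = np.random.randn(100, 39)
X = splice_frames(x)
```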
hidden layer: if the goal were only feature extraction, we could simply pre-train and reduce the last hidden layer to the desired dimension, but in this paper:
All the deep models have 7 hidden layers with 1024 nodes per layer. The experiments show that taking the activations of the second or fourth hidden layer and applying PCA gives the best results. This part is the deep feature. It is then concatenated with the original current frame's 39-dimensional PLP feature to form a 78-dimensional tandem deep feature, which is fed into GMM-UBM training of the UBM model. Traditionally, the UBM is instead trained directly on PLP or MFCC + VAD features.
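The PCA-then-concatenate step described above can be sketched as follows. This is a sketch under my assumptions: PCA is implemented via SVD of the mean-centered activations, and the deep feature is reduced to 39 dimensions to match the PLP feature (the paper's exact PCA setup may differ).

```python
import numpy as np

def tandem_features(hidden_acts, plp, target_dim=39):
    """Project hidden-layer activations (e.g. 1024-dim, from the 2nd or
    4th hidden layer) down to target_dim with PCA, then concatenate with
    the spectral (PLP) feature of the same frame."""
    mu = hidden_acts.mean(axis=0)
    centered = hidden_acts - mu
    # PCA via SVD: rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    deep = centered @ Vt[:target_dim].T          # T x 39 deep feature
    return np.concatenate([deep, plp], axis=1)   # T x 78 tandem feature

acts = np.random.randn(200, 1024)   # activations of a 1024-node hidden layer
plp = np.random.randn(200, 39)      # 39-dim PLP of the same frames
tandem = tandem_features(acts, plp)
```

The resulting 78-dimensional tandem features would then be used to train the GMM-UBM backend in place of raw PLP/MFCC features.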
output layer: the number of speakers plus the number of phone states. For example, on the RSR2015 dataset, the bkg and dev sets are used to train the DNN; together they contain 194 speakers, so the speaker label should be a 194-dimensional one-hot vector. The phone label should correspond to the number of tied triphone states obtained by GMM alignment, here 3001. The original text (abridged):
The state alignment for the phone DNN training was generated using a GMM model with 3001 tied triphone-states, which is built on a 50-hour SWB English task, and 194 classes (194 speakers in the bkg and dev sets) are used in the speaker DNN training.
So with this structure, the output layer should have 194 + 3001 = 3195 nodes.
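The output-layer sizing can be checked with a small sketch. Splitting the joint output into two softmax heads, one per task, is my assumption of how such a multi-task phone + speaker output would be handled; the paper's exact loss setup may differ.

```python
import numpy as np

N_SPEAKERS, N_STATES = 194, 3001     # bkg+dev speakers, tied triphone states
output_dim = N_SPEAKERS + N_STATES   # 194 + 3001 = 3195 joint output nodes

def split_posteriors(logits):
    """Apply separate softmaxes to the speaker half and the phone half
    of the joint output layer (assumed multi-task head layout)."""
    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    spk, phone = logits[..., :N_SPEAKERS], logits[..., N_SPEAKERS:]
    return softmax(spk), softmax(phone)

logits = np.random.randn(8, output_dim)   # a batch of 8 frames
p_spk, p_phone = split_posteriors(logits)
```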

The figure above shows a joint phone + speaker DNN training model; of course, the phone DNN training model and the speaker DNN training model can also be used separately. As the experiments in [8] show: "Once the neural network training process is finished, the output layers of the two multi-task joint-learned DNNs can be removed, and the rest of each of the neural networks (common hidden layers) is used to extract the speaker-text joint representative features."
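Removing the output layer and tapping a hidden layer as a feature extractor can be sketched as below. The ReLU activation, the random placeholder weights, and the function name `extract_deep_features` are all my assumptions for illustration, not the paper's actual network.

```python
import numpy as np

def extract_deep_features(x, weights, biases, tap_layer):
    """Forward pass through the shared hidden layers only (output layer
    removed); return the activations at tap_layer as deep features.
    ReLU hidden units are an assumption for this sketch."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases), start=1):
        h = np.maximum(0.0, h @ W + b)
        if i == tap_layer:
            return h
    return h

rng = np.random.default_rng(0)
dims = [429] + [1024] * 7    # 11x39 spliced input, 7 hidden layers of 1024
Ws = [rng.standard_normal((dims[i], dims[i + 1])) * 0.01 for i in range(7)]
bs = [np.zeros(dims[i + 1]) for i in range(7)]
# tap the 2nd hidden layer, as the experiments above suggest works well
feat = extract_deep_features(rng.standard_normal((5, 429)), Ws, bs, tap_layer=2)
```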
