Tandem Features or Bottleneck Features


When I first saw these two terms, I did not immediately understand what they meant. In the paper Deep Neural Network based Text-Dependent Speaker Recognition: Preliminary Results, the original text reads:
Another approach that makes use of a phonetic discriminant DNN for speaker verification is the so-called bottleneck or tandem features approach [8]. A DNN is trained in supervised mode using the triphone state labels as targets. Once the network is trained, deep features can be extracted for every speech frame of a recording. The dimension of the deep feature is usually kept the same as the spectral feature by means of a bottleneck layer. These features in turn are used to train a backend classifier like a GMM-UBM or PLDA. Tandem features are formed by combining the deep feature and the spectral feature corresponding to a given speech frame.
Tandem features have been successfully applied to both text-independent and text-dependent speaker verification [9, 10]. In the text-dependent case, the approach was tested on the multiple pass-phrase task. To our knowledge this work represents the smallest dataset used for training neural net models.

[8] Tianfan Fu, Yanmin Qian, Yuan Liu, and Kai Yu, "Tandem deep features for text-dependent speaker verification," in INTERSPEECH, 2014, pp. 1327–1331.
Bottleneck features, also called tandem features: the former name is descriptive. The input dimension is generally larger than the hidden-layer dimension, and the last hidden layer typically reduces it to the dimension we want.
[Figure: tandem-features]
The figure is from [8]. Below is my understanding of tandem features, explained with reference to this figure:
input layer: right context = 5 + left context = 5 + current frame = 1 = 11 frames
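The 11-frame input window can be sketched as a simple frame-splicing step. This is a minimal sketch under my own assumptions: 39-dimensional PLP frames, and edge frames handled by repeating the first/last frame (the paper does not specify the padding scheme).

```python
import numpy as np

def splice_frames(feats, left=5, right=5):
    """Stack each frame with its left/right context frames.
    Edge frames are padded by repeating the first/last frame
    (an assumption; the paper does not specify padding)."""
    T, D = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)], axis=0)
    # Each output row is the flattened (left + 1 + right)-frame window
    return np.stack([padded[t:t + left + 1 + right].reshape(-1)
                     for t in range(T)])

# e.g. 100 frames of 39-dim PLP -> 100 x (11 * 39) spliced DNN input
x = np.random.randn(100, 39)
X = splice_frames(x)
```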
hidden layer: if the goal were only feature extraction, we could simply pre-train and reduce the last hidden layer to the desired dimension, but in this paper:
All the deep models have 7 hidden layers with 1024 nodes per layer. The experiments show that taking the activations of the second or fourth hidden layer and applying PCA gives the best results. This part is the deep feature. It is then concatenated with the original current frame's 39-dimensional PLP feature to form a 78-dimensional tandem deep feature, which is fed into GMM-UBM training of the UBM model. Traditionally, the UBM is instead trained directly on PLP or MFCC + VAD features.
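The PCA-then-concatenate step described above can be sketched as follows. This is a sketch under my assumptions: PCA is implemented via SVD of the mean-centered activations, and the deep feature is reduced to 39 dimensions to match the PLP feature (the paper's exact PCA setup may differ).

```python
import numpy as np

def tandem_features(hidden_acts, plp, target_dim=39):
    """Project hidden-layer activations (e.g. 1024-dim, from the 2nd or
    4th hidden layer) down to target_dim with PCA, then concatenate with
    the spectral (PLP) feature of the same frame."""
    mu = hidden_acts.mean(axis=0)
    centered = hidden_acts - mu
    # PCA via SVD: rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    deep = centered @ Vt[:target_dim].T          # T x 39 deep feature
    return np.concatenate([deep, plp], axis=1)   # T x 78 tandem feature

acts = np.random.randn(200, 1024)   # activations of a 1024-node hidden layer
plp = np.random.randn(200, 39)      # 39-dim PLP of the same frames
tandem = tandem_features(acts, plp)
```

The resulting 78-dimensional tandem features would then be used to train the GMM-UBM backend in place of raw PLP/MFCC features.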
output layer: the number of speakers plus the number of phone states. For example, on the RSR2015 dataset, the bkg and dev sets are used to train the DNN; together they contain 194 speakers, so the speaker label should be a 194-dimensional one-hot vector. The phone label should correspond to the number of tied triphone states obtained by GMM alignment, here 3001. The original text (abridged):
The state alignment for the phone DNN training was generated using a GMM model with 3001 tied triphone-states, which is built on a 50-hour SWB English task, and 194 classes (194 speakers in the bkg and dev sets) are used in the speaker DNN training.
So with this structure, the output layer should have 194 + 3001 = 3195 nodes.
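The output-layer sizing can be checked with a small sketch. Splitting the joint output into two softmax heads, one per task, is my assumption of how such a multi-task phone + speaker output would be handled; the paper's exact loss setup may differ.

```python
import numpy as np

N_SPEAKERS, N_STATES = 194, 3001     # bkg+dev speakers, tied triphone states
output_dim = N_SPEAKERS + N_STATES   # 194 + 3001 = 3195 joint output nodes

def split_posteriors(logits):
    """Apply separate softmaxes to the speaker half and the phone half
    of the joint output layer (assumed multi-task head layout)."""
    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    spk, phone = logits[..., :N_SPEAKERS], logits[..., N_SPEAKERS:]
    return softmax(spk), softmax(phone)

logits = np.random.randn(8, output_dim)   # a batch of 8 frames
p_spk, p_phone = split_posteriors(logits)
```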

The figure above shows a joint phone + speaker DNN training model; of course, the phone DNN training model and the speaker DNN training model can also be used separately. As the experiments in [8] show: "Once the neural network training process is finished, the output layers of the two multi-task joint-learned DNNs can be removed, and the rest of each of the neural networks (common hidden layers) is used to extract the speaker-text joint representative features."
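Removing the output layer and tapping a hidden layer as a feature extractor can be sketched as below. The ReLU activation, the random placeholder weights, and the function name `extract_deep_features` are all my assumptions for illustration, not the paper's actual network.

```python
import numpy as np

def extract_deep_features(x, weights, biases, tap_layer):
    """Forward pass through the shared hidden layers only (output layer
    removed); return the activations at tap_layer as deep features.
    ReLU hidden units are an assumption for this sketch."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases), start=1):
        h = np.maximum(0.0, h @ W + b)
        if i == tap_layer:
            return h
    return h

rng = np.random.default_rng(0)
dims = [429] + [1024] * 7    # 11x39 spliced input, 7 hidden layers of 1024
Ws = [rng.standard_normal((dims[i], dims[i + 1])) * 0.01 for i in range(7)]
bs = [np.zeros(dims[i + 1]) for i in range(7)]
# tap the 2nd hidden layer, as the experiments above suggest works well
feat = extract_deep_features(rng.standard_normal((5, 429)), Ws, bs, tap_layer=2)
```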
