Acoustic Model Training (Acoustic Modeling)


General Framework for AM:


Building an ASR system incrementally
Context-independent ➔ Context-dependent modeling
Monophone ➔ Triphone HMM
Single Gaussian per state ➔ Multiple Gaussian mixture components per state

Data Preparation:



Acoustic Unit Selection:

Criteria
Accurate: accurately represent the acoustic realization that appears in different contexts
Trainable: have enough data to estimate the parameters of the unit
Generalizable: any new word can be derived from a predefined unit inventory for task-independent speech recognition


Units available

  • Word
  • Syllable
  • Initial/Final (Chinese-specific)
  • Phoneme

Unit             Inventory size
Word             60,000
Syllable         420 (1200+ with tone)
Initial/Final    60 (22+38)
Phoneme          39

Note: Microsoft designed a set of 97 tone-dependent phonemes


What are Phonemes and Phones:

Phoneme (phn)
Denotes the minimal unit of speech sound in a language
Serves to distinguish one word from another
Example: pat vs. bat


Phone
Denotes a phoneme's acoustic realization
Depends on gender, speech rate, context, accent, etc.
Example: "sat" and "meter" (the /t/ is realized as distinct phones)


Context-independent Modeling :





Context-dependent Modeling:

Why context-dependent?
Co-articulation
The process by which neighboring sounds influence one another is called co-articulation

Why triphone?
Only the immediately preceding and succeeding phonemes are taken into account
It is a compromise between performance, complexity, and available data
It is the most widely used acoustic unit for CD modeling


Some issues:

How to obtain a triphone-based transcription?

Start from the monophone-based transcription

Take the neighboring phonemes into account
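
The expansion from a monophone transcription to a triphone transcription can be sketched in a few lines of Python. This is only an illustration: the l-c+r label format follows the examples used later in these notes (such as sh-ang+sh), while the function name `to_triphones`, the `sil` boundary symbol, and the choice to leave silence context-independent are assumptions, not something the original slides specify.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a monophone sequence into l-c+r triphone labels.

    Assumption: the boundary symbol stays context-independent and is not
    used as a left or right context for its neighbours.
    """
    labels = []
    for i, center in enumerate(phones):
        if center == boundary:
            labels.append(center)
            continue
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i + 1 < len(phones) else None
        label = center
        if left is not None and left != boundary:
            label = f"{left}-{label}"      # attach the left context
        if right is not None and right != boundary:
            label = f"{label}+{right}"     # attach the right context
        labels.append(label)
    return labels


example = to_triphones(["sil", "sh", "ang", "sh", "sil"])
print(example)
```

Phones next to a boundary end up as biphones (for example sh+ang) under this convention; other toolkits may treat utterance and word boundaries differently.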

How to deal with data insufficiency?
The number of possible triphone combinations is large (up to N^3 for N phonemes)
Some with sufficient occurrences can be robustly modeled
Some with fewer occurrences may be poorly modeled
Some may never occur in the training data
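
To make the three cases concrete, the following sketch counts triphone occurrences in a toy transcription set and sorts them into well-covered, sparse, and unseen. The function name `triphone_coverage`, the `min_count` threshold, and the toy data are hypothetical; the sketch also assumes every label is a full l-c+r triphone, so that the N^3 upper bound applies.

```python
from collections import Counter


def triphone_coverage(transcriptions, n_phones, min_count=3):
    """Summarise how well the training data covers the triphone space."""
    counts = Counter(label for utt in transcriptions for label in utt)
    possible = n_phones ** 3                      # upper bound on distinct triphones
    robust = sorted(t for t, c in counts.items() if c >= min_count)
    sparse = sorted(t for t, c in counts.items() if c < min_count)
    unseen = possible - len(counts)               # combinations with zero occurrences
    return {"possible": possible, "robust": robust,
            "sparse": sparse, "unseen": unseen}


# Toy data with a two-phone inventory {a, b}.
toy = [["a-b+a", "b-a+b", "a-b+a"], ["a-b+a", "b-a+b"]]
report = triphone_coverage(toy, n_phones=2, min_count=3)
print(report)
```

In practice many of the N^3 combinations are phonotactically impossible, so the count of unseen triphones here is an upper bound rather than an exact figure.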


How to deal with data insufficiency?
Sharing (tying) strategy
Model-level sharing
State-level sharing
Mixture-level sharing
Transition matrix sharing
Mean/variance sharing
How to increase the number of Gaussian mixture components per state?
Mixture splitting, starting from a single Gaussian per state
Re-estimation iterations are necessary after each split


Transcription and HMM list:



The HMM list is composed of the unique triphones
At this stage, every triphone in the list appears in the training data


Sharing Strategies:

Transition Matrix Sharing
Assumption
The transition matrix plays a less significant role in performance
Solution
All transition matrices of triphones with an identical central unit are shared
Example (HTK HHEd-style command): TI T_ah {*-ah+*.transP}
That is, the number of transition matrices equals the number of monophones
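
The grouping behind transition-matrix tying can be sketched as follows. The helper names `center_phone` and `tie_transitions` and the toy HMM list are hypothetical; the macro naming T_<phone> simply mirrors the T_ah example above.

```python
from collections import defaultdict


def center_phone(name):
    """Return the central unit of an l-c+r label (also handles l-c, c+r, c)."""
    return name.split("-")[-1].split("+")[0]


def tie_transitions(hmm_list):
    """Group models so that all models sharing a central unit share one matrix."""
    tied = defaultdict(list)
    for name in hmm_list:
        tied[f"T_{center_phone(name)}"].append(name)
    return dict(tied)


groups = tie_transitions(["sh-ang+sh", "b-ang+s", "ang-sh", "sh+ang", "t-ah+p"])
for macro, members in groups.items():
    print(macro, members)
```

The number of groups equals the number of distinct central units, which is why the tied system ends up with one transition matrix per monophone.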

State Sharing
Classification method
Decision tree: a classification method
Goal: merge the similar states of related models while keeping the dissimilar states distinct
Solution
Phonetic-related
State-dependent
Language-dependent
Question set
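
One common way to choose which question splits a node is to pick the one with the largest log-likelihood gain when each side is modeled by a single shared Gaussian. The sketch below assumes that criterion, uses one-dimensional features for brevity, and invents the question names, phone classes, and sample values; the original notes do not state which criterion or question set was used.

```python
import math


def accumulate(samples):
    """Sufficient statistics (count, sum, sum of squares) for one state."""
    return (len(samples), sum(samples), sum(x * x for x in samples))


def pooled_loglik(stats_list):
    """Log-likelihood of the pooled data under one ML-fitted 1-D Gaussian."""
    n = sum(s[0] for s in stats_list)
    sx = sum(s[1] for s in stats_list)
    sxx = sum(s[2] for s in stats_list)
    mean = sx / n
    var = max(sxx / n - mean * mean, 1e-6)   # floor to avoid log(0)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)


def best_question(states, questions):
    """Return (question name, gain) for the split with the largest gain."""
    parent = pooled_loglik(list(states.values()))
    best = None
    for qname, pred in questions.items():
        yes = [s for t, s in states.items() if pred(t)]
        no = [s for t, s in states.items() if not pred(t)]
        if not yes or not no:
            continue                          # question does not split the node
        gain = pooled_loglik(yes) + pooled_loglik(no) - parent
        if best is None or gain > best[1]:
            best = (qname, gain)
    return best


# Toy statistics for one state of the centre phone "ang" in four contexts.
states = {
    "sh-ang+sh": accumulate([1.0, 1.2, 0.8]),
    "s-ang+sh": accumulate([1.1, 0.9]),
    "b-ang+m": accumulate([3.0, 3.2]),
    "p-ang+n": accumulate([2.9, 3.1, 3.0]),
}
questions = {
    "L_Fricative": lambda t: t.split("-")[0] in {"s", "sh", "f"},
    "L_is_sh": lambda t: t.split("-")[0] == "sh",
}
winner = best_question(states, questions)
print(winner[0])
```

A real system would apply this recursively, stop when the gain or the state occupancy falls below a threshold, and use full feature vectors instead of scalars.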


Decision-tree based state clustering:






Model Sharing
Handles triphone combinations that never occur in the training data
Descend the previously constructed trees for that phone, answering the question at each node based on the new, unseen context
Example: sh-ang+s (not existing) ➔ sh-ang+sh (existing)
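
The tree descent for an unseen context can be sketched as below. Everything here is a toy: the phone classes, the three per-state trees for "ang", the tied-state names, and the state numbering 2 to 4 are assumptions chosen so that the sh-ang+s example resolves to the same tied states as sh-ang+sh.

```python
FRICATIVES = {"s", "sh", "f", "h"}   # toy phone class, assumed for illustration
NASALS = {"m", "n", "ng"}            # toy phone class, assumed for illustration


class Node:
    """Either an internal node with a question or a leaf with a tied-state id."""

    def __init__(self, question=None, yes=None, no=None, leaf=None):
        self.question, self.yes, self.no, self.leaf = question, yes, no, leaf


def descend(node, left, right):
    """Answer the question at each node until a leaf is reached."""
    while node.leaf is None:
        node = node.yes if node.question(left, right) else node.no
    return node.leaf


# One hypothetical tree per emitting state of the centre phone "ang".
ANG_TREES = {
    2: Node(lambda l, r: l in FRICATIVES, Node(leaf="ang_s2_1"), Node(leaf="ang_s2_2")),
    3: Node(lambda l, r: r in NASALS, Node(leaf="ang_s3_1"), Node(leaf="ang_s3_2")),
    4: Node(lambda l, r: r in FRICATIVES, Node(leaf="ang_s4_1"), Node(leaf="ang_s4_2")),
}


def synthesize(trees, left, right):
    """Build the tied-state sequence for a (possibly unseen) context."""
    return tuple(descend(trees[s], left, right) for s in sorted(trees))


unseen = synthesize(ANG_TREES, "sh", "s")    # sh-ang+s, not in training data
seen = synthesize(ANG_TREES, "sh", "sh")     # sh-ang+sh, in training data
print(unseen, unseen == seen)
```

When every state of the unseen triphone lands on the same leaves as an existing triphone, the two models are identical and the unseen one can simply share the existing model.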

Mixture Increment:

LVCSR systems typically use context-dependent HMMs with multiple mixture components
Until now, triphone HMMs with a single Gaussian per state were used in the different sharing strategies
A mechanism called mixture splitting can increase the number of mixture components within a state
The approach is extremely flexible, since it allows the number of mixture components to be increased repeatedly until the desired performance is achieved
Some re-estimation iterations are necessary after mixture splitting is performed
For triphone HMMs, 8 to 14 Gaussian mixture components per state is ideal
Too many Gaussian mixture components in a state may lead to data insufficiency
Autotrain uses 12 (8 to 14) Gaussian mixture components per state on average
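
A minimal sketch of mixture splitting, assuming one-dimensional Gaussians and a simple rule: split the heaviest component, halve its weight, and shift the two copies by plus and minus 0.2 standard deviations. The 0.2 factor, the dictionary layout, and the function names are assumptions for illustration; the re-estimation step that must follow each split is only indicated by a comment.

```python
import math


def split_heaviest(mixture, perturb=0.2):
    """Split the component with the largest weight into two perturbed copies."""
    idx = max(range(len(mixture)), key=lambda i: mixture[i]["weight"])
    comp = mixture[idx]
    offset = perturb * math.sqrt(comp["var"])   # shift by a fraction of the std dev
    half = comp["weight"] / 2.0
    low = {"weight": half, "mean": comp["mean"] - offset, "var": comp["var"]}
    high = {"weight": half, "mean": comp["mean"] + offset, "var": comp["var"]}
    return mixture[:idx] + [low, high] + mixture[idx + 1:]


def grow_to(mixture, target):
    """Repeat splitting until the state has `target` components."""
    while len(mixture) < target:
        mixture = split_heaviest(mixture)
        # In real training, several re-estimation passes would run here
        # before the next split.
    return mixture


state = grow_to([{"weight": 1.0, "mean": 0.0, "var": 4.0}], target=4)
print(len(state), sum(c["weight"] for c in state))
```

Growing the mixture one step at a time, with re-estimation in between, lets training stop as soon as the desired performance is reached or the data per component becomes too thin.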

Flowchart for CD Modeling: