Training Acoustic Model For CMUSphinx

Introduction


The CMUSphinx project comes with several high-quality acoustic models: US English models for microphone and broadcast speech, a model for telephone speech, and French and Chinese models trained on a huge amount of acoustic data. These models were carefully optimized to achieve the best recognition performance and work well for almost all applications; we have spent years making them as good as they are. Most command-and-control applications, as well as large vocabulary applications, can use them directly.

 

Besides models, CMUSphinx provides ways to adapt them, which is sufficient in most cases where more accuracy is required. Adaptation is known to work well when you are using a different recording environment (close or far microphone, or a telephone channel), when a slightly different accent is present (UK English or even Indian English), or even for another language. Adaptation, for example, works well if you need to quickly add support for a new language just by mapping the acoustic model phoneset to the target phoneset with the dictionary.

 

There are, however, applications where the current models won't work, for example handwriting recognition or dictation support for another language. In these cases you will need to train your own model, and this tutorial will show you how to do that for the CMUSphinx speech recognition engine. Before starting with training, make sure you are familiar with the basic concepts, have prepared the language model, have confirmed that you indeed need to train a model, and have the resources to do that.

 

When you need to train

  • You want to create an acoustic model for a new language/dialect
  • OR you need a specialized model for a small vocabulary application
  • AND you have plenty of data to train on:
    • 1 hour of recordings for single-speaker command and control
    • 5 hours of recordings of 200 speakers for multi-speaker command and control
    • 10 hours of recordings for single-speaker dictation
    • 50 hours of recordings of 200 speakers for multi-speaker dictation
  • AND you have knowledge of the phonetic structure of the language
  • AND you have time to train the model and optimize the parameters (about 1 month)

When you don't need to train

  • You need to improve accuracy - do acoustic model adaptation instead
  • You don't have enough data - do acoustic model adaptation instead
  • You don't have enough time
  • You don't have enough experience

 

Data preparation

The trainer learns the parameters of the models of the sound units using a set of sample speech signals. This is called a training database. A choice of already trained databases will also be provided to you.

The database contains the information required to extract statistics from the speech in the form of the acoustic model.

 

The trainer needs to be told which sound units you want it to learn the parameters of, and at least the sequence in which they occur in every speech signal in your training database. This information is provided to the trainer through a file called the transcript file, in which the sequence of words and non-speech sounds is written exactly as it occurred in the speech signal, followed by a tag which can be used to associate this sequence with the corresponding speech signal.

 

The trainer then looks into a dictionary which maps every word to a sequence of sound units, to derive the sequence of sound units associated with each signal.


Thus, in addition to the speech signals, you will also be given a set of transcripts for the database (in a single file) and two dictionaries: one in which legitimate words in the language are mapped to sequences of sound units (or sub-word units), and another in which non-speech sounds are mapped to corresponding non-speech or speech-like sound units. We will refer to the former as the language dictionary and the latter as the filler dictionary.

 

After training, it's mandatory to run the decoder to check the training results. The decoder takes a model, tests it on the test part of the database against the reference transcriptions, and estimates the quality (WER) of the model. During the testing stage we use a language model describing the order of words in the language.

 

First of all, you need to design a database for training or download an existing one. For example, you can purchase a database from LDC. You'll have to convert it to a proper format.


A database should be a good representation of the speech you are going to recognize. For example, if you are going to recognize telephone speech, it's preferred to use telephone recordings; if you want mobile speech, you should find mobile recordings. Speech is significantly different across recording channels: broadcast news is different from telephone, and speech decoded from mp3 is significantly different from a microphone recording. However, if you do not have enough speech recorded under the required conditions, you should definitely use other speech you have, for example broadcast recordings. Sometimes it makes sense to stream it through a telephone codec to make the audio similar. It's often possible to add noise to the training data too.

 

A database should have recordings from enough speakers, a variety of recording conditions, enough acoustic variation, and all possible linguistic sentences. The size of the database depends on the complexity of the task you want to handle, as mentioned above. A database should have the two parts mentioned above - a training part and a test part. Usually the test part is about 1/10th of the full data size, but we don't recommend test data of more than 4 hours of recordings.
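If you already have a single list of utterance ids, one way to carve out a test part along these lines is a simple shell split; a minimal sketch, assuming bash and a hypothetical all.fileids file listing one utterance id per line:

   # Shuffle, then take ~1/10th for testing and the rest for training.
   shuf all.fileids > shuffled.fileids
   n=$(wc -l < shuffled.fileids)
   head -n $((n / 10)) shuffled.fileids > etc/your_db_test.fileids
   tail -n +$((n / 10 + 1)) shuffled.fileids > etc/your_db_train.fileids

Keep in mind the guideline above: cap the test part at about 4 hours of recordings regardless of the 1/10th rule.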

 

The good ways to obtain a database for a new language are:

  • Manually segment audio recordings with existing transcriptions (podcasts, news, etc.)
  • Record your friends, family and colleagues
  • Set up automated collection on Voxforge

You have to design the database prompts and postprocess the results to ensure that the audio actually corresponds to the prompts. The file structure for the database is:

  • etc
    • your_db.dic - Phonetic dictionary
    • your_db.phone - Phoneset file
    • your_db.lm.DMP - Language model
    • your_db.filler - List of fillers
    • your_db_train.fileids - List of files for training
    • your_db_train.transcription - Transcription for training
    • your_db_test.fileids - List of files for testing
    • your_db_test.transcription - Transcription for testing
  • wav
    • speaker_1
      • file_1.wav - Recording of speech utterance
    • speaker_2
      • file_2.wav

Let's go through the files and describe their format and the way to prepare them:


The fileids files (your_db_train.fileids and your_db_test.fileids) are text files listing the names of the recordings (utterance ids), one per line, for example:

   speaker_1/file_1
   speaker_2/file_2
 

The fileids file contains paths relative to the wav directory of your database. Note that the fileids file should have no extensions for the audio files, just the names.

 

The transcription files (your_db_train.transcription and your_db_test.transcription) are text files listing the transcription for each audio file:

 

   <s> hello world </s> (file_1)
   <s> foo bar </s> (file_2)
 

It's important that each line starts with <s> and ends with </s>, followed by the utterance id in parentheses. Also note that the parentheses contain only the file name, without the speaker_n directory. It's critical to have an exact match between the fileids file and the transcription file: the number of lines in both should be identical, and the last part of each file id (speaker_1/file_1) must be the same as the utterance id (file_1) on each line.
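A quick way to verify that match is to compare the two files side by side; a minimal sketch, assuming bash and the etc/ layout described above:

   # Compare the basename of each fileids entry with the utterance id
   # in parentheses on the corresponding transcription line.
   paste <(sed 's:.*/::' etc/your_db_train.fileids) \
         <(sed 's/.*(\(.*\)).*/\1/' etc/your_db_train.transcription) |
   awk '$1 != $2 { print "mismatch on line " NR ": " $0 }'

An empty output means the ids line up; any printed line points at a reordered or missing entry.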

 

Below is an example of a *wrongly* created fileids file for the above transcription file. If you follow it, you will get an error during training:

 

   speaker_2/file_2
   speaker_1/file_1
   // Error! Do not create the fileids file like this!
 

Speech recordings (wav files). Recording files must be in MS WAV format with a specific sample rate: 16 kHz, 16 bit, mono for desktop applications; 8 kHz, 16 bit, mono for telephone applications. Double-check this: a wrong audio file format is the most common source of training issues. Audio files shouldn't be very long and shouldn't be very short; the optimal length is not less than 5 seconds and not more than 30 seconds.

 

It's critical to have audio files in the specific format. Sphinxtrain does support some variety of sample rates, but by default it's configured to train from 16 kHz, 16 bit, mono files in MS WAV format. YOU NEED TO MAKE SURE THAT YOUR RECORDINGS ARE AT A SAMPLING RATE OF 16 KHZ (or 8 kHz if you train a telephone model) IN MONO WITH A SINGLE CHANNEL.
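If your source audio is in another format or sample rate, a tool like sox can downsample and convert it; a minimal sketch with placeholder file names (remember that upsampling, as noted below, is not an option):

   # Convert to 16 kHz, 16 bit, mono MS WAV (use -r 8000 for telephone models).
   sox source_recording.flac -r 16000 -b 16 -c 1 wav/speaker_1/file_1.wav
   # Verify the resulting rate and channel count.
   soxi wav/speaker_1/file_1.wav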

 

If you train an 8 kHz model, you need to make sure you configured the feature extraction properly.

 

Please note that you CAN NOT UPSAMPLE your audio; that means you can not train a 16 kHz model with 8 kHz data.

Audio format mismatch is the most common training problem.

 

The phonetic dictionary (your_db.dic) should have one line per word, with the word followed by its phonetic transcription:

HELLO HH AH L OW
WORLD W AO R L D
 

If you need a phonetic dictionary, read Wikipedia or a book on phonetics, or use an existing phonetic dictionary. Do not use case-sensitive variants like “e” and “E”; all your phones must be distinct even under case-insensitive comparison. Sphinxtrain doesn't support some special characters like '*' or '/', and supports most others like '+', '-' or ':', but to be safe we recommend you use an alphanumeric-only phoneset.

 

Replace special characters in the phoneset, like colons, dashes or tildes, with something alphanumeric. For example, replace “a~” with “aa” to make it alphanumeric only. Nowadays even cell phones have gigabytes of memory on board; there is no sense in trying to save space with cryptic special characters.

 

There is one very important thing here. For a large vocabulary database, the phonetic representation is more or less known: it's the simple phones described in any phonetics book. If you don't have a phonetics book, you can just use the word's spelling, and it gives very good results:

ONE O N E
TWO T W O
 

For small vocabularies, CMUSphinx is different from other toolkits. It's often recommended to train word-based models for small vocabulary databases like digits, but that only makes sense if your HMMs can have variable length. CMUSphinx does not support word models; instead, you need to use a word-dependent phone dictionary:

ONE W_ONE AH_ONE N_ONE
TWO T_TWO UH_TWO
NINE N_NINE AY_NINE N_END_NINE
 

This is actually equivalent to word-based models, and sometimes it even gives better accuracy. Do not use word-based models with CMUSphinx.

 

The phoneset file (your_db.phone) should have one phone per line. The set of phones must match the phones used in the dictionary, plus the special SIL phone for silence:

AH
AX
DH
IX
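Since every phone in the dictionary must appear in this file, a quick cross-check helps; a minimal sketch, assuming the etc/ layout above and the dictionary format of word followed by phones:

   # Print dictionary phones that are missing from the phoneset file.
   awk 'NR == FNR { phones[$1]; next }
        { for (i = 2; i <= NF; i++)
            if (!($i in phones))
                print "not in phoneset: " $i " (word: " $1 ")" }' \
       etc/your_db.phone etc/your_db.dic | sort -u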

The language model file (your_db.lm.DMP) should be in ARPA format or in DMP format. Find out more about language models in the Language Model training chapter.
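If you have the model only in ARPA format, the sphinx_lm_convert tool from sphinxbase can produce the DMP form; a typical invocation (file names are placeholders):

   sphinx_lm_convert -i your_db.lm -o your_db.lm.DMP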

 

The filler dictionary (your_db.filler) contains filler phones: non-linguistic sounds like breath, hmm or laugh that are not covered by the language model. It can contain just silences:

<s> SIL
</s> SIL
<sil> SIL

Or filler phones if they are present in the db transcriptions:


+um+ ++um++
+noise+ ++noise++
 

For example, you can download the an4 database for training and use it in the following steps. If you want to play with a larger example, download the Voxforge English acoustic database; it's about 50 hours of audio recordings now.

 

Compilation of the required packages

The following packages are required for training:

  • sphinxbase-0.8
  • SphinxTrain-0.8
  • pocketsphinx-0.8

The following external packages are also required:

  • perl, for example ActivePerl on Windows
  • python, for example ActivePython on Windows

In addition, if you download packages with a .gz suffix, you will need gunzip or the equivalent to unpack them.

Install the perl and python packages somewhere in your executable path, if they are not already there.


We recommend that you train on Linux: this way you'll be able to use all the features of Sphinxtrain. You can also use a Windows system for training; in that case we recommend using ActivePerl.

 

For download instructions, see the Download page. Basically you need to put everything into a single root folder, unzip and untar the packages, and run configure, make and make install in each package folder. Put the database folder into this root folder as well. By the time you finish this, you will have a tutorial directory with the following contents:

  tutorial
    an4
    an4_sphere.tar.gz
    SphinxTrain
    SphinxTrain.tar.gz
    pocketsphinx
    pocketsphinx.tar.gz
    sphinxbase
    sphinxbase.tar.gz
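One possible build sequence from this root folder is sketched below; it assumes a system-wide install that requires root, and builds sphinxbase first since the other packages depend on it:

   for pkg in sphinxbase SphinxTrain pocketsphinx; do
       (cd "$pkg" && ./configure && make && sudo make install) || break
   done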
 

You will need to install the software as an administrator (root). After you have installed the software, you may need to update the system configuration so that the system can find the dynamic libraries. For example:

export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
 

If you don't want to install into the system path, you may install into your home folder. In that case you can append the following option to the autogen.sh script or to the configure script:

--prefix=/home/user/local

Obviously the folder can be an arbitrary folder, but remember to update the environment configuration after that. If you find that your binaries fail to load dynamic libraries, with something like failed to open libsphinx.so.0: no such file or directory, it means that you didn't configure the environment properly. If you ask a question on the forum about that failed to open error, I will ban you within a minute.

Setting up the training scripts


To start the training, change to the database folder and run the following commands:

 

On Linux

sphinxtrain -t an4 setup

On Windows

python ../sphinxtrain/scripts/sphinxtrain -t an4 setup
 

Do not forget to replace an4 with your task name.


This will copy all the required configuration files into the etc subfolder of your database folder and prepare the database for training. The structure will be:

  etc
  feat
  logdir
  model_parameters
  model_architecture
  wav
 

After the setup step only two of the folders above will be present; the others will be created during the training process:

  etc
  wav
 

After that, we need to edit the configuration files in the etc folder. There are many variables, but to get started we need to change only a few. First of all, find the file etc/sphinx_train.cfg.

Set up the format of the database audio

$CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
$CFG_WAVFILE_EXTENSION = 'sph';
$CFG_WAVFILE_TYPE = 'nist'; # one of nist, mswav, raw
 

If you recorded MS WAV, change sph to wav here and nist to mswav:

$CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
$CFG_WAVFILE_EXTENSION = 'wav';
$CFG_WAVFILE_TYPE = 'mswav'; # one of nist, mswav, raw
 

Configure path to files

See the following lines in your etc/sphinx_train.cfg file:


# Variables used in main training of models
$CFG_DICTIONARY     = "$CFG_LIST_DIR/$CFG_DB_NAME.dic";
$CFG_RAWPHONEFILE   = "$CFG_LIST_DIR/$CFG_DB_NAME.phone";
$CFG_FILLERDICT     = "$CFG_LIST_DIR/$CFG_DB_NAME.filler";
$CFG_LISTOFFILES    = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.fileids";
$CFG_TRANSCRIPTFILE = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.transcription";
 

These values will already be set if you set up the file structure as described earlier, but make sure that your files are really named this way.

The $CFG_LIST_DIR variable is the etc directory in your project, and the $CFG_DB_NAME variable is the name of your project itself.

Configure model type and model parameters


 

$CFG_HMM_TYPE = '.cont.'; # Sphinx 4, PocketSphinx
#$CFG_HMM_TYPE = '.semi.'; # PocketSphinx
#$CFG_HMM_TYPE = '.ptm.'; # PocketSphinx (larger data sets)

Just uncomment what you need. For resource-efficient applications use semi-continuous models; for Sphinx4 use continuous models.

 

 $CFG_FINAL_NUM_DENSITIES = 8;

If you are training continuous models for a large vocabulary and have more than 100 hours of data, put 32 here. It can be any power of 2: 4, 8, 16, 32, 64.

 

# Number of tied states (senones) to create in decision-tree clustering
$CFG_N_TIED_STATES = 1000;

This value is the number of senones to train in the model. The more senones a model has, the more precisely it discriminates the sounds. But on the other hand, if you have too many senones, the model will not be generic enough to recognize unseen speech, which means the WER will be higher on unseen data. That's why it is important not to overtrain the models. If there are too many unseen senones, warnings like the following will be generated in the norm log at stage 50:

ERROR: "gauden.c", line 1700: Variance (mgau= 948, feat= 0, density=3,
component=38) is less then 0. Most probably the number of senones is too
high for such a small training database. Use smaller $CFG_N_TIED_STATES.
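One quick way to scan for this condition after stage 50 (a sketch; the exact file layout under logdir varies between versions):

   grep -ri "use smaller" logdir/ | head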
 

Approximate numbers of senones and densities are provided in the table below:

Vocabulary  Hours_in_db  Senones  Densities  Example
20          5            200      8          Tidigits Digits Recognition
100         20           2000     8          RM1 Command and Control
5000        30           4000     16         WSJ1 5k Small Dictation
20000       80           4000     32         WSJ1 20k Big Dictation
60000       200          6000     16         HUB4 Broadcast News
60000       2000         12000    64         Fisher Rich Telephone Transcription

 

Of course, you also need to understand that only senones present in the transcriptions can be trained. This means that if your transcriptions aren't generic enough, for example the same single word spoken by 10000 speakers 10000 times, you still have just a few senones, no matter how many hours of speech you recorded. In that case you need just a few senones in the model, not a few thousand of them.

 

It might seem that diversity could improve the model, but that's not the case. Diverse speech requires somewhat artificial speech prompts, which decreases the naturalness of the speech, and artificial models don't help in real-life decoding. In order to build the best database you need to try to reproduce the real environment as much as possible. It's even better to collect more speech and try to optimize the database size.

 

It's important to remember that the optimal numbers depend on your database. To train a model properly, you need to experiment with different values and try to select the ones which give the best WER on a development set. At a minimum you can experiment with the number of senones and the number of Gaussian mixtures; sometimes it's also worth experimenting with the phoneset or the number of estimation iterations.

 

Configure sound feature parameters


The default sample rate for sound files used in Sphinx is 16 thousand samples per second (16 kHz). If this is the case, the etc/feat.params file will be automatically generated with the recommended values.

 

If you are using sound files with a sampling rate of 8 kHz (telephone audio), you need to change some values in etc/sphinx_train.cfg. The lower sampling rate also means a change in the sound frequency ranges and in the number of filters used to recognize speech. Recommended values are:

# Feature extraction parameters
$CFG_WAVFILE_SRATE = 8000.0;
$CFG_NUM_FILT = 31; # For wideband speech it's 40, for telephone 8khz reasonable value is 31
$CFG_LO_FILT = 200; # For telephone 8kHz speech value is 200
$CFG_HI_FILT = 3500; # For telephone 8kHz speech value is 3500
 

Configure decoding parameters


Open etc/sphinx_train.cfg, make sure the following is properly configured:


$DEC_CFG_DICTIONARY     = "$DEC_CFG_BASE_DIR/etc/$DEC_CFG_DB_NAME.dic";
$DEC_CFG_FILLERDICT     = "$DEC_CFG_BASE_DIR/etc/$DEC_CFG_DB_NAME.filler";
$DEC_CFG_LISTOFFILES    = "$DEC_CFG_BASE_DIR/etc/${DEC_CFG_DB_NAME}_test.fileids";
$DEC_CFG_TRANSCRIPTFILE = "$DEC_CFG_BASE_DIR/etc/${DEC_CFG_DB_NAME}_test.transcription";
$DEC_CFG_RESULT_DIR     = "$DEC_CFG_BASE_DIR/result";

# These variables, used by the decoder, have to be user defined, and
# may affect the decoder output
$DEC_CFG_LANGUAGEMODEL_DIR = "$DEC_CFG_BASE_DIR/etc";
$DEC_CFG_LANGUAGEMODEL  = "$DEC_CFG_LANGUAGEMODEL_DIR/an4.lm.DMP";
 

If you are training with an4, please make sure that you change an4.lm.DMP to an4.ug.lm.DMP, since the name of the language model is different in the an4 database.

If everything is ok, you can proceed to training.


 

Training


First of all, go to the database directory:


cd an4

To train, just run the following commands:


 

On Linux

sphinxtrain run

On Windows

python ../sphinxtrain/scripts/sphinxtrain run
 

and it will go through all the required stages. It will take a few minutes to train. On large databases, training could take a month.


 

Of all the stages, the most important is the first one, which checks that everything is configured correctly and that your input data is consistent. Do not ignore the errors reported in this first 00.verify_all step.

The typical output during training will look like:

        Baum welch starting for 2 Gaussian(s), iteration: 3 (1 of 1)
        0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
        Normalization for iteration: 3
        Current Overall Likelihood Per Frame = 30.6558644286942
        Convergence Ratio = 0.633864444461992
        Baum welch starting for 2 Gaussian(s), iteration: 4 (1 of 1)
        0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
        Normalization for iteration: 4

These scripts run through all the steps required to train the model. Once they finish, training is complete.

Training Internals


This section describes what happens during training. In the scripts directory (./scripts_pl), there are several directories numbered sequentially from 00* through 99*. Each directory contains either a script named slave*.pl or a single file with the extension .pl. The training sequentially goes through the directories and executes either the slave*.pl script or the single .pl file, as below:

perl scripts_pl/000.comp_feat/slave_feat.pl
perl scripts_pl/00.verify/verify_all.pl
perl scripts_pl/10.vector_quantize/slave.VQ.pl
perl scripts_pl/20.ci_hmm/slave_convg.pl
perl scripts_pl/30.cd_hmm_untied/slave_convg.pl
perl scripts_pl/40.buildtrees/slave.treebuilder.pl
perl scripts_pl/45.prunetree/slave-state-tying.pl
perl scripts_pl/50.cd_hmm_tied/slave_convg.pl
perl scripts_pl/90.deleted_interpolation/deleted_interpolation.pl
 

The scripts launch jobs on your machine, and each job will take a few minutes to run through.

Before you run any script, note the directory contents of your current directory. After you run each slave*.pl note the contents again. Several new directories will have been created. These directories contain files which are being generated in the course of your training. At this point you need not know about the contents of these directories, though some of the directory names may be self explanatory and you may explore them if you are curious.


One of the files that appears in your current directory is an .html file, named an4.html or similar, depending on which database you are using. This file contains a status report of the jobs already executed. Verify that the job you launched completed successfully; only then launch the next slave*.pl in the specified sequence. Repeat this process until you have run the slave*.pl in all directories.

Note that in the process of going through the scripts in 00* through 90*, you will have generated several sets of acoustic models, each of which could be used for recognition. Notice also that some of the steps are required only for the creation of semi-continuous models. If you execute these steps while creating continuous models, the scripts will benignly do nothing.


In the stage 000.comp_feat, the feature files are extracted. The system does not directly work with acoustic signals; the signals are first transformed into a sequence of feature vectors, which are used in place of the actual acoustic signals.

The script make_feats.pl will compute, for each training utterance, a sequence of 13-dimensional vectors (feature vectors) consisting of the Mel-frequency cepstral coefficients (MFCCs). Note that the list of wav files contains the paths to the audio files; since the data are all located in the directory you are working in, the paths are relative, not absolute. You may have to change this, as well as the an4_test.fileids file, if the location of the data is different. The MFCCs will be placed automatically in a directory called feat. Note that, outside of this tutorial, the type of feature vectors you compute from the speech signals for training and recognition is not restricted to MFCCs; you could use any reasonable parameterization technique instead and compute features other than MFCCs. CMUSphinx can use features of any type or dimensionality. The format of the features is described on the MFC Format page.
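For reference, the same kind of feature extraction can be run by hand with the sphinx_fe tool from sphinxbase; a sketch for a 16 kHz MS WAV database laid out as above (for an4's NIST sphere files you would use -nist yes and the sph extension instead):

   sphinx_fe -argfile etc/feat.params -samprate 16000 \
             -c etc/your_db_train.fileids \
             -di wav -ei wav -do feat -eo mfc -mswav yes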

Once the jobs launched from 20.ci_hmm have run to completion, you will have trained the Context-Independent (CI) models for the sub-word units in your dictionary.


When the jobs launched from the 30.cd_hmm_untied directory run to completion, you will have trained models for Context-Dependent sub-word units (triphones) with untied states. These are called CD-untied models and are necessary for building decision trees in order to tie states.

The jobs in 40.buildtrees will build decision trees for each state of each sub-word unit.


The jobs in 45.prunetree will prune the decision trees and tie the states.


Following this, the jobs in 50.cd_hmm_tied will train the final models for the triphones in your training corpus. These are called CD-tied models. The CD-tied models are trained in many stages: we begin with 1 Gaussian per state HMMs, then train 2 Gaussian per state HMMs, and so on, until the desired number of Gaussians per state has been trained. The jobs in 50.cd_hmm_tied will automatically train all these intermediate CD-tied models.

At the end of any stage you may use the models for recognition. Remember that you may decode even while the training is in progress, provided you are certain that the stage which generates the models you want to decode with has completed.

Transformation Matrix Training (advanced)

Some additional scripts will be launched if you choose to run them. Transformation matrices can help the training and recognition process in some circumstances.

The following steps will run if you specify $CFG_LDA_MLLT = 'yes'; in the file sphinx_train.cfg. If you specify 'no', the default, the steps will do nothing.


perl scripts_pl/60.lattice_generation/slave_genlat.pl
perl scripts_pl/61.lattice_pruning/slave_prune.pl
perl scripts_pl/62.lattice_conversion/slave_conv.pl
 

The perl scripts, in turn, set up and run python modules. The end product of these steps is a file, feature_transform, in your model_parameters directory.

These additional training steps can be computationally costly, but they improve recognition in models such as those with 8 kHz sampling rates.

Finally, one more step will run if you specify MMIE training with $CFG_MMIE = 'yes';. The default is 'no'.

perl scripts_pl/65.mmie_train/slave_convg.pl
 

Testing


It's critical to test the quality of the trained model in order to select the best parameters, understand how your application performs, and optimize performance. To do that, a test decoding step is needed; the decoding is the last stage of the training process.

 

You can restart decoding with the following command:


sphinxtrain -s decode run
 

This command will start a decoding process using the acoustic model you trained and the language model you configured in the etc/sphinx_train.cfg file.


MODULE: DECODE Decoding using models previously trained
        Decoding 130 segments starting at 0 (part 1 of 1)
        0%
 

When the recognition job is complete, the script computes the recognition Word Error Rate (WER) and Sentence Error Rate (SER). The lower these rates, the better. For a typical 10-hour task the WER should be around 10%; for a large task, it could be around 30%.

 

On an4 data you should get something like:


        SENTENCE ERROR: 70.8% (92/130)   WORD ERROR RATE: 30.3% (233/773)
 

You can find the exact details of the decoding, like the alignment with the reference transcription, the speed, and the result for each file, in the result folder, which is created after decoding. Look into the file an4.align:

p   I   T      t   s   b   u   r   g   H      (MMXG-CEN5-MMXG-B)
p   R   EIGHTY t   s   b   u   r   g   EIGHT  (MMXG-CEN5-MMXG-B)
Words: 10 Correct: 7 Errors: 3 Percent correct = 70.00% Error = 30.00% Accuracy = 70.00%
Insertions: 0 Deletions: 0 Substitutions: 3
october twenty four nineteen seventy  (MMXG-CEN8-MMXG-B)
october twenty four nineteen seventy  (MMXG-CEN8-MMXG-B)
Words: 5 Correct: 5 Errors: 0 Percent correct = 100.00% Error = 0.00% Accuracy = 100.00%
Insertions: 0 Deletions: 0 Substitutions: 0
TOTAL Words: 773 Correct: 587 Errors: 234
TOTAL Percent correct = 75.94% Error = 30.27% Accuracy = 69.73%
TOTAL Insertions: 48 Deletions: 15 Substitutions: 171
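As a sanity check, the totals above follow the standard definitions: errors are substitutions plus deletions plus insertions, divided by the number of reference words.

   errors   = S + D + I = 171 + 15 + 48 = 234
   WER      = 234 / 773 = 30.27%
   correct  = 587 / 773 = 75.94%   (ignores insertions)
   accuracy = 100% - WER = 69.73%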

For a description of WER, see our Basic Concepts of Speech chapter.

 

Using the model


After training, the acoustic model is located in


model_parameters/<your_db_name>.cd_cont_<number_of senones>
 

or in


model_parameters/<your_db_name>.cd_semi_<number_of senones>
 

You need only that folder. The model should have the following files:


mdef
feat.params
mixture_weights
means
noisedict
transition_matrices
variances
 

The exact set depends on the type of the model you trained. To use the model in PocketSphinx, simply point to it with the -hmm option:

pocketsphinx_continuous -hmm <your_new_model_folder> -lm your_lm -dict your_dict
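For a quick test on a single recording you can also decode offline with the -infile option; a sketch with placeholder paths (the wav must match the model's sample rate):

   pocketsphinx_continuous -hmm model_parameters/your_db.cd_cont_1000 \
       -lm etc/your_db.lm.DMP -dict etc/your_db.dic -infile test.wav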
 

To use the trained model in sphinx4, read the documentation on Using SphinxTrain models in sphinx4.


 

Troubleshooting


Troubleshooting is not rocket science. For all issues you may blame yourself: you are the most likely reason for the failure. Carefully read the messages in the logdir folder, which contains a detailed log of the actions performed at each step. In addition, the messages are copied to the file your_project_name.html, which you can read in a browser.

There are many well-proven methods to track down issues. For example, try cutting the training set in half to see in which half the problem appears.

Here are some common problems:


WARNING: this phone (something) appears in the dictionary (dictionary file name), but not in the phone list (phone file name).

Your dictionary either contains a mistake, or you have left out a phone symbol in the phone file. You may have to delete comment lines from your dictionary file.

 

WARNING: This word (word) has duplicate entries in (dictionary file name). Check for duplicates.

You may have to sort your dictionary file lines to find them. Perhaps a word is defined in both upper and lower case forms.

 

WARNING: This word: word was in the transcript file, but is not in the dictionary (transcript line) Do cases match?

Make sure that all the words in the transcript are in the dictionary, and that they match case when they appear. Also, words in the transcript may be misspelled, run together or be a number or symbol not in the dictionary. If the dictionary file is not perfectly sorted, some entries might be skipped in looking for words. If you have hand-edited the dictionary file, be sure that each entry is in the proper format.

You may have specified phones in the phone list that are not represented in the words in the transcript. The trainer expects to find examples of each phone at least once.

 

WARNING: CTL file, audio file name.mfc, does not exist, or is empty.

The .mfc files are the feature files converted from the input audio files in make_feats.pl. Did you skip this step? Did you add new audio files without converting them? The script expects a feature file to be there, and it isn't.

 

Very low recognition accuracy.

This might happen if there is a mismatch in the audio files and the parameters of training, or between the training and the testing.

 

ERROR: "backward.c", line 430: Failed to align audio to trancript: final state of the search is not reached.

Sometimes the audio in your database doesn't match the transcription properly. For example, the transcription file has the line “Hello world” but in the audio “Hello hello world” is actually pronounced. The training process usually detects that and emits this message in the logs. If there are too many such errors, it most likely means you misconfigured something, for example a mismatch between audio and text caused by transcription reordering. If there are only a few errors, you can ignore them. You might want to edit the transcription file to put there the exact words which were pronounced; in the case above you would edit the transcription file and put “Hello hello world” on the corresponding line. You might also want to filter out such prompts, because they affect acoustic model quality. In that case you need to enable the forced alignment stage in training. To do that, edit the sphinx_train.cfg line

$CFG_FORCEDALIGN = 'yes';

and run training again. It will execute stages 10 and 11 and will filter your database.

 

To ask for help

If you want to ask for help about training, try to provide the training folder, or at least the logdir. Pack the files into an archive, upload it to a public file sharing resource, then post the link to the resource. Remember: the more information you provide, the faster you will solve the problem.
