Learning CMU Sphinx: Training Your Own Chinese Language Model and Acoustic Model

Source: Internet | Published by: 知乎菠萝斑马餐桌 | Editor: 程序博客网 | Time: 2024/05/07 07:27

http://www.cnblogs.com/huanghuang/archive/2011/07/18/2109101.html

This walkthrough likewise builds on the blog post above, adapted for Windows XP.

I. Training the language model

(1) Install the language-model training toolkit, CMUCLMTK

It can be built directly with VS2010.

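(2) Prepare the training text

Enter the following in a .txt file (a.txt in the commands of step (3)). Do not leave a trailing "\n" at the end of the file (experiments confirmed that this matters).

Make sure there is a space after <s> and a space before </s>. Somehow my spaces went missing, and it took me a long time to find the cause.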

<s> 前进 </s>

<s> 后退 </s>

<s> 左转 </s>

<s> 右转 </s>

<s> 停止 </s>
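Because both the spacing around the sentence markers and the missing final newline are easy to get wrong in an editor, the file can also be generated by a script. The following is a minimal sketch, not part of the original workflow; the function names and the UTF-8 encoding are assumptions (the original post does not say which encoding it used).

```python
# The five command words from the listing above.
COMMANDS = ["前进", "后退", "左转", "右转", "停止"]


def build_corpus(words):
    """Wrap each word as '<s> word </s>', one sentence per line.

    str.join() puts "\n" only between lines, so the text has no trailing newline.
    """
    return "\n".join(f"<s> {word} </s>" for word in words)


def write_corpus(path, words, encoding="utf-8"):
    # encoding is an assumption; newline="" stops Windows from rewriting "\n".
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(build_corpus(words))
```

Calling `write_corpus("a.txt", COMMANDS)` would produce the five lines shown above.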

(3) Build the model from the training text

Open a cmd prompt, change into CMUCLMTK's bin directory, and run the following commands:

    text2wfreq < a.txt > a.wfreq

    wfreq2vocab < a.wfreq > a.vocab

    text2idngram -vocab a.vocab -idngram a.idngram < a.txt

    idngram2lm -idngram a.idngram -vocab a.vocab -arpa a.arpa

    sphinx_lm_convert -i a.arpa -o a.lm.DMP

This finally produces the language model a.lm.DMP.
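Before converting a.arpa, it can help to confirm that the n-gram counts look plausible. An ARPA file begins with a `\data\` header listing `ngram N=count` lines. The sketch below reads those counts; the function name is hypothetical, and the sample header is illustrative rather than the author's actual a.arpa (its counts are the ones pocketsphinx reports for this model in the log further down).

```python
import re


def arpa_ngram_counts(arpa_text):
    """Return {order: count} read from the header of an ARPA-format language model."""
    counts = {}
    in_header = False
    for raw in arpa_text.splitlines():
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue
        match = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif line.startswith("\\"):
            break  # the first "\1-grams:" section ends the header
    return counts


# Illustrative header only (not the real file contents).
SAMPLE_HEADER = "\\data\\\nngram 1=8\nngram 2=10\nngram 3=13\n\n\\1-grams:\n"
```

For a real file, the text would come from something like `open("a.arpa", encoding="utf-8").read()`, with the encoding again being an assumption.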

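II. Training the acoustic model

(1) Prepare some files

      Create two directories, etc and wav, and prepare the data files in them for use in training later, laid out as follows (my_db stands for the database name; this walkthrough uses a):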

etc
- my_db.dic - Phonetic dictionary
- my_db.phone - Phoneset file
- my_db.lm.DMP - Language model
- my_db.filler - List of fillers
- my_db_train.fileids - List of files for training
- my_db_train.transcription - Transcription for training
- my_db_test.fileids - List of files for testing
- my_db_test.transcription - Transcription for testing
wav
- speaker_1
  - 1_11.wav - Recording of speech utterance
- speaker_2
  - 2_11.wav - Recording of speech utterance
- speaker_3
  - 3_11.wav - Recording of speech utterance
- speaker_4
  - 4_11.wav - Recording of speech utterance
- speaker_5
  - 5_11.wav - Recording of speech utterance

(2) Prepare the data files under etc

1.   a_train.fileids

2.   a_test.fileids

Contents:

speaker_1/1_11

speaker_2/2_11

speaker_3/3_11

speaker_4/4_11

speaker_5/5_11

3.   a_train.transcription

4.   a_test.transcription

Contents:

<s> 前进 </s> (1_11)

<s> 后退 </s> (2_11)

<s> 左转 </s> (3_11)

<s> 右转 </s> (4_11)

<s> 停止 </s> (5_11)
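The fileids and transcription files must stay in step: each utterance id in parentheses has to match the file name at the end of the corresponding fileids line. One way to keep them aligned is to generate both from a single table. This is a sketch under assumptions: the helper names and the UTF-8 encoding are not from the original post.

```python
# (speaker directory, utterance id, word) for each recording, as listed above.
UTTERANCES = [
    ("speaker_1", "1_11", "前进"),
    ("speaker_2", "2_11", "后退"),
    ("speaker_3", "3_11", "左转"),
    ("speaker_4", "4_11", "右转"),
    ("speaker_5", "5_11", "停止"),
]


def fileids_lines(utterances):
    """One 'speaker/utterance' path per recording, without the .wav extension."""
    return [f"{speaker}/{utt}" for speaker, utt, _ in utterances]


def transcription_lines(utterances):
    """One '<s> word </s> (utterance)' line per recording, in the same order."""
    return [f"<s> {word} </s> ({utt})" for _, utt, word in utterances]


def write_lines(path, lines, encoding="utf-8"):
    # Joined without a trailing newline, following the note in this article.
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write("\n".join(lines))
```

For example, `write_lines("etc/a_train.fileids", fileids_lines(UTTERANCES))` and `write_lines("etc/a_train.transcription", transcription_lines(UTTERANCES))` would write the training pair.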

5.  a.dic

          Contents:

  前进    Q IAN J IN

  后退    H OU T UI

  左转    Z UO ZH UAN

  右转    Y OU ZH UAN

  停止    T ING ZH IB

6.  a.filler

    Contents:

<s> SIL

</s> SIL

<sil> SIL

7.  Generate a.phone

cd ..

      cd scripts

      cp make_phoneset.pl  /sphinx/Mytrain/etc

      ./make_phoneset.pl my_db.dic my_db.filler > my_db.phone

      cd ..

      cd etc

      vi my_db.phone    # inspect the contents of the generated .phone file

IAN

IB

IN

ING

OU

SIL

UAN

UI

UO

ZH

      If the list above contains any extra letters, delete them by hand.
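The phone list is simply the set of distinct phones used in the dictionary and the filler dictionary. The sketch below derives it directly from the contents of a.dic and a.filler shown earlier; it illustrates the idea and is not a port of make_phoneset.pl. For these files it yields 16 phones, which agrees with the "16" ciphones reported in the decoder log near the end of this article, so the listing above, which shows only the multi-letter phones, appears to have lost the single-letter ones (H, J, Q, T, Y, Z).

```python
def phoneset(*dictionary_texts):
    """Collect the distinct phones from one or more dictionary texts.

    Each non-empty line is 'WORD PHONE PHONE ...'; the first field is the word.
    """
    phones = set()
    for text in dictionary_texts:
        for line in text.splitlines():
            fields = line.split()
            phones.update(fields[1:])
    return sorted(phones)


A_DIC = """\
前进    Q IAN J IN
后退    H OU T UI
左转    Z UO ZH UAN
右转    Y OU ZH UAN
停止    T ING ZH IB
"""

A_FILLER = """\
<s> SIL
</s> SIL
<sil> SIL
"""

PHONES = phoneset(A_DIC, A_FILLER)
```

Every phone used in the dictionary has to appear in the phone file, so a comparison like this is a quick way to catch a mismatch before training.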

      At this point, together with the a.lm.DMP generated earlier, the data files under etc are ready.

Note: none of the files above may end with a trailing "\n".

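(3) Prepare the audio files under wav

First, on Windows, record wav files for the five command words 前进 (forward), 后退 (back), 左转 (turn left), 右转 (turn right), and 停止 (stop), then put them in the shared folder.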

 mkdir  speaker_1  speaker_2  speaker_3  speaker_4  speaker_5

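Copy the audio files 1_11.wav, 2_11.wav, 3_11.wav, 4_11.wav, and 5_11.wav into the corresponding directories speaker_1, speaker_2, speaker_3, speaker_4, and speaker_5.

Note: the audio files here are recorded in mono at a 16 kHz sampling rate.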
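A recording made with the wrong settings is hard to spot by ear, so it may be worth checking each file's header before training. The sketch below uses Python's standard `wave` module; the function name and default values are illustrative choices based on the note above.

```python
import wave


def wav_problems(path, rate=16000, channels=1):
    """Return a list of mismatches between a WAV file and the expected format."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != rate:
            problems.append(f"sample rate is {w.getframerate()} Hz, expected {rate}")
        if w.getnchannels() != channels:
            problems.append(f"{w.getnchannels()} channels, expected {channels}")
    return problems
```

An empty list from, say, `wav_problems("wav/speaker_1/1_11.wav")` means the file matches the 16 kHz mono format.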

(4) Set up the training environment with SphinxTrain, using the an4 task

  cd  F:\acoustic\train_test

  SphinxTrain/scripts_pl/setup_SphinxTrain.pl -task an4

pocketsphinx/scripts/setup_sphinx.pl -task an4

  The following directories are created in the current directory:

bin

  bwaccumdir

etc

  feat

  logdir

  model_parameters

  model_architecture

  scripts_pl

wav

Copy the files prepared under etc and wav in (2) and (3) into the newly created wav and etc directories.

(5) Edit the etc/sphinx_train.cfg configuration

  1. Paths

# These are filled in at configuration time

$CFG_DB_NAME = "an4";

$CFG_BASE_DIR ='F:/acoustic/train_test';

$CFG_SPHINXTRAIN_DIR = 'F:\acoustic\pocketsphinx';

Change to:

# These are filled in at configuration time

$CFG_DB_NAME = "a";

$CFG_BASE_DIR = 'F:/acoustic/train_test';

$CFG_SPHINXTRAIN_DIR = 'F:\acoustic\pocketsphinx';

  2. Audio

# Audio waveform and feature file information

$CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";

$CFG_WAVFILE_EXTENSION = 'sph';

$CFG_WAVFILE_TYPE = 'nist'; # one of nist, mswav, raw

Change to:

# Audio waveform and feature file information

$CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";

$CFG_WAVFILE_EXTENSION = 'wav';

$CFG_WAVFILE_TYPE = 'mswav'; # one of nist, mswav, raw

  3. File naming

#*******variables used in main training of models*******

$CFG_DICTIONARY     = "$CFG_LIST_DIR/$CFG_DB_NAME.dic";

$CFG_RAWPHONEFILE   = "$CFG_LIST_DIR/$CFG_DB_NAME.phone";

$CFG_FILLERDICT     = "$CFG_LIST_DIR/$CFG_DB_NAME.filler";

$CFG_LISTOFFILES    = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.fileids";

$CFG_TRANSCRIPTFILE = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.transcription";

$CFG_FEATPARAMS     = "$CFG_LIST_DIR/feat.params";

Make sure the files under etc are named this way.
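Since every path in that block is built from `$CFG_DB_NAME`, a misnamed file (for example `.fillers` instead of `.filler`) only shows up once training fails. The following sketch lists which of the expected files are missing; the function names are hypothetical, and it assumes `$CFG_LIST_DIR` is the etc directory, as the sentence above implies.

```python
import os


def expected_etc_files(db_name):
    """File names that the configuration variables above resolve to."""
    return [
        f"{db_name}.dic",
        f"{db_name}.phone",
        f"{db_name}.filler",
        f"{db_name}_train.fileids",
        f"{db_name}_train.transcription",
        "feat.params",
    ]


def missing_etc_files(etc_dir, db_name):
    """Return the expected files that do not exist in etc_dir."""
    return [
        name
        for name in expected_etc_files(db_name)
        if not os.path.isfile(os.path.join(etc_dir, name))
    ]
```

With the database name used here, the call would be `missing_etc_files("etc", "a")`; an empty result means all the names line up.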

  4. Model type and model parameters

#*******variables used in characterizing models*******

$CFG_HMM_TYPE = '.cont.'; # Sphinx III

#$CFG_HMM_TYPE  = '.semi.'; # Sphinx II

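pocketsphinx, like Sphinx III, uses .cont. here.

(6) Configure the acoustic feature parameters

This step applies when the sampling rate is 8 kHz; if the sampling rate is 16 kHz, it can be skipped.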

vi feat.params

Add or modify the following:

-samprate 8000.0

-nfilt 31

-lowerf 200.00

-upperf 3500.00

-dither yes
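The filterbank limits have to fit the sampling rate: `-upperf` cannot exceed half of `-samprate` (the Nyquist frequency), which is why 8 kHz audio uses an upper edge of 3500 Hz. The sketch below parses a feat.params-style text and checks that constraint. The function names are hypothetical, and the 16000 Hz fallback is taken from the default shown in the decoder log later in this article.

```python
def parse_feat_params(text):
    """Parse '-name value' lines into a dict keyed by name (without the dash)."""
    params = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].startswith("-"):
            params[fields[0][1:]] = fields[1]
    return params


def filterbank_problems(params, default_samprate=16000.0):
    """Return a list of inconsistencies between samprate, lowerf and upperf."""
    samprate = float(params.get("samprate", default_samprate))
    nyquist = samprate / 2.0
    problems = []
    upperf = params.get("upperf")
    lowerf = params.get("lowerf")
    if upperf is not None and float(upperf) > nyquist:
        problems.append(f"upperf {upperf} exceeds Nyquist {nyquist}")
    if upperf is not None and lowerf is not None and float(lowerf) >= float(upperf):
        problems.append(f"lowerf {lowerf} is not below upperf {upperf}")
    return problems


# The 8 kHz settings listed above.
FEAT_PARAMS_8K = """\
-samprate 8000.0
-nfilt 31
-lowerf 200.00
-upperf 3500.00
-dither yes
"""
```

The same check can be run on the feat.params stored inside a trained model directory to see which settings the decoder will pick up.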

(7) Configure the decoding parameters

  1. Paths

# These are filled in at configuration time

$DEC_CFG_DB_NAME = 'an4';

$DEC_CFG_BASE_DIR = '/sphinx/MyTrain/an4';

$DEC_CFG_SPHINXDECODER_DIR = 'pocketsphinx';

Change to:

# These are filled in at configuration time

$DEC_CFG_DB_NAME = 'a';

$DEC_CFG_BASE_DIR = '/sphinx/a';

$DEC_CFG_SPHINXDECODER_DIR = 'pocketsphinx';

  2. Language model name

# This variables, used by the decoder, have to be user defined, and

# may affect the decoder output

$DEC_CFG_LANGUAGEMODEL_DIR = "$DEC_CFG_BASE_DIR/etc";

$DEC_CFG_LANGUAGEMODEL  = "$DEC_CFG_LANGUAGEMODEL_DIR/an4.lm.DMP";

$DEC_CFG_LANGUAGEWEIGHT = "10";

$DEC_CFG_BEAMWIDTH = "1e-80";

$DEC_CFG_WORDBEAM = "1e-40";

Change to:

# This variables, used by the decoder, have to be user defined, and

# may affect the decoder output

$DEC_CFG_LANGUAGEMODEL_DIR = "$DEC_CFG_BASE_DIR/etc";

$DEC_CFG_LANGUAGEMODEL  = "$DEC_CFG_LANGUAGEMODEL_DIR/a.lm.DMP";

$DEC_CFG_LANGUAGEWEIGHT = "10";

$DEC_CFG_BEAMWIDTH = "1e-80";

$DEC_CFG_WORDBEAM = "1e-40";

(8) Start training

 cd F:\acoustic\train_test

./scripts_pl/make_feats.pl -ctl etc/a_train.fileids

./scripts_pl/make_feats.pl -ctl etc/a_test.fileids

./scripts_pl/RunAll.pl

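A new acoustic model, a.cd_cont_1000, is generated under the model_parameters directory.

III. Using the trained language model and acoustic model

Create a new folder named test under F:\acoustic.

Copy pocketsphinx_continuous.exe, sphinxbase.dll, pocketsphinx.dll, a.dic, a.cd_cont_1000, and a.lm.DMP into the test folder for testing.

From the command line, change into that folder and run: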

pocketsphinx_continuous -hmm a.cd_cont_1000 -lm a.lm.DMP -dict a.dic

The recognizer then starts. A full session looks like this:

C:\Documents and Settings\Administrator>e:

E:\>cd bin

E:\bin>pocketsphinx_continuous -hmm a.cd_cont_1000 -lm a.lm.DMP -dict a.dic
INFO: cmd_ln.c(691): Parsing command line:
pocketsphinx_continuous \
        -hmm a.cd_cont_1000 \
        -lm a.lm.DMP \
        -dict a.dic

Current configuration:
[NAME]          [DEFLT]         [VALUE]
-adcdev
-agc            none            none
-agcthresh      2.0             2.000000e+000
-alpha          0.97            9.700000e-001
-argfile
-ascale         20.0            2.000000e+001
-aw             1               1
-backtrace      no              no
-beam           1e-48           1.000000e-048
-bestpath       yes             yes
-bestpathlw     9.5             9.500000e+000
-bghist         no              no
-ceplen         13              13
-cmn            current         current
-cmninit        8.0             8.0
-compallsen     no              no
-debug                          0
-dict                           a.dic
-dictcase       no              no
-dither         no              no
-doublebw       no              no
-ds             1               1
-fdict
-feat           1s_c_d_dd       1s_c_d_dd
-featparams
-fillprob       1e-8            1.000000e-008
-frate          100             100
-fsg
-fsgusealtpron  yes             yes
-fsgusefiller   yes             yes
-fwdflat        yes             yes
-fwdflatbeam    1e-64           1.000000e-064
-fwdflatefwid   4               4
-fwdflatlw      8.5             8.500000e+000
-fwdflatsfwin   25              25
-fwdflatwbeam   7e-29           7.000000e-029
-fwdtree        yes             yes
-hmm                            a.cd_cont_1000
-infile
-input_endian   little          little
-jsgf
-kdmaxbbi       -1              -1
-kdmaxdepth     0               0
-kdtree
-latsize        5000            5000
-lda
-ldadim         0               0
-lextreedump    0               0
-lifter         0               0
-lm                             a.lm.DMP
-lmctl
-lmname         default         default
-logbase        1.0001          1.000100e+000
-logfn
-logspec        no              no
-lowerf         133.33334       1.333333e+002
-lpbeam         1e-40           1.000000e-040
-lponlybeam     7e-29           7.000000e-029
-lw             6.5             6.500000e+000
-maxhmmpf       -1              -1
-maxnewoov      20              20
-maxwpf         -1              -1
-mdef
-mean
-mfclogdir
-min_endfr      0               0
-mixw
-mixwfloor      0.0000001       1.000000e-007
-mllr
-mmap           yes             yes
-ncep           13              13
-nfft           512             512
-nfilt          40              40
-nwpen          1.0             1.000000e+000
-pbeam          1e-48           1.000000e-048
-pip            1.0             1.000000e+000
-pl_beam        1e-10           1.000000e-010
-pl_pbeam       1e-5            1.000000e-005
-pl_window      0               0
-rawlogdir
-remove_dc      no              no
-round_filters  yes             yes
-samprate       16000           1.600000e+004
-seed           -1              -1
-sendump
-senlogdir
-senmgau
-silprob        0.005           5.000000e-003
-smoothspec     no              no
-svspec
-time           no              no
-tmat
-tmatfloor      0.0001          1.000000e-004
-topn           4               4
-topn_beam      0               0
-toprule
-transform      legacy          legacy
-unit_area      yes             yes
-upperf         6855.4976       6.855498e+003
-usewdphones    no              no
-uw             1.0             1.000000e+000
-var
-varfloor       0.0001          1.000000e-004
-varnorm        no              no
-verbose        no              no
-warp_params
-warp_type      inverse_linear  inverse_linear
-wbeam          7e-29           7.000000e-029
-wip            0.65            6.500000e-001
-wlen           0.025625        2.562500e-002

INFO: cmd_ln.c(691): Parsing command line:
\
        -alpha 0.97 \
        -dither yes \
        -doublebw no \
        -nfilt 31 \
        -ncep 13 \
        -lowerf 200.00 \
        -upperf 3500.00 \
        -nfft 512 \
        -wlen 0.0256 \
        -transform legacy \
        -feat 1s_c_d_dd \
        -agc none \
        -cmn current \
        -varnorm no

Current configuration:
[NAME]          [DEFLT]         [VALUE]
-agc            none            none
-agcthresh      2.0             2.000000e+000
-alpha          0.97            9.700000e-001
-ceplen         13              13
-cmn            current         current
-cmninit        8.0             8.0
-dither         no              yes
-doublebw       no              no
-feat           1s_c_d_dd       1s_c_d_dd
-frate          100             100
-input_endian   little          little
-lda
-ldadim         0               0
-lifter         0               0
-logspec        no              no
-lowerf         133.33334       2.000000e+002
-ncep           13              13
-nfft           512             512
-nfilt          40              31
-remove_dc      no              no
-round_filters  yes             yes
-samprate       16000           1.600000e+004
-seed           -1              -1
-smoothspec     no              no
-svspec
-transform      legacy          legacy
-unit_area      yes             yes
-upperf         6855.4976       3.500000e+003
-varnorm        no              no
-verbose        no              no
-warp_params
-warp_type      inverse_linear  inverse_linear
-wlen           0.025625        2.560000e-002

INFO: acmod.c(246): Parsed model-specific feature parameters from a.cd_cont_1000/feat.params
INFO: fe_interface.c(289): You are using the internal mechanism to generate the seed.
INFO: feat.c(684): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean[0]= 12.00, mean[1..12]= 0.0
INFO: mdef.c(517): Reading model definition: a.cd_cont_1000/mdef
INFO: bin_mdef.c(179): Allocating 166 * 8 bytes (1 KiB) for CD tree
INFO: tmat.c(205): Reading HMM transition probability matrices: a.cd_cont_1000/transition_matrices
INFO: acmod.c(121): Attempting to use SCHMM computation module
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: a.cd_cont_1000/means
INFO: ms_gauden.c(292): 105 codebook, 1 feature, size:
INFO: ms_gauden.c(294):  8x39
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: a.cd_cont_1000/variances
INFO: ms_gauden.c(292): 105 codebook, 1 feature, size:
INFO: ms_gauden.c(294):  8x39
INFO: ms_gauden.c(354): 19025 variance values floored
INFO: acmod.c(123): Attempting to use PTHMM computation module
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: a.cd_cont_1000/means
INFO: ms_gauden.c(292): 105 codebook, 1 feature, size:
INFO: ms_gauden.c(294):  8x39
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: a.cd_cont_1000/variances
INFO: ms_gauden.c(292): 105 codebook, 1 feature, size:
INFO: ms_gauden.c(294):  8x39
INFO: ms_gauden.c(354): 19025 variance values floored
INFO: ptm_mgau.c(804): Number of codebooks doesn't match number of ciphones, doesn't look like PTM: 105 != 16
INFO: acmod.c(125): Falling back to general multi-stream GMM computation
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: a.cd_cont_1000/means
INFO: ms_gauden.c(292): 105 codebook, 1 feature, size:
INFO: ms_gauden.c(294):  8x39
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: a.cd_cont_1000/variances
INFO: ms_gauden.c(292): 105 codebook, 1 feature, size:
INFO: ms_gauden.c(294):  8x39
INFO: ms_gauden.c(354): 19025 variance values floored
INFO: ms_senone.c(160): Reading senone mixture weights: a.cd_cont_1000/mixture_weights
INFO: ms_senone.c(211): Truncating senone logs3(pdf) values by 10 bits
INFO: ms_senone.c(218): Not transposing mixture weights in memory
INFO: ms_senone.c(277): Read mixture weights for 105 senones: 1 features x 8 codewords
INFO: ms_senone.c(331): Mapping senones to individual codebooks
INFO: ms_mgau.c(141): The value of topn: 4
INFO: dict.c(317): Allocating 4105 * 20 bytes (80 KiB) for word entries
INFO: dict.c(332): Reading main dictionary: a.dic
INFO: dict.c(211): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(335): 5 words read
INFO: dict.c(341): Reading filler dictionary: a.cd_cont_1000/noisedict
INFO: dict.c(211): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(344): 3 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(404): Allocating 16^3 * 2 bytes (8 KiB) for word-initial triphones
INFO: dict2pid.c(131): Allocated 3136 bytes (3 KiB) for word-final triphones
INFO: dict2pid.c(195): Allocated 3136 bytes (3 KiB) for single-phone word triphones
INFO: ngram_model_arpa.c(77): No \data\ mark in LM file
INFO: ngram_model_dmp.c(142): Will use memory-mapped I/O for LM file
INFO: ngram_model_dmp.c(196): ngrams 1=8, 2=10, 3=13
INFO: ngram_model_dmp.c(242):        8 = LM.unigrams(+trailer) read
INFO: ngram_model_dmp.c(291):       10 = LM.bigrams(+trailer) read
INFO: ngram_model_dmp.c(317):       13 = LM.trigrams read
INFO: ngram_model_dmp.c(342):        4 = LM.prob2 entries read
INFO: ngram_model_dmp.c(362):        5 = LM.bo_wt2 entries read
INFO: ngram_model_dmp.c(382):        3 = LM.prob3 entries read
INFO: ngram_model_dmp.c(410):        1 = LM.tseg_base entries read
INFO: ngram_model_dmp.c(466):        8 = ascii word strings read
INFO: ngram_search_fwdtree.c(99): 5 unique initial diphones
INFO: ngram_search_fwdtree.c(147): 0 root, 0 non-root channels, 4 single-phone words
INFO: ngram_search_fwdtree.c(186): Creating search tree
INFO: ngram_search_fwdtree.c(191): before: 0 root, 0 non-root channels, 4 single-phone words
INFO: ngram_search_fwdtree.c(326): after: max nonroot chan increased to 138
INFO: ngram_search_fwdtree.c(338): after: 5 root, 10 non-root channels, 3 single-phone words
INFO: ngram_search_fwdflat.c(156): fwdflat: min_ef_width = 4, max_sf_win = 25
INFO: continuous.c(371): pocketsphinx_continuous COMPILED ON: Dec 15 2011, AT: 17:22:14

Allocating 32 buffers of 2500 samples each
READY....
Listening...
Stopped listening, please wait...
INFO: cmn_prior.c(121): cmn_prior_update: from <  8.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00 >
INFO: cmn_prior.c(139): cmn_prior_update: to   <  8.36 -0.22 -0.16 -0.18 -0.05 -0.20 -0.13 -0.12 -0.11 -0.09 -0.13 -0.12 -0.09 >
INFO: ngram_search_fwdtree.c(1549):       77 words recognized (1/fr)
INFO: ngram_search_fwdtree.c(1551):     1389 senones evaluated (15/fr)
INFO: ngram_search_fwdtree.c(1553):      621 channels searched (6/fr), 380 1st, 241 last
INFO: ngram_search_fwdtree.c(1557):      241 words for which last channels evaluated (2/fr)
INFO: ngram_search_fwdtree.c(1560):        0 candidate words for entering last phone (0/fr)
INFO: ngram_search_fwdtree.c(1562): fwdtree 0.05 CPU 0.052 xRT
INFO: ngram_search_fwdtree.c(1565): fwdtree 1.94 wall 2.153 xRT
INFO: ngram_search_fwdflat.c(305): Utterance vocabulary contains 1 words
INFO: ngram_search_fwdflat.c(940):       84 words recognized (1/fr)
INFO: ngram_search_fwdflat.c(942):      267 senones evaluated (3/fr)
INFO: ngram_search_fwdflat.c(944):      260 channels searched (2/fr)
INFO: ngram_search_fwdflat.c(946):      260 words searched (2/fr)
INFO: ngram_search_fwdflat.c(948):       26 word transitions (0/fr)
INFO: ngram_search_fwdflat.c(951): fwdflat 0.00 CPU 0.000 xRT
INFO: ngram_search_fwdflat.c(954): fwdflat 0.00 wall 0.000 xRT
INFO: ngram_search.c(1206): </s> not found in last frame, using <s>.88 instead
INFO: ngram_search.c(1258): lattice start node <s>.0 end node <s>.0
INFO: ngram_search.c(1286): Eliminated 0 nodes before end node
INFO: ngram_search.c(1391): Lattice has 1 nodes, 0 links
INFO: ps_lattice.c(1365): Normalizer P(O) = alpha(<s>:0:88) = -537002291
000000000: (null)
READY....

However, the results are poor: whatever I say comes back as (null), and sometimes not even (null) is printed, i.e.:

000000000:
READY....

I do not know what the problem is.
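When testing repeatedly, the hypothesis lines are easy to lose among the INFO messages. The sketch below pulls them out of captured console output and marks empty or (null) results as None. It assumes the `NNNNNNNNN: text` line shape seen in the output above; the function name and the sample text, including its recognized word, are hypothetical.

```python
import re

# Utterance id (nine digits in the output above), a colon, then the hypothesis.
HYP_LINE = re.compile(r"^(\d{9}): ?(.*)$")


def hypotheses(output):
    """Return (utterance_id, text or None) for each hypothesis line in the output."""
    results = []
    for line in output.splitlines():
        match = HYP_LINE.match(line.strip())
        if match:
            text = match.group(2).strip()
            results.append((match.group(1), None if text in ("", "(null)") else text))
    return results


# Hypothetical sample; only the first two shapes appear in this article.
SAMPLE_OUTPUT = """\
READY....
000000000: (null)
READY....
000000001:
READY....
000000002: 前进
"""
```

Counting the None entries over a test session gives a rough measure of how often the decoder returns nothing.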