A thread-safe C++ interface to SRILM language models


Blog post: http://blog.csdn.net/wangxinginnlp/article/details/46963659

Old versions are not thread-safe

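For the past few days I have been working on a multi-threaded translation decoder. A decoder that ran fine in a single thread kept dying with segmentation fault (core dumped) for no apparent reason once it ran in multiple threads. After a day of debugging, the culprit turned out to be the language model.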

Old versions of SRILM do not support multi-threading and fail in a multi-threaded environment. The failures show up as follows:

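Sharing one language model among the threads, with every thread reading from it: segmentation fault (core dumped).
Giving each thread its own language model, which the thread loads and then uses: only the first thread loads the model successfully and the other threads fail to load it. The program reports no error, but every language model score in the translation output is 0.
Giving each thread its own language model, but having the main process prepare the models up front (that is, new several objects) and hand them out to the threads: segmentation fault (core dumped).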

In short, old versions of SRILM cannot be used from multiple threads.

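To tell whether your SRILM is an old version, look at its interface. If the functions for loading a model and for scoring the conditional probability of a word are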

void *sriLoadLM(const char *fn, int arpa = 0, int order = 3, int unk = 0, int tolow=0);
double sriWordProb(void *plm, const char *word, const char *context);

then congratulations, your SRILM is an old version.

New versions are thread-safe

The question now is how to confirm that the new version really is thread-safe.

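Download the current version from the SRILM site (http://www.speech.sri.com/projects/srilm/) and unpack it. The doc directory under the root contains a file named README-THREADS, whose first paragraph reads: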

As of November, 2012 SRILM supports multi-threaded applications. This enhancement applies to the five libraries that comprise SRILM: libmisc, libdstruct, liboolm, libflm and liblattice. Please note that this does not imply that all API methods are thread-safe, but rather that it is possible to perform independent SRILM tasks on multiple threads without interference or instability. Some APIs that perform read-only calculations may be safe to call on objects shared by multiple threads but in general this is not safe, particularly on APIs that mutate data structures not solely owned by the current thread. We will attempt to document specific allowances and limitations within this README and inline in the code.

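The key point is the part about read-only calculations versus APIs that mutate data structures. In short, in the new SRILM read-only calls may be safe on shared objects, while calls that write to shared data are not guaranteed to be.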
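One way to picture that rule is the sketch below. ToyModel, add, score and scoreInParallel are hypothetical names invented for illustration and are not part of SRILM; the -99 floor for unknown words is likewise arbitrary. The model is filled in before any thread starts and is afterwards reached only through a const reference, so the worker threads can call nothing but read-only methods.

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a loaded language model (NOT an SRILM class).
class ToyModel {
public:
    // Mutating call: only safe before the worker threads start.
    void add(const std::string& word, double logprob) { table_[word] = logprob; }

    // Read-only call: no member is modified, so concurrent calls on one
    // shared object do not race with each other.
    double score(const std::string& word) const {
        auto it = table_.find(word);
        return it == table_.end() ? -99.0 : it->second;  // arbitrary OOV floor
    }

private:
    std::unordered_map<std::string, double> table_;
};

// Score each sentence on its own thread against one shared, read-only model.
std::vector<double> scoreInParallel(
        const ToyModel& lm,
        const std::vector<std::vector<std::string>>& sentences) {
    std::vector<double> totals(sentences.size(), 0.0);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < sentences.size(); ++i) {
        // Each thread writes only to its own slot of totals.
        workers.emplace_back([&lm, &sentences, &totals, i] {
            for (const std::string& w : sentences[i]) totals[i] += lm.score(w);
        });
    }
    for (std::thread& t : workers) t.join();
    return totals;
}
```

The only object shared between the workers is the read-only model. Whether a particular SRILM method is genuinely read-only still has to be checked in its source; the const qualifier here is a property of the toy class only. Older toolchains may need -pthread when compiling this.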

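Annoyingly, searching Google for "SRILM接口" ("SRILM interface"), "SRILM API" and the like does not turn up an official interface; every hit describes the old one. The one exception is http://blog.csdn.net/mouxiaofeng/article/details/5144750.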

I later found a file named lm-intro in the same doc directory. It contains this passage:

API FOR LANGUAGE MODELS
These programs are just examples of how to use the object-oriented language model library currently under construction. To use the API one would have to read the various .h files and how the interfaces are used in the example programs. No comprehensive documentation is available as yet. Sorry.

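The key point is the last two sentences. In short, there is no official API documentation; you have to read the headers.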

Adapting the interface

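Fortunately, someone has published Python and Perl bindings on GitHub: https://github.com/desilinguist/swig-srilm (referred to below as the Python interface). However, the probability functions it exposes all score whole n-grams (getUnigramProb, getBigramProb, getTrigramProb, getNgramProb) or whole sentences (getSentenceProb). Our decoder needs something different: given a language model and a word, compute the probability of that word conditioned on a context. So below we derive a C++ interface from the Python interface.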

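The srilm.c file of the Python interface makes it clear that its language model is an Ngram. Ngram.h, in the Python interface's include directory, shows that Ngram publicly inherits from LM and has a wordProb(VocabIndex word, const VocabIndex *context) method. Excellent: that is exactly the call we need.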

If you are curious what LM is, read LM.h in the same include directory.

It opens with the official description:

LM.h --
* Generic LM interface
* The LM class defines an abstract language model interface which all other classes refine and inherit from.

Reading on, LM declares the pure virtual LogP wordProb(VocabIndex word, const VocabIndex *context) = 0 as well as a LogP wordProb(VocabString word, const VocabString *context) overload.

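Intuitively, VocabIndex is the numeric id of a word and VocabString is a word held in a string class. That guess is only partly right, as the definitions in Vocab.h show.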

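What we are after is LM's wordProb(VocabString word, const VocabString *context). If you give in to temptation at this point and simply wrap the LM class inside the Python interface's srilm code, you will find that LM's read() function fails, because the base class does not actually implement it. See lm/src/LM.cc in the unpacked SRILM tree.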

LM.cc does, however, implement wordProb(VocabString word, const VocabString *context).

That implementation contains one potential write operation, but it is controlled by addUnkWords, which defaults to false.

This overload would actually work fine; I have simply never been comfortable handling its second argument, the char** array of strings.
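For readers in the same position, the char** argument can be built from ordinary std::string values. The sketch below rests on two assumptions that should be checked against your own copy of the headers: that VocabString is a typedef for const char*, and that the context array lists the most recent word first and ends with a null pointer. makeContext is a hypothetical helper, not an SRILM function.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef const char* VocabString;  // assumed to mirror the typedef in Vocab.h

// Build a null-terminated array of C strings from a left-to-right history.
// The returned pointers borrow from `history`, which must outlive the result.
std::vector<VocabString> makeContext(const std::vector<std::string>& history,
                                     std::size_t maxWords) {
    std::vector<VocabString> ctx;
    std::size_t n = history.size() < maxWords ? history.size() : maxWords;
    // Most recent word first: walk the history backwards.
    for (std::size_t i = 0; i < n; ++i) {
        ctx.push_back(history[history.size() - 1 - i].c_str());
    }
    ctx.push_back(nullptr);  // terminator marking the end of the context
    return ctx;
}
```

The result of ctx.data() converts implicitly to const VocabString*, which is the parameter type of the string overload of wordProb quoted above.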

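My own solution was to put Ngram's wordProb to good use. The n-gram probability code in srilm.c does nothing more than split the n-gram string into words, look up the index of each word in the vocab, and pass the indices on for scoring.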

// get generic n-gram probability (up to n=7)
float getNgramProb(Ngram* ngram, const char* ngramstr, unsigned order) {
    const char* words[7];
    unsigned int indices[order];
    int numparsed, histsize, i, j;
    char* scp;
    float ans;

    // Duplicate string so that we don't mess up the original
    scp = strdupa(ngramstr);

    // Parse the given string into words (split step)
    numparsed = Vocab::parseWords(scp, (VocabString *)words, 7);
    if(numparsed != order) {
        fprintf(stderr, "Error: Given order (%d) does not match number of words (%d).\n", order, numparsed);
        return 0;
    }

    // Get indices for the words obtained above, if you don't find them, then add them
    // to the vocabulary and then get the indices.
    // (index lookup step; this call writes to the vocab and is not thread-safe)
    swig_srilm_vocab->addWords((VocabString *)words, (VocabIndex *)indices, order);

    // Create a history array of size "order" and populate it (probability step)
    unsigned hist[order];
    for(i=order; i>1; i--) {
        hist[order-i] = indices[i-2];
    }
    hist[order-1] = Vocab_None;

    // Compute the ngram probability
    ans = getWordProb(ngram, indices[order-1], hist);

    // Return the representation of log(0) if needed
    if(ans == LogP_Zero)
        return BIGNEG;
    return ans;
}

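As the comment in the code above marks, Vocab's addWords performs a write and is not thread-safe. Use getIndices instead: it only reads and is thread-safe.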

[Addendum]

The function is declared as:

virtual unsigned int getIndices(const VocabString *words, VocabIndex *wids, unsigned int max, VocabIndex unkIndex = Vocab_None);
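The difference between the two calls can be illustrated with a toy vocabulary. ToyVocab below is a hypothetical class that only imitates the shape of addWord and of the getIndices declaration above; it is not SRILM's implementation, and the typedefs and the Vocab_None value are assumptions modeled on Vocab.h.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

typedef unsigned int VocabIndex;               // assumed, as in Vocab.h
typedef const char* VocabString;               // assumed, as in Vocab.h
const VocabIndex Vocab_None = (VocabIndex)-1;  // assumed sentinel value

// Hypothetical toy vocabulary; it only imitates the shape of the two calls.
class ToyVocab {
public:
    // Write path: an unseen word is inserted, so the table changes.
    VocabIndex addWord(VocabString word) {
        auto it = byName_.find(word);
        if (it != byName_.end()) return it->second;
        VocabIndex id = static_cast<VocabIndex>(byName_.size());
        byName_.emplace(word, id);
        return id;
    }

    // Read path: an unseen word maps to unkIndex and nothing is inserted.
    unsigned int getIndices(const VocabString* words, VocabIndex* wids,
                            unsigned int max,
                            VocabIndex unkIndex = Vocab_None) const {
        unsigned int i = 0;
        for (; i < max && words[i] != nullptr; ++i) {
            auto it = byName_.find(words[i]);
            wids[i] = (it == byName_.end()) ? unkIndex : it->second;
        }
        if (i < max) wids[i] = Vocab_None;  // terminate the index array
        return i;
    }

    std::size_t size() const { return byName_.size(); }

private:
    std::unordered_map<std::string, VocabIndex> byName_;
};
```

In the toy class a lookup of an unseen word leaves size() unchanged, whereas addWord grows the table; that growth is the kind of write that makes sharing one vocabulary between threads unsafe.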

If you use getIndices, be sure to train the model with -unk. The blog post at http://blog.csdn.net/zhoubl668/article/details/7759042, for example, trains without the unk flag.

unk and map-unk are easy to confuse (I confused them myself).

The help text for the unk option of the ngram-count training command:

-unk: keep <unk> in LM

For the unk flag of ngram-count at training time, see http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html

-unk

Build an ``open vocabulary'' LM, i.e., one that contains the unknown-word token as a regular word. The default is to remove the unknown word.

-map-unk word

Map out-of-vocabulary words to word, rather than the default <unk> tag.

As for why -unk is needed, see the following part of http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html

D2) What happens when the OOV word is in the context of an N-gram?

Exact details depend on the discounting algorithm used, but typically the backed-off probability from a lower order N-gram is used. If the -unk option is used as explained below, an <unk> token is assumed to take the place of the OOV word and no back-off may be necessary if a corresponding N-gram containing <unk> is found in the LM.

D3) Isn't it wrong to assign 0 logprob to OOV words?

That depends on the application. If you are comparing multiple language models which all consider the same set of words as OOV it may be OK to ignore OOV words. Note that perplexity comparisons are only ever meaningful if the vocabularies of all LMs are the same. Therefore, to compare LMs with different sets of OOV words (such as when using different tokenization strategies for morphologically complex languages) then it becomes important to take into account the true cost of the OOV words, or to model all words, including OOVs.

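Concretely, the getIndex function called by getIndices looks up the index of a word; for an OOV word it returns the index of unk, and a score is then computed. If the model was trained without unk and an OOV word comes up, ~~the score comes out as -inf (I did not dig into why)~~ the outcome depends on which call is used.

With getIndices, an OOV word triggers this error: ../../include/LHash.cc:273: Boolean LHash<KeyT, DataT>::locate(KeyT, unsigned int&) const [with KeyT = unsigned int, DataT = float]: Assertion `!Map_noKeyP(key)' failed.

With addWords, a very small model score is returned instead (-99?).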

To train a language model with unk, see: http://www.cs.cmu.edu/~tanja/11-753/Lectures-Thomas/Exercises_Solutions/Session8/exercise8.html

Command:

ngram-count -order 2 -lm train.2.arpabo.gz -text data/CH/trl.utf8.set/trl.utf8.train -unk -map-unk "<UNK>" -kndiscount -interpolate

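Judging from Vocab.h and Vocab.cc, addWords writes a word into the vocab when it cannot find it, whereas getIndices apparently does not. (Its implementation does contain an addWord call under certain conditions; I will confirm once I have read it more closely. In my experiments it behaved as thread-safe.)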

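Ngram's wordProb computes from the indices of the word and its context, and it has back-off built in.

My plan is to look up the index of the word directly, and to split the context in the same way, look up each index, and form an index array.

My worry is that my context has variable length (unlike the Python interface, where the length is hard-coded and must be specified), so the number of valid entries in its index array varies. Will that affect the probability computation?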

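Concretely, I first process my own string so that the context is no longer than the n-gram order, then pass the word and the context to the interface for scoring. The likely trouble spot is whether a context shorter than n causes errors. With getNgramProb in srilm.cc in mind, three questions matter:

In Vocab::parseWords(scp, (VocabString *)words, order), words has a fixed length of 7. When a context shorter than that is split, what goes into the unused array elements? For example, a context of 4 words fills the first 4 entries of words with its char strings; what happens to the remaining 3?
How does getIndices treat the unused positions of words from the previous step? getNgramProb declares unsigned int indices[order]; when the context has fewer than order words, how are the extra, invalid entries of indices handled?
How does wordProb handle indices arrays with different numbers of valid entries?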

The answers: for a context shorter than max, the unused words[i] are set to 0;

and the unused entries of the index array are set to Vocab_None.
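The padding rule can be sketched as follows. This is a self-contained illustration, not the code from the repository linked below: contextIndices and contextLength are hypothetical helpers, a plain std::unordered_map stands in for the SRILM vocab, and the typedef and Vocab_None value are assumptions modeled on Vocab.h. The most recent word is placed first, matching the way hist is filled in getNgramProb.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

typedef unsigned int VocabIndex;               // assumed, as in Vocab.h
const VocabIndex Vocab_None = (VocabIndex)-1;  // assumed sentinel value

// Turn a whitespace-separated context into an index array of fixed size
// `order` (assumed >= 1). Unused slots stay Vocab_None, so the array is
// always terminated no matter how short the context is.
std::vector<VocabIndex> contextIndices(
        const std::string& context,
        const std::unordered_map<std::string, VocabIndex>& vocab,
        VocabIndex unkIndex, unsigned int order) {
    std::vector<std::string> words;
    std::istringstream in(context);
    for (std::string w; in >> w;) words.push_back(w);

    // Every slot starts out as the terminator.
    std::vector<VocabIndex> ids(order, Vocab_None);

    // Keep at most order-1 history words, most recent first.
    std::size_t keep = words.size() < order - 1 ? words.size() : order - 1;
    for (std::size_t i = 0; i < keep; ++i) {
        auto it = vocab.find(words[words.size() - 1 - i]);
        ids[i] = (it == vocab.end()) ? unkIndex : it->second;
    }
    return ids;
}

// Number of valid entries before the first Vocab_None.
unsigned int contextLength(const VocabIndex* ids) {
    unsigned int n = 0;
    while (ids[n] != Vocab_None) ++n;
    return n;
}
```

Because a consumer can stop at the first Vocab_None, a one-word context and a full-length context can share the same buffer size.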

Interface code

https://github.com/hsing-wang/SRILM_interface

Building the interface

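Create a directory for the SRILM C++ interface, SRILM_interface.
Prepare the include and lib resources. According to the Python interface's instructions, building the interface requires the relevant static libraries and header files. Copy the include and lib directories from the root of a successfully built SRILM into SRILM_interface, or point the compiler at them with path options.
Put our rewritten srilm.h and srilm.cc, together with main.cc, into SRILM_interface.
Compile: g++ srilm.h srilm.cc main.cc -I ./include ./lib/liboolm.a ./lib/libdstruct.a ./lib/libmisc.a ./lib/liblattice.a ./lib/libflm.a -lpthread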

If it compiles, run the resulting binary yourself as a test. If the results are correct, the C++ interface is ready.

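Experiment:

The relationship between unk and addWords confused even me, since I had not read the code closely. So I simply ran an experiment, changing one variable at a time.

1. Train two language models, one with the unk flag and one without.

The commands are: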

./ngram-count -order 3 -lm srilm.test.o3.withUnk.gz -text srilm.test -unk -kndiscount -interpolate

./ngram-count -order 3 -lm srilm.test.o3.withoutUnk.gz -text srilm.test -kndiscount -interpolate

You can use grep to see how the two files differ.

2. When using addWords:

1 0