A Brief Introduction to Language Modeling


Language modeling plays a pivotal role in automatic speech recognition and can be thought of as a way to impose a collection of constraints on word sequences. Generally, many different sequences can be used to convey the same information, and these constraints tend to be statistical in nature. Thus, regularities in natural language are governed by an underlying (unknown) probability distribution over word sequences.

  • Basic Introduction to Language Modeling

    The n-gram language model is one of the most popular and empirically successful modeling methods. Language models come in various formats; we choose the ARPA format so that we can inspect the content of the model directly. Generally, we choose a 4-gram model as our target language model: the higher the n, the more context the language model captures.
    An ARPA language model stores the word sequences, their probability distribution, and the back-off weights. What we need to do is make the probability distribution comprehensive and reasonable, and at the same time matched to our task.
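    As an illustration (a toy file, not taken from the model described here), an ARPA file lists each n-gram with its log10 probability on the left and an optional back-off weight on the right:

    ```
    \data\
    ngram 1=3
    ngram 2=2

    \1-grams:
    -1.0  </s>
    -0.8  <s>     -0.5
    -0.7  hello   -0.4

    \2-grams:
    -0.3  <s> hello
    -0.2  hello </s>

    \end\
    ```

    When a queried n-gram is missing from the model, the decoder backs off to the lower-order entry, scaled by the back-off weight of the truncated history.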


  • Elements of language modeling

    • Corpus

      For an ASR system, the corpus we need is chosen depending on our application. In our system, we need to make it closer to the spoken register because we need to apply it to Wechat or Tencent. In theory, we should gather as much spoken-style corpus as possible. However, collecting spoken corpus can cost as much as ten times the effort of collecting other text corpora.

    • Dictionary

      The dictionary also plays a basic but significant role in language modeling. During the whole process, two types of dictionary are needed, and we will present them separately.

    • Dictionary of frequency

      The frequency dictionary is needed when we build a model. Its first use is in segmenting the corpus with our dictionary so that words and phrases conform to our rules; anything outside the dictionary is treated as an unknown symbol. Secondly, when we need to build a language model over a fixed vocabulary, this dictionary becomes the constraint on our language model.
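      As a minimal sketch of the second use (the vocabulary and the `<unk>` symbol here are illustrative, not from the system described above), constraining text to a fixed vocabulary looks like this:

      ```python
      # Map tokens outside a fixed vocabulary to an unknown symbol,
      # so the language model only ever sees in-vocabulary words.
      UNK = "<unk>"

      def constrain_to_vocab(tokens, vocab):
          """Replace any token not in `vocab` with the unknown symbol."""
          return [tok if tok in vocab else UNK for tok in tokens]

      vocab = {"we", "need", "a", "language", "model"}
      print(constrain_to_vocab(["we", "need", "a", "bigger", "model"], vocab))
      # → ['we', 'need', 'a', '<unk>', 'model']
      ```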

    • Dictionary of lexicon

      Strictly speaking, the lexicon dictionary does not belong to language modeling. However, to build a decoding lattice, we need to have this dictionary prepared.

    In our system, all the elements in the dictionary are selected from our corpus. That is to say, our dictionary and corpus are related.


  • Basic tools

    • SRILM

      SRILM is a toolkit that helps us build language models. Basic operations and commands are listed below.

      • ngram-count -text training_data -order n -lm out_lm (count n-grams in training_data and estimate an order-n model, written to out_lm)

      • ngram -lm out_lm -ppl test_data (evaluate the model's perplexity on test_data)

    • Jieba

      Jieba is one of the most popular Chinese word-segmentation libraries in Python. Install it by typing: easy_install jieba.
      Whenever you use it, make sure that new-word detection is disabled and that your frequency dictionary has replaced the default one.
      Basic operations are presented at the reference URL.
      Ref URL: http://www.oschina.net/p/jieba
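      The dictionary-driven segmentation that such a library performs can be sketched with a simple forward maximum-matching pass (a toy illustration only; jieba's actual algorithm uses a prefix dictionary with dynamic programming and HMM-based new-word detection):

      ```python
      # Forward maximum matching: at each position, take the longest
      # dictionary word that matches; fall back to a single character.
      def max_match(text, dictionary, max_len=4):
          tokens, i = [], 0
          while i < len(text):
              for span in range(min(max_len, len(text) - i), 0, -1):
                  candidate = text[i:i + span]
                  if span == 1 or candidate in dictionary:
                      tokens.append(candidate)
                      i += span
                      break
          return tokens

      words = {"语言", "模型", "语言模型"}
      print(max_match("语言模型", words))  # the longest match wins: ['语言模型']
      ```

      This also shows why the frequency dictionary must be in place before segmentation: any string absent from the dictionary falls apart into single characters.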


  • Key techniques in language modeling

    • Adaptation

      Natural language is highly variable in several aspects. When there is a mismatch between the training set and the test set, we may get a high perplexity (PP) and the results will be poor.
      To solve this problem we need to adapt all the corpora to our task set and make our language model more general.
      There are many methods of adaptation. The simplest is linear interpolation, which is what we are adopting, but it does not work very well. Latent semantic analysis (LSA) will be our next method for dealing with adaptation across multiple corpora.
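      Linear interpolation simply mixes the probabilities of several component models with weights that sum to one (the weights and probabilities below are made up for illustration):

      ```python
      # Interpolate word probabilities from several component models:
      # P(w | h) = sum_i lambda_i * P_i(w | h), with the lambdas summing to 1.
      def interpolate(probs, lambdas):
          assert abs(sum(lambdas) - 1.0) < 1e-9, "weights must sum to 1"
          return sum(l * p for l, p in zip(lambdas, probs))

      # e.g. a general-domain model and a chat-domain model
      p_general, p_chat = 0.002, 0.010
      p_mixed = interpolate([p_general, p_chat], [0.7, 0.3])  # 0.7*0.002 + 0.3*0.010
      print(p_mixed)
      ```

      Tuning the lambdas on held-out task data is what adapts the mixture toward the target domain.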

    • Smoothing

      When a new phrase appears in our task text, its probability would be zero; but in fact we cannot assign zero to such a phrase, because it may be composed of many known unigram words.
      Smoothing is the method for tackling this problem. Methods like Good-Turing, Witten-Bell and Kneser-Ney are all empirical and useful. But it has been shown that as the size of the corpus changes, the best smoothing method changes as well. The performance of the smoothing techniques also depends on the adaptation.
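      As the simplest illustration of the idea (add-one/Laplace smoothing, which is cruder than the Good-Turing, Witten-Bell or Kneser-Ney methods named above), unseen bigrams receive a small non-zero probability:

      ```python
      from collections import Counter

      # Add-one (Laplace) smoothed bigram probability:
      # P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V)
      def laplace_bigram(bigrams, vocab_size):
          counts = Counter(bigrams)
          context = Counter(w1 for w1, _ in bigrams)
          def prob(w1, w2):
              return (counts[(w1, w2)] + 1) / (context[w1] + vocab_size)
          return prob

      prob = laplace_bigram([("we", "need"), ("we", "need"), ("we", "build")],
                            vocab_size=5)
      print(prob("we", "need"))   # seen twice: (2 + 1) / (3 + 5) = 0.375
      print(prob("we", "model"))  # unseen, yet non-zero: (0 + 1) / (3 + 5) = 0.125
      ```

      The probability mass taken from seen bigrams is redistributed to unseen ones; the more refined methods above differ mainly in how much mass they move and where.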


This brief introduction to language modeling comes to an end. If you are interested in any part of it, contact me at jia.von@foxmail.com and you will get the relevant papers on it.

Note: Copyright belongs to the author. Please credit the source when reposting.
