Language Models for IR


    Traditional IR is divided into two parts: indexing and retrieval. What connects them is word counts: tf and idf. Whatever the traditional IR model is (vector space, probabilistic, ...), the core means of representing the connection between a query and documents is word counts, e.g. tf-idf. Evaluation is done with a similarity function Sim(q, d).
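    To make the traditional picture concrete, here is a minimal sketch in Python of tf-idf vectors and a cosine Sim(q, d); the toy corpus and whitespace tokenization are my own assumptions, not from the original.

        import math
        from collections import Counter

        docs = ["the cat sat on the mat",
                "the dog chased the cat",
                "dogs and cats play"]
        query = "cat dog"

        tokenized = [d.split() for d in docs]
        df = Counter(w for toks in tokenized for w in set(toks))  # document frequency

        def tf_idf(tokens):
            tf = Counter(tokens)
            return {w: tf[w] * math.log(len(docs) / df[w]) for w in tf if w in df}

        def sim(u, v):  # cosine similarity between sparse vectors
            dot = sum(x * v[w] for w, x in u.items() if w in v)
            norm = math.sqrt(sum(x * x for x in u.values())) * \
                   math.sqrt(sum(x * x for x in v.values()))
            return dot / norm if norm else 0.0

        q_vec = tf_idf(query.split())
        for d, toks in zip(docs, tokenized):
            print(round(sim(q_vec, tf_idf(toks)), 3), d)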


    What is a language model? It is a concept from speech recognition that models the regularity of language. But what does it have to do with IR? I will come back to that. First consider an example from machine translation: given a sentence S, we need to find its translation T. Statistically this is argmax_T P(T|S), which by Bayes' rule equals argmax_T P(S|T)P(T). P(T) is the language model and P(S|T) is the channel probability. IR can proceed analogously: given a query q, find documents d via argmax_d P(d|q) = argmax_d P(q|d)P(d). P(d) is useful when you want to build a profile representing a user's interests, and it can act as a form of relevance feedback.
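    In code, the decomposition amounts to ranking in log space: score(d) = log P(d) + sum over query words of log P(w|d). A minimal sketch, with a uniform prior and crude add-one smoothing as placeholders (better estimates are the subject of the rest of this post):

        import math
        from collections import Counter

        def score(query_tokens, doc_tokens, vocab_size, prior=1.0):
            # log P(d) + sum over query words of log P(w|d)
            tf = Counter(doc_tokens)
            n = len(doc_tokens)
            s = math.log(prior)
            for w in query_tokens:
                s += math.log((tf[w] + 1) / (n + vocab_size))  # add-one smoothing
            return s

        # Rank documents by score; the prior could encode a user profile.
        print(score("cat dog".split(), "the dog chased the cat".split(), vocab_size=50))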

    What does P(q|d) represent? It can be viewed as a mental process that "produces" the query from a document. Ranking documents for a given query by P(q|d), the document we want is argmax_d P(q|d).

    There are generally two ways to evaluate P(q|d). One is based on word counts, smoothed against a "background" collection model to avoid zero probabilities, since words in the query do not necessarily appear in the document. The other evaluates the KL divergence between probability parameters θ_q and θ_d, viewing both query and document as outputs of mental processes: if θ_q is close to θ_d, the two processes are similar, so the documents under the corresponding θ_d are the ones we want.

    For method one, the problem is how to estimate a good P(q|d): you must avoid zero probabilities while keeping the estimates sensible. The simplest way is smoothing, but there are others. The definition of P(q|d) is up to the designer and can be expanded to sum_h P(q|h)P(h|d), where h is a hypothesis (in practice, usually a word). Some papers view P(q|h) as a translation model and P(h|d) as a language model; whatever they are called, together they constitute a form of smoothing. The P(q|h) part deals exclusively with the zero-probability problem through a training process (which is, of course, heavy work).
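    A sketch of method one, assuming Jelinek-Mercer interpolation with the collection as the "background"; the λ value and the translation table t are placeholders (t would come from the heavy training process mentioned above). Multiplying jm_prob over the query words gives P(q|d); swapping in expanded_prob gives the translation-model variant.

        def jm_prob(w, tf, doc_len, coll_tf, coll_len, lam=0.7):
            # lam * P_ml(w|d) + (1 - lam) * P(w|C): nonzero whenever w
            # occurs anywhere in the collection.
            p_doc = tf[w] / doc_len if doc_len else 0.0
            return lam * p_doc + (1 - lam) * coll_tf[w] / coll_len

        def expanded_prob(w, tf, doc_len, coll_tf, coll_len, t):
            # P(q_i|d) = sum_h P(q_i|h) P(h|d); h ranges over document
            # words, t[(w, h)] plays the role of the translation model P(q_i|h).
            return sum(t.get((w, h), 0.0) *
                       jm_prob(h, tf, doc_len, coll_tf, coll_len)
                       for h in tf)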

    For method two, zero probability is also a problem, because KL divergence works well only when the two distributions have similar support. A query obviously has far fewer words than a document, so how to compensate for the mismatch is the question.
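    A sketch of method two, again leaning on smoothing to complement the document model so its support covers the query words (this assumes every query word occurs at least once in the collection):

        import math
        from collections import Counter

        def kl(query_tokens, tf, doc_len, coll_tf, coll_len, lam=0.7):
            # KL(theta_q || theta_d); theta_q is the maximum-likelihood query
            # model, theta_d the JM-smoothed document model from above.
            q_tf = Counter(query_tokens)
            q_len = len(query_tokens)
            div = 0.0
            for w, c in q_tf.items():
                p_q = c / q_len
                p_d = lam * (tf[w] / doc_len) + (1 - lam) * coll_tf[w] / coll_len
                div += p_q * math.log(p_q / p_d)
            return div  # rank documents by ascending KL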

    This extension of P(q|d) buys flexibility at the price of complexity, but it is worth doing. For example, the translation model can capture personal interests, implemented with the EM algorithm (the algorithm needs training examples, which can be supplied by the individual user and thus reflect personal preferences), while the language model remains a fixed function.
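    As a sketch of how EM could fit such a translation table from user-supplied (query, document) pairs, here is an IBM Model 1-style loop; the pairs below are made-up placeholders, and real personalization data would replace them.

        from collections import defaultdict

        pairs = [("laptop battery", "notebook power cell lifetime"),
                 ("laptop screen", "notebook display panel")]
        pairs = [(q.split(), d.split()) for q, d in pairs]

        t = defaultdict(lambda: 1.0)  # t[(q_word, d_word)], uniform start
        for _ in range(10):           # EM iterations
            count = defaultdict(float)
            total = defaultdict(float)
            for q_toks, d_toks in pairs:
                for qw in q_toks:
                    z = sum(t[(qw, h)] for h in d_toks)   # E-step normalizer
                    for h in d_toks:
                        c = t[(qw, h)] / z
                        count[(qw, h)] += c
                        total[h] += c
            for key, c in count.items():                  # M-step
                t[key] = c / total[key[1]]

        print(t[("battery", "power")])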

Read more at:
http://www.lemurproject.org/lemur/background.php