Lucene Study Notes (1): Lucene's Scoring Mechanism

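$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \bigl( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^{2} \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \bigr)$$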



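1. tf(t in d)

The term-frequency factor, computed from the number of times term t occurs in document d. The default formula is:

$$\mathrm{tf}(t \text{ in } d) = \mathrm{frequency}^{1/2}$$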

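2. idf(t)
The inverse document frequency. It is derived from docFreq, the number of documents (out of all documents in the index) in which the term t appears. The fewer documents contain a term, the better that term pinpoints them, so a lower docFreq yields a higher score. The default formula is: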

$$\mathrm{idf}(t) = 1 + \log\left( \frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1} \right)$$
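As a quick numeric check of the tf and idf formulas above, here is a minimal Java sketch. The class name `TfIdfSketch` and the sample numbers are made up for illustration. It mirrors the default formulas rather than calling Lucene itself, and it assumes `log` means the natural logarithm.

```java
// Minimal sketch of the default tf and idf formulas (not Lucene's own classes).
class TfIdfSketch {
    // tf(t in d) = sqrt(frequency of t in d)
    static double tf(int frequency) {
        return Math.sqrt(frequency);
    }

    // idf(t) = 1 + ln(numDocs / (docFreq + 1)); natural log is an assumption here.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // A term occurring 4 times in the document.
        System.out.println("tf         = " + tf(4));
        // A rare term (9 of 1000 docs) versus a common one (499 of 1000 docs).
        System.out.println("idf rare   = " + idf(9, 1000));
        System.out.println("idf common = " + idf(499, 1000));
    }
}
```

The rare term gets a noticeably larger idf than the common one, and because idf is squared in the score formula, the gap widens further in the final score.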


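4. queryNorm(q)
A normalizing factor. It does not affect the ranking itself; its main purpose is to make scores comparable across different queries (or different indexes). It is computed at search time. The default formula is:

$$\mathrm{queryNorm}(q) = \mathrm{queryNorm}(\mathrm{sumOfSquaredWeights}) = \frac{1}{\mathrm{sumOfSquaredWeights}^{1/2}}$$

The sum of squared weights (over the query's terms) is computed by the query's Weight object, and each kind of query computes it differently. For example, a Boolean query computes it as:

$$\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^{2} \cdot \sum_{t \in q} \bigl( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \bigr)^{2}$$

5. norm(t,d)
This function encapsulates a few index-time boost and length factors: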

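    • Document boost: the boost of the document, set when the document is indexed.

    • Field boost: the boost of a field, set when the field is added to the document. (Giving fields different boosts helps ranking; for example, a title should usually count for more than the body.)

    • lengthNorm(field): a value computed when the document is indexed, from the number of terms the field contains. Fields with fewer terms receive a higher value. The Similarity class computes it at index time; in DefaultSimilarity it is 1 / (numTerms in field)^0.5.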

$$\mathrm{norm}(t,d) = \mathrm{doc.getBoost}() \cdot \mathrm{lengthNorm}(\mathrm{field}) \cdot \prod_{\text{field } f \text{ in } d \text{ named as } t} f.\mathrm{getBoost}()$$
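The norm formula can be sketched in a few lines of Java. `NormSketch`, its method signatures, and the sample document are hypothetical; the length normalization used is the DefaultSimilarity one, 1 / sqrt(numTerms).

```java
// Sketch of norm(t,d) = docBoost * lengthNorm(field) * product of field boosts.
// This is an illustration, not Lucene's implementation.
class NormSketch {
    // Default length normalization: 1 / sqrt(number of terms in the field).
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // fieldBoosts holds the boost of every field instance in the document
    // that carries the same name as the term's field.
    static double norm(double docBoost, int numTerms, double... fieldBoosts) {
        double product = 1.0;
        for (double boost : fieldBoosts) {
            product *= boost;
        }
        return docBoost * lengthNorm(numTerms) * product;
    }

    public static void main(String[] args) {
        // Hypothetical document: a 4-term title boosted 2x, a 100-term body left unboosted.
        System.out.println("title norm = " + norm(1.0, 4, 2.0));
        System.out.println("body norm  = " + norm(1.0, 100, 1.0));
    }
}
```

With these numbers the short, boosted title ends up with a norm ten times larger than the long body, which is the effect the field boost and length normalization are meant to produce.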



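DefaultSimilarity is good enough for most search needs. In some applications, though, you may want to write your own Similarity to fit your requirements. For example, some people see no reason why shorter documents should score higher (see a "fair" similarity).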

If you want to know how other people have modified Similarity, see the Lucene mailing list thread Overriding Similarity. In summary, the common changes are:
  • SweetSpotSimilarity -- SweetSpotSimilarity gives small increases as the frequency increases a small amount and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.
  • Overriding tf -- In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these cases people have overridden Similarity to return 1 from the tf() method.
  • Changing Length Normalization -- By overriding lengthNorm, it is possible to discount how the length of a field contributes to a score. In DefaultSimilarity, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be 1 / (numTerms in field), all fields will be treated "fairly".
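The two simpler overrides in the list above can be sketched as follows. To keep the example self-contained it does not extend Lucene's Similarity: `BaseSimilaritySketch` is a hypothetical stand-in that contains only the two methods discussed here, with the default formulas from this article.

```java
// Stand-in for a Similarity-like class. This is NOT Lucene's API, only the
// two methods discussed above with their default formulas.
class BaseSimilaritySketch {
    double tf(double freq) { return Math.sqrt(freq); }
    double lengthNorm(int numTerms) { return 1.0 / Math.sqrt(numTerms); }
}

// "Overriding tf": any matching term counts the same, however often it occurs.
class BinaryTfSimilaritySketch extends BaseSimilaritySketch {
    @Override
    double tf(double freq) { return 1.0; }
}

// "Changing Length Normalization": 1 / numTerms instead of 1 / sqrt(numTerms).
class LinearLengthNormSketch extends BaseSimilaritySketch {
    @Override
    double lengthNorm(int numTerms) { return 1.0 / numTerms; }
}

class SimilarityOverrideDemo {
    public static void main(String[] args) {
        BaseSimilaritySketch[] sims = {
            new BaseSimilaritySketch(),
            new BinaryTfSimilaritySketch(),
            new LinearLengthNormSketch()
        };
        // Compare the same inputs under each variant.
        for (BaseSimilaritySketch sim : sims) {
            System.out.println(sim.getClass().getSimpleName()
                + ": tf(9)=" + sim.tf(9)
                + ", lengthNorm(100)=" + sim.lengthNorm(100));
        }
    }
}
```

In a real application the same overrides would go into a subclass of Lucene's own Similarity implementation; the exact method signatures depend on the Lucene version in use.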



  • Query -- The abstract object representation of the user's information need.
  • Weight -- The internal interface representation of the user's Query, so that Query objects may be reused.
  • Scorer -- An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities.

The Query Class
  • createWeight(Searcher searcher) -- A Weight is the internal representation of the Query, so each Query implementation must provide an implementation of Weight. See the subsection on The Weight Interface below for details on implementing the Weight interface.
  • rewrite(IndexReader reader) -- Rewrites queries into primitive queries. Primitive queries include TermQuery and BooleanQuery.
The Weight Interface


  • Weight#getQuery() -- Pointer to the Query that this Weight represents.
  • Weight#getValue() -- The weight for this Query. For example, the TermQuery.TermWeight value is equal to idf^2 * boost * queryNorm.
  • Weight#sumOfSquaredWeights() -- The sum of squared weights. For TermQuery, this is (idf * boost)^2.
  • Weight#normalize(float) -- Determine the query normalization factor. The query normalization may allow for comparing scores between queries.
  • Weight#scorer(IndexReader) -- Construct a new Scorer for this Weight. See The Scorer Class below for help defining a Scorer. As the name implies, the Scorer is responsible for doing the actual scoring of documents given the Query.
  • Weight#explain(IndexReader, int) -- Provide a means for explaining why a given document was scored the way it was.
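To see how the Weight values listed above fit together for a single term, here is a small Java sketch. `TermWeightSketch` is a hypothetical stand-in, not Lucene's TermWeight. It assumes a query boost of 1 and takes the idf and term boost as plain numbers.

```java
// Hypothetical stand-in for a term query's Weight; not Lucene's TermWeight.
class TermWeightSketch {
    private final double idf;
    private final double boost;
    private double queryNorm = 1.0; // until normalize() is called

    TermWeightSketch(double idf, double boost) {
        this.idf = idf;
        this.boost = boost;
    }

    // (idf * boost)^2, as described for TermQuery.
    double sumOfSquaredWeights() {
        double w = idf * boost;
        return w * w;
    }

    // Receives the query normalization factor computed by the caller.
    void normalize(double norm) {
        this.queryNorm = norm;
    }

    // idf^2 * boost * queryNorm
    double getValue() {
        return idf * idf * boost * queryNorm;
    }

    public static void main(String[] args) {
        TermWeightSketch weight = new TermWeightSketch(2.0, 1.5);
        // queryNorm = 1 / sqrt(sumOfSquaredWeights), assuming a query boost of 1.
        double queryNorm = 1.0 / Math.sqrt(weight.sumOfSquaredWeights());
        weight.normalize(queryNorm);
        System.out.println("queryNorm = " + queryNorm);
        System.out.println("value     = " + weight.getValue());
    }
}
```

For a single-term query with a query boost of 1, the normalized value reduces to idf^2 * boost / (idf * boost) = idf, which shows why queryNorm rescales scores without changing the order of the results.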
The Scorer Class

  • Scorer#next() -- Advances to the next document that matches this Query, returning true if and only if there is another document that matches.
  • Scorer#doc() -- Returns the id of the Document that contains the match. Is not valid until next() has been called at least once.
  • Scorer#score() -- Return the score of the current document. This value can be determined in any appropriate way for an application. For instance, the TermScorer returns the tf * Weight.getValue() * fieldNorm.
  • Scorer#skipTo(int) -- Skip ahead in the document matches to the document whose id is greater than or equal to the passed in value. In many instances, skipTo can be implemented more efficiently than simply looping through all the matching documents until the target document is identified.
  • Scorer#explain(int) -- Provides details on why the score came about.
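The iteration contract described above (next, doc, score, skipTo) can be illustrated with an in-memory stand-in. `ArrayScorerSketch` and its constructor arguments are hypothetical and it does not extend Lucene's Scorer. The score follows the TermScorer description, tf * weight value * fieldNorm, and skipTo uses the simple loop that real scorers are free to replace with something faster.

```java
// Hypothetical in-memory stand-in for a Scorer; not Lucene's class.
class ArrayScorerSketch {
    private final int[] docIds;        // ids of matching documents, ascending
    private final double[] tfs;        // tf of the term in each matching document
    private final double[] fieldNorms; // norm of the field in each matching document
    private final double weightValue;  // what Weight#getValue() would return
    private int pos = -1;              // positioned before the first match

    ArrayScorerSketch(int[] docIds, double[] tfs, double[] fieldNorms, double weightValue) {
        this.docIds = docIds;
        this.tfs = tfs;
        this.fieldNorms = fieldNorms;
        this.weightValue = weightValue;
    }

    // Advance to the next match; true if and only if there is one.
    boolean next() {
        pos++;
        return pos < docIds.length;
    }

    // Id of the current match; only valid after next() has returned true.
    int doc() {
        return docIds[pos];
    }

    // tf * Weight.getValue() * fieldNorm for the current match.
    double score() {
        return tfs[pos] * weightValue * fieldNorms[pos];
    }

    // Move forward to the first match whose id is >= target.
    // Linear scan for clarity; this is the part a real scorer can optimize.
    boolean skipTo(int target) {
        do {
            if (!next()) {
                return false;
            }
        } while (doc() < target);
        return true;
    }

    public static void main(String[] args) {
        ArrayScorerSketch scorer = new ArrayScorerSketch(
            new int[] {2, 5, 9},
            new double[] {1.0, 2.0, 3.0},
            new double[] {0.5, 0.25, 1.0},
            2.0);
        while (scorer.next()) {
            System.out.println("doc " + scorer.doc() + " score " + scorer.score());
        }
    }
}
```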

t.getBoost(): the boost of term t in the query. It is set at search time, either by the user in the query text or by the application.