IR Review (1)

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 02:12

IR stands for information retrieval.


text mining: (1) exploit information (2) discover patterns/trends

How to speed up text mining? Build an index.

Vector space model: (1) bag of words (2) every doc is a vector (3) terms are the axes


inverted index: for each term t, store the list (postings) of documents that contain t
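A minimal sketch of building such an index in Python. The whitespace tokenizer, the function name build_inverted_index and the sample docs are assumptions made for illustration only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the docs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase, split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    # Postings are kept sorted by doc ID so they can be merged later.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "new home sales", 2: "home sales rise", 3: "rise in july"}
inverted = build_inverted_index(docs)
```

Here inverted["home"] is the postings list [1, 2].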

query optimization: for an AND query, process postings lists in order of increasing size: start with the smallest set, then keep intersecting to cut it down further
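A small sketch of this ordering for a conjunctive (AND) query, assuming postings are sorted lists of doc IDs. The names intersect and intersect_many are illustrative.

```python
def intersect(p1, p2):
    """Merge-intersect two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_many(postings_lists):
    """Intersect all lists, starting with the shortest one."""
    ordered = sorted(postings_lists, key=len)
    if not ordered:
        return []
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:          # nothing left to match, stop early
            break
        result = intersect(result, postings)
    return result

and_result = intersect_many([[1, 2, 3, 4, 5, 8], [2, 8], [1, 2, 4, 8, 16]])
```

Starting from the shortest list keeps every intermediate result no larger than that list.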

structured data: information organized in tables

unstructured data: free text


stemming: crude affix chopping

postings with skip pointers: (1) fewer, longer skips: fewer pointer comparisons, but fewer successful skips (2) more, shorter skips: more likely to skip, but more comparisons
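A sketch of intersection with skip pointers. It assumes the postings are sorted Python lists and that a skip pointer sits at every position that is a multiple of about sqrt(list length), a common heuristic. The function name is illustrative.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, using evenly spaced skip pointers."""
    s1 = max(1, int(math.sqrt(len(p1))))   # skip length for p1
    s2 = max(1, int(math.sqrt(len(p2))))   # skip length for p2
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip pointer only if its target does not overshoot p2[j].
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return out

skip_a = [2, 4, 8, 16, 19, 23, 28, 43]
skip_b = [1, 2, 3, 5, 8, 41, 43, 60, 71]
skip_result = intersect_with_skips(skip_a, skip_b)
```

The result is the same as a plain merge; the skips only reduce how many entries are visited.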

How to handle phrase queries: (1) biword index (2) extended biword index (3) positional index
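A minimal sketch of option (3), a positional index answering a phrase query. The whitespace tokenizer and the names build_positional_index and phrase_search are assumptions for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions of the term in that doc]}"""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted IDs of docs containing the terms of phrase at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate docs must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # The k-th term must appear exactly k positions after the first term.
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[t][doc_id] for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

phrase_docs = {1: "to be or not to be", 2: "be not afraid to try", 3: "not to be missed"}
pos_index = build_positional_index(phrase_docs)
```

For example, phrase_search(pos_index, "to be") finds docs 1 and 3 but not doc 2, where both words occur but not next to each other.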


context-sensitive correction: (1) one word fixed at a time (candidates found by edit distance or n-gram overlap) (2) conjunction of biwords
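The two similarity measures named above can be sketched as follows: Levenshtein edit distance, and Jaccard overlap of character bigrams as one possible n-gram overlap measure. The function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def bigram_jaccard(a, b):
    """Jaccard coefficient of the character-bigram sets of a and b."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    if not ga and not gb:
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(ga | gb)
```

For instance, edit_distance("form", "from") is 2, so "from" is a plausible candidate when correcting "form" in the context of a query.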

Soundex: phonetic matching (words that sound alike get the same 4-character code)
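A sketch of a simplified textbook-style Soundex variant: keep the first letter, map the rest to digits, collapse runs of identical digits, drop zeros, pad to four characters. This is an assumption about which variant is meant; standard American Soundex treats h, w and the first letter's own code slightly differently.

```python
# Letter classes: vowels and h, w, y map to "0" and are dropped later.
CODES = {}
for letters, digit in (("aeiouhwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
                       ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return a 4-character code: first letter plus up to three digits."""
    letters = [c for c in word.lower() if c in CODES]   # ignore non-letters
    if not letters:
        return ""
    digits = [CODES[c] for c in letters[1:]]
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:         # collapse identical neighbours
            collapsed.append(d)
    tail = "".join(d for d in collapsed if d != "0")    # remove the zeros
    return (letters[0].upper() + tail + "000")[:4]      # pad and truncate
```

With this variant, "Herman" and "Hermann" both map to H655, so a misspelled name can still match.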


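Keep index data compressed on disk; read the compressed data and decompress it in memory, which is usually faster than reading uncompressed data from disk.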

BSBI (blocked sort-based indexing): accumulate and sort the postings for each block, write each sorted block to disk, then merge the blocks
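A toy sketch of the BSBI idea, assuming each block is a list of (term, doc_id) pairs that fits in memory. Real BSBI writes the sorted blocks to disk and merges them from there; here the blocks stay in memory and the function names are illustrative.

```python
import heapq
from itertools import groupby

def invert_block(pairs):
    """Sort one block of (term, doc_id) pairs by term, then doc ID."""
    return sorted(pairs)

def merge_blocks(sorted_blocks):
    """Merge sorted blocks into a single term -> postings list mapping."""
    merged = heapq.merge(*sorted_blocks)        # streaming multi-way merge
    index = {}
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        postings = []
        for _, doc_id in group:
            if not postings or postings[-1] != doc_id:   # skip duplicate doc IDs
                postings.append(doc_id)
        index[term] = postings
    return index

block1 = invert_block([("b", 1), ("a", 1)])
block2 = invert_block([("c", 3), ("a", 2), ("a", 2)])
bsbi_index = merge_blocks([block1, block2])
```

The merge reads every block sequentially, which is what makes the approach suitable for disk.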

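SPIMI (single-pass in-memory indexing), for each block:

    token <- next(token_stream)
    if term(token) is not in the dictionary: postings_list <- AddToDictionary(term(token))
    else: postings_list <- GetPostingsList(term(token))
    AddToPostingsList(postings_list, docID(token))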
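The SPIMI steps above can be sketched in Python for a single block. The token stream is assumed to be an iterable of (term, doc_id) pairs, and the name spimi_invert is illustrative.

```python
def spimi_invert(token_stream):
    """Build one in-memory block and return it as a list of (term, postings), sorted by term."""
    dictionary = {}
    for term, doc_id in token_stream:
        if term not in dictionary:
            postings = dictionary[term] = []     # AddToDictionary
        else:
            postings = dictionary[term]          # GetPostingsList
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)              # AddToPostingsList
    # Terms are sorted only once, when the block is about to be written out.
    return sorted(dictionary.items())

stream = [("a", 1), ("b", 1), ("a", 2), ("a", 2), ("c", 3)]
spimi_block = spimi_invert(stream)
```

Unlike BSBI, postings are appended directly to their term's list, so no global sort of (term, doc_id) pairs is needed.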


Dictionary as a string: (1) concatenate all terms into one long string; its total length is far shorter than fixed-width term entries
                        (2) blocking: store a pointer only to every k-th term in the string
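A sketch of dictionary-as-a-string with blocking. It assumes each term is stored with a one-character length prefix and that a pointer is kept for every k-th term. All function names are illustrative, and decoding every block's first term up front is done only for clarity.

```python
import bisect

def build_dictionary_string(terms, k=4):
    """Concatenate the sorted terms into one string; keep an offset for every k-th term."""
    parts, pointers, offset = [], [], 0
    for i, term in enumerate(sorted(terms)):
        if i % k == 0:
            pointers.append(offset)
        entry = chr(len(term)) + term       # length prefix, then the term itself
        parts.append(entry)
        offset += len(entry)
    return "".join(parts), pointers

def read_term(s, offset):
    """Decode the term at offset; return (term, offset of the next entry)."""
    length = ord(s[offset])
    end = offset + 1 + length
    return s[offset + 1:end], end

def lookup(s, pointers, k, term):
    """Return the rank of term in sorted order, or -1 if it is absent."""
    firsts = [read_term(s, p)[0] for p in pointers]
    block = bisect.bisect_right(firsts, term) - 1   # binary search over the blocks
    if block < 0:
        return -1
    offset = pointers[block]
    for i in range(k):                              # linear scan inside one block
        if offset >= len(s):
            break
        candidate, offset = read_term(s, offset)
        if candidate == term:
            return block * k + i
    return -1

vocab = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin", "szomo"]
dict_string, block_ptrs = build_dictionary_string(vocab, k=4)
```

A larger k saves more pointers but makes the scan inside a block longer.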


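tf-idf weighting: w = (1 + log TF) * log(N / DF) if TF > 0, else w = 0 (TF = frequency of the term in the doc, DF = number of docs containing the term, N = total number of docs)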
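A direct transcription of the weighting formula, assuming base-10 logarithms (the notes do not state the base). The function name tf_idf is illustrative.

```python
import math

def tf_idf(tf, df, n_docs):
    """Weight of a term in a doc: (1 + log10 tf) * log10(N / df), or 0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

rare_weight = tf_idf(tf=10, df=10, n_docs=1000)      # term occurs in few docs
common_weight = tf_idf(tf=10, df=1000, n_docs=1000)  # term occurs in every doc
```

A term that occurs in every doc gets weight 0 no matter how often it appears.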

idf has no effect on ranking for one-term queries.


When comparing vectors in the vector space, use the angle (cosine similarity) instead of the distance, because Euclidean distance is large for vectors of different lengths even when the term distributions are similar
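A small illustration of this point: a doc vector and the same vector doubled (for example, the doc appended to itself) point in the same direction but are far apart in Euclidean terms. The vectors are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    """Euclidean distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d = [3.0, 1.0, 0.0]
d_doubled = [6.0, 2.0, 0.0]     # same term proportions, twice the length
cos_sim = cosine(d, d_doubled)
dist = euclidean(d, d_doubled)
```

The cosine is 1 (up to rounding) while the distance is about 3.16, so only the angle reflects that the two docs have the same content mix.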


Index elimination: (1) only consider high-idf (rare) query terms (2) only consider docs containing many of the query terms

Champion list: for each term t, precompute the r docs of highest weight in t's postings

High and low lists (a generalization of champion lists): traverse the high (champion) lists first.

If they yield at least K docs
    select the top K and stop
else
    proceed to get docs from the low lists
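A sketch of this two-tier retrieval. It assumes postings are given as {term: {doc_id: weight}} and that a doc's score is the sum of its term weights over the query. All names and the sample weights are illustrative.

```python
import heapq

def build_high_low(postings, r):
    """Split each term's postings into high (r heaviest docs) and low (the rest)."""
    high, low = {}, {}
    for term, weights in postings.items():
        top = set(heapq.nlargest(r, weights, key=weights.get))
        high[term] = {d: w for d, w in weights.items() if d in top}
        low[term] = {d: w for d, w in weights.items() if d not in top}
    return high, low

def accumulate(lists, query_terms, scores=None):
    """Add the weights found in lists for the query terms onto scores."""
    scores = dict(scores or {})
    for term in query_terms:
        for doc_id, weight in lists.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return scores

def top_k(high, low, query_terms, k):
    """Score from the high lists; fall back to the low lists only if fewer than k docs."""
    scores = accumulate(high, query_terms)
    if len(scores) < k:
        scores = accumulate(low, query_terms, scores)
    return heapq.nlargest(k, scores, key=scores.get)

weights = {"car": {1: 0.9, 2: 0.8, 3: 0.1, 4: 0.05},
           "insurance": {2: 0.7, 3: 0.6, 5: 0.2}}
high_lists, low_lists = build_high_low(weights, r=2)
```

With r=2, a query for "car insurance" with k=2 is answered from the high lists alone; with k=4 the low lists are consulted as well.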



