IR Review (1)

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 02:12

IR stands for information retrieval.

text mining: (1) exploit information (2) discover patterns/trends

how to speed up text mining? index

Vector space: (1)bag of words (2)every doc is a vector (3)terms are axes

inverted index: for each term t, store the list of documents that contain t (the postings list of t)
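
As a rough illustration of the definition above, here is a minimal sketch of building an inverted index. The function name `build_inverted_index`, the whitespace tokenization, and the three toy documents are assumptions made for the example, not part of the notes.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs that contain it.

    docs: dict of docID -> text (tokenized here by lowercasing and
    splitting on whitespace, which is a simplifying assumption).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Postings are kept sorted by docID so that lists can be merged linearly.
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical toy collection.
docs = {1: "new home sales", 2: "home sales rise", 3: "new rise"}
index = build_inverted_index(docs)
print(index["home"])  # [1, 2]
```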

query optimization: for an AND query, process postings lists in order of increasing length: start with the smallest set, then keep cutting it further by intersecting with the next list
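
The ordering heuristic can be sketched as follows. This is only an illustration: `intersect`, `and_query`, and the example postings are hypothetical names and data chosen for the sketch.

```python
def intersect(p1, p2):
    """Linear merge of two docID lists sorted in increasing order."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    """Intersect the postings of all terms, shortest list first."""
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        if not result:  # nothing left to cut, stop early
            break
        result = intersect(result, postings)
    return result

# Hypothetical postings lists.
example_index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(and_query(example_index, ["brutus", "caesar", "calpurnia"]))  # [2]
```

Starting from the shortest list keeps every intermediate result no longer than that list, so later intersections have less to scan.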

structured data: information in "table"

unstructured data: free text

stemming: crude affix chopping

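postings with skip pointers, the trade-off: (1) fewer, longer skips: few pointer comparisons, but fewer successful skips (2) more, shorter skips: skips succeed more often, but more comparisons against skip pointers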
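
A minimal sketch of intersection with skip pointers. It assumes skips are placed evenly every sqrt(length) entries (a common heuristic, not something the notes specify) and simulates the pointers with index arithmetic rather than storing them. The names `can_skip` and `intersect_with_skips` are made up for the example.

```python
import math

def can_skip(postings, pos, step, target):
    """True if a skip pointer starts at pos and its target is still <= target."""
    return (pos % step == 0
            and pos + step < len(postings)
            and postings[pos + step] <= target)

def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists, following skip pointers when safe."""
    step1 = max(1, int(math.sqrt(len(p1))))
    step2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if can_skip(p1, i, step1, p2[j]):
                # Everything jumped over is smaller than p2[j], so no match is lost.
                while can_skip(p1, i, step1, p2[j]):
                    i += step1
            else:
                i += 1
        else:
            if can_skip(p2, j, step2, p1[i]):
                while can_skip(p2, j, step2, p1[i]):
                    j += step2
            else:
                j += 1
    return out

a = [2, 4, 8, 16, 19, 23, 28, 43]
b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(a, b))  # [2, 8]
```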

How to search phrase queries:(1)biword index (2)extended biword (3)positional index
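
To make option (3) concrete, here is a small sketch of a positional index and a phrase lookup on top of it. The function names, the whitespace tokenization, and the toy documents are assumptions for illustration only.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {docID -> list of positions where the term occurs}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return docIDs in which the terms of phrase occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only docs containing every term can contain the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            # The k-th term must appear exactly k positions after the first.
            if all(start + k in index[t][doc_id] for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

# Hypothetical toy collection.
pos_docs = {1: "to be or not to be", 2: "be to or not", 3: "not to be"}
pos_index = build_positional_index(pos_docs)
print(phrase_query(pos_index, "to be"))  # [1, 3]
```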

context-sensitive correction: (1) one word fixed at a time (edit distance, n-gram overlap) (2) conjunction of biwords
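
Edit distance, mentioned above, can be sketched with the standard Levenshtein dynamic program (insert, delete, and substitute each cost 1). The helper `closest` and its tiny vocabulary are hypothetical and only show how a single word might be corrected.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]

def closest(word, vocabulary):
    """Pick the vocabulary entry with the smallest edit distance to word."""
    return min(vocabulary, key=lambda v: edit_distance(word, v))

print(edit_distance("kitten", "sitting"))  # 3
print(closest("informaton", ["information", "retrieval", "formation"]))  # information
```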

soundex phonetic matching
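
A sketch of one commonly taught four-character Soundex variant: keep the first letter, map the remaining letters to digits, collapse runs of identical digits, drop zeros, and pad to length four. This simplified variant is an assumption; production Soundex implementations differ in details such as how the first letter and H/W are treated.

```python
# Letter classes of the simplified variant; '0' marks letters that are dropped.
CODES = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return a 4-character phonetic code, or '' if word has no ASCII letters."""
    word = "".join(ch for ch in word.upper() if ch in CODES)
    if not word:
        return ""
    first, rest = word[0], word[1:]
    collapsed = []
    for d in (CODES[ch] for ch in rest):
        if not collapsed or collapsed[-1] != d:  # drop consecutive duplicates
            collapsed.append(d)
    digits = "".join(d for d in collapsed if d != "0")
    return (first + digits + "000")[:4]

print(soundex("Hermann"), soundex("Herman"))  # H655 H655
```

Names that sound alike map to the same code, so a query for one spelling can be matched against the other.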

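Compress data on disk and decompress it after it is read: reading compressed data and then decompressing it in memory is usually faster than reading the uncompressed data from disk.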

BSBI (blocked sort-based indexing): accumulate and sort the postings for each block, write each sorted block to disk, then merge the blocks
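
A toy sketch of the two BSBI phases. It keeps blocks in memory and uses term strings directly, whereas a real implementation would write each sorted block to disk and typically work with termIDs. `invert_block` and `merge_blocks` are hypothetical names.

```python
import heapq
from itertools import groupby

def invert_block(pairs):
    """Phase 1: sort the (term, docID) pairs of one block, dropping duplicates."""
    return sorted(set(pairs))

def merge_blocks(blocks):
    """Phase 2: merge the sorted blocks into term -> sorted docID list."""
    merged = heapq.merge(*blocks)  # streams the sorted runs in global order
    index = {}
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        doc_ids = []
        for _, doc_id in group:
            if not doc_ids or doc_ids[-1] != doc_id:  # skip cross-block duplicates
                doc_ids.append(doc_id)
        index[term] = doc_ids
    return index

block1 = invert_block([("b", 2), ("a", 1), ("b", 1)])
block2 = invert_block([("a", 3), ("c", 3), ("b", 3)])
bsbi_index = merge_blocks([block1, block2])
print(bsbi_index)  # {'a': [1, 3], 'b': [1, 2, 3], 'c': [3]}
```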

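SPIMI (single-pass in-memory indexing), for each token of a block:

(1) token <- next(token_stream)

(2) if the token's term is not yet in the dictionary: AddToDictionary; otherwise: GetPostingsList

(3) AddToPostingsList with the token's docID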
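
The steps above can be sketched for a single block as follows. The sketch assumes the token stream yields (term, docID) pairs and that the block fits in memory; a full SPIMI run would write each block to disk when memory fills up and merge the blocks at the end. `spimi_invert` is a made-up name.

```python
def spimi_invert(token_stream):
    """Build the sorted dictionary and postings for one block of tokens."""
    dictionary = {}
    for term, doc_id in token_stream:
        if term not in dictionary:
            postings = dictionary[term] = []  # AddToDictionary
        else:
            postings = dictionary[term]       # GetPostingsList
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)           # AddToPostingsList
    # Terms are sorted only once, when the block is finished.
    return sorted(dictionary.items())

tokens = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("caesar", 2)]
print(spimi_invert(tokens))  # [('brutus', [1]), ('caesar', [1, 2])]
```

Unlike BSBI, postings go straight into per-term lists, so no sort over (term, docID) pairs is needed.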

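Dictionary as a string: (1) concatenate all terms into one long string; its total length is far shorter than a fixed-width array of terms

(2) Blocking: store pointers to every kth term string only, and store each term's length so the terms inside a block can be scanned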
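
A small sketch of the dictionary-as-a-string idea with blocking. It assumes each term is prefixed by a single character holding its length and that a pointer is kept for every kth term; the function names and the sample vocabulary are hypothetical.

```python
def build_blocked_dictionary(sorted_terms, k=4):
    """Return (one long string, offsets of every k-th term)."""
    parts, pointers, offset = [], [], 0
    for i, term in enumerate(sorted_terms):
        if i % k == 0:
            pointers.append(offset)    # pointer only at the start of each block
        entry = chr(len(term)) + term  # one length character, then the term
        parts.append(entry)
        offset += len(entry)
    return "".join(parts), pointers

def read_term(string, offset):
    """Decode the term stored at offset; also return the next offset."""
    length = ord(string[offset])
    end = offset + 1 + length
    return string[offset + 1:end], end

def lookup(string, pointers, k, term):
    """Binary search over block heads, then a linear scan inside one block."""
    lo, hi, block = 0, len(pointers) - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        head, _ = read_term(string, pointers[mid])
        if head <= term:
            block, lo = mid, mid + 1
        else:
            hi = mid - 1
    if block < 0:
        return False
    offset = pointers[block]
    for _ in range(k):
        if offset >= len(string):
            break
        candidate, offset = read_term(string, offset)
        if candidate == term:
            return True
    return False

vocab = sorted(["auto", "automata", "automate", "automatic",
                "automation", "bar", "baz"])
dict_string, dict_pointers = build_blocked_dictionary(vocab, k=4)
print(lookup(dict_string, dict_pointers, 4, "automation"))  # True
```

A larger k saves more pointer space but makes the scan inside a block longer.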

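tf-idf weighting: w = (1 + log TF) * log(N / DF), where TF is the term's frequency in the document, N is the number of documents, and DF is the number of documents containing the term (w = 0 when TF = 0)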
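
A worked instance of the weighting formula. The notes do not state the base of the logarithm, so base 10 is an assumption here; `tf_idf` is a hypothetical helper name.

```python
import math

def tf_idf(tf, df, n_docs):
    """w = (1 + log10 tf) * log10(n_docs / df); 0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# A term occurring 10 times in a doc and in 1,000 of 1,000,000 docs:
# (1 + 1) * 3 = 6
weight = tf_idf(tf=10, df=1_000, n_docs=1_000_000)
print(round(weight, 6))  # 6.0
```

A term that occurs in every document gets idf = log(1) = 0 and therefore weight 0, however often it appears.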

idf has no effect on ranking for one-term queries, because every matching document's score is scaled by the same idf factor.

For comparing vectors in the vector space, use the angle (cosine similarity) instead of the distance, because the Euclidean distance is large for vectors of different lengths even when the term distributions are similar.
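
A short numeric illustration of this point: a document vector and the same vector doubled (as if the document were appended to itself) point in the same direction but lie far apart. The function names and the vector values are made up for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d = [1.0, 2.0, 0.0]
d_doubled = [2 * x for x in d]  # same term proportions, twice the length

print(cosine_similarity(d, d_doubled))   # very close to 1: same direction
print(euclidean_distance(d, d_doubled))  # sqrt(5), about 2.24: far apart
```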

Index elimination:(1)only consider high idf terms (2)only consider docs containing many query terms

Champion list: the r docs of highest weight in t's postings

If the champion (high) lists of the query terms yield at least K docs

select the top K and stop

else

proceed to get docs from the low lists
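
The champion-list idea and the high/low fallback can be sketched like this. The data layout (term to docID lists), the `score` callback, and all names are assumptions for illustration; the notes do not prescribe them.

```python
import heapq

def champion_list(weighted_postings, r):
    """Keep the r docs of highest weight from a list of (docID, weight) pairs."""
    best = heapq.nlargest(r, weighted_postings, key=lambda pair: pair[1])
    return sorted(doc_id for doc_id, _ in best)

def top_k(query_terms, high, low, score, k):
    """Use the high lists first; fall back to the low lists if fewer than k docs."""
    candidates = set()
    for term in query_terms:
        candidates.update(high.get(term, []))
    if len(candidates) < k:
        for term in query_terms:
            candidates.update(low.get(term, []))
    return heapq.nlargest(k, candidates, key=score)

# Hypothetical lists and precomputed scores.
high = {"ir": [1, 2], "index": [2, 3]}
low = {"ir": [4, 5], "index": [6]}
scores = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.95, 5: 0.1, 6: 0.2}

print(champion_list([(1, 0.2), (2, 0.9), (3, 0.5)], r=2))     # [2, 3]
print(top_k(["ir", "index"], high, low, scores.get, k=2))  # [1, 2]
print(top_k(["ir", "index"], high, low, scores.get, k=4))  # [4, 1, 2, 3]
```

With k=2 the high lists suffice, so doc 4 is never seen even though it scores highest; that loss of accuracy is the price of the speedup.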

0 0