IR Review (1)
Source: Internet  Editor: 程序博客网  Time: 2024/06/05 02:12
IR stands for information retrieval.
text mining: (1)exploit information (2)discover patterns/trends
how to speed up text mining? build an index
Vector space: (1)bag of words (2)every doc is a vector (3)terms are axes
inverted index: for each term t, store the list of documents that contain t (its postings list)
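A minimal sketch of building such an index in memory. The function name `build_inverted_index`, the sample documents, and the whitespace tokenizer are assumptions made for illustration only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "new home sales", 2: "home sales rise", 3: "new sales forecasts"}
index = build_inverted_index(docs)
print(index["home"])   # [1, 2]
```

Keeping each postings list sorted by doc ID is what makes linear-time merging of two lists possible.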
query optimization: for an AND query, start with the smallest postings list, then keep intersecting with the next smallest, so the intermediate result only shrinks
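The ordering heuristic can be sketched as follows, assuming postings are sorted lists of doc IDs. The names `intersect` and `and_query` and the sample postings are illustrative, not taken from the notes.

```python
def intersect(p1, p2):
    """Merge-intersect two sorted postings lists in O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    """AND all query terms, processing the shortest postings list first."""
    postings = sorted((index.get(t, []) for t in terms), key=len)
    if not postings:
        return []
    result = postings[0]
    for p in postings[1:]:
        if not result:          # nothing left to cut, stop early
            break
        result = intersect(result, p)
    return result

index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(and_query(index, ["brutus", "caesar", "calpurnia"]))   # [2]
```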
structured data: information in "tables"
unstructured data: free text
stemming: crude affix chopping
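postings with skip pointers, trade-off in skip length: (1)long skips: fewer pointer comparisons, but fewer successful skips (2)short skips: more successful skips, but more comparisons against skip pointers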
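A sketch of intersection with skip pointers. It assumes evenly spaced skips of length about sqrt(P) for a list of P postings, a common heuristic; here the skip pointers are implicit (index arithmetic) rather than stored. The function name is hypothetical.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, following skips when it is safe."""
    s1 = math.isqrt(len(p1)) or 1   # skip length for p1 (assumed heuristic)
    s2 = math.isqrt(len(p2)) or 1   # skip length for p2
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # A skip pointer sits at every index that is a multiple of s1.
            # Follow it only if its target does not overshoot p2[j].
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                while i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                    i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                while j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                    j += s2
            else:
                j += 1
    return out

print(intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                           [1, 2, 3, 5, 8, 41, 51, 60, 71]))   # [2, 8]
```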
How to search phrase queries:(1)biword index (2)extended biword (3)positional index
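As an illustration of option (3), a positional index stores, for each term and document, the positions at which the term occurs. A phrase matches when the terms appear at consecutive positions. The function names, tokenizer, and sample documents below are assumptions.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id -> [positions of the term in that document]}"""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return the sorted IDs of documents containing the exact phrase."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # Keep start positions s such that term k occurs at position s + k.
        starts = set(index[terms[0]][doc_id])
        for k, t in enumerate(terms[1:], start=1):
            starts &= {p - k for p in index[t][doc_id]}
        if starts:
            hits.append(doc_id)
    return hits

docs = {1: "to be or not to be", 2: "be to or not", 3: "not to be"}
pos_index = build_positional_index(docs)
print(phrase_query(pos_index, "to be"))   # [1, 3]
```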
context-sensitive correction: (1)fix one word at a time (candidates from edit distance, n-gram overlap) (2)conjunction of biwords
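Fixing one word at a time needs candidate replacements for that word. A minimal sketch using Levenshtein edit distance follows; the `suggest` helper, its `max_dist` parameter, and the toy vocabulary are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def suggest(word, vocabulary, max_dist=1):
    """Vocabulary words within max_dist edits, closest first."""
    scored = [(edit_distance(word, v), v) for v in vocabulary]
    return [v for d, v in sorted(scored) if d <= max_dist]

print(edit_distance("kitten", "sitting"))                  # 3
print(suggest("fron", ["from", "form", "front", "four"]))  # ['from', 'front']
```

For a multi-word query, each word would be replaced by its candidates in turn and the resulting queries compared, for example by number of hits.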
soundex phonetic matching
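A sketch of one simplified textbook variant of Soundex: keep the first letter, map the remaining letters to digits, collapse repeated digits, drop zeros, and pad to four characters. Other Soundex variants treat H, W, and the first letter's code differently, so treat the details as an assumption.

```python
# Letter classes and their digit codes; vowels and H, W, Y map to "0".
_CODES = {}
for _letters, _digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                         ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(term):
    """Return a 4-character code: first letter followed by three digits."""
    letters = [c for c in term.upper() if c in _CODES]   # ignore non-letters
    if not letters:
        return ""
    digits = [_CODES[c] for c in letters[1:]]
    collapsed = []
    for d in digits:                 # remove consecutive duplicate digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    tail = "".join(d for d in collapsed if d != "0")     # drop zeros
    return (letters[0] + tail + "000")[:4]               # pad, then truncate

print(soundex("Herman"), soundex("Hermann"))   # H655 H655
```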
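Keep data compressed on disk and decompress it after reading it into memory; reading compressed data and then decompressing is usually faster than reading uncompressed data.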
BSBI(blocked sort-based indexing): accumulate postings for each block, sort them and write the block to disk, then merge the blocks into one index
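A toy sketch of the BSBI idea. As a simplifying assumption, the sorted runs are kept in memory instead of being written to disk, and terms are used directly instead of term IDs. `bsbi_index` and `block_size` are illustrative names.

```python
import heapq
from itertools import groupby

def bsbi_index(pairs, block_size):
    """Build term -> sorted doc IDs from a stream of (term, doc_id) pairs."""
    runs, block = [], []
    for pair in pairs:
        block.append(pair)
        if len(block) == block_size:        # block is "full": sort and flush
            runs.append(sorted(set(block)))
            block = []
    if block:
        runs.append(sorted(set(block)))     # flush the last partial block
    index = {}
    # heapq.merge lazily performs the multi-way merge of the sorted runs.
    for term, group in groupby(heapq.merge(*runs), key=lambda p: p[0]):
        doc_ids = []
        for _, doc_id in group:
            if not doc_ids or doc_ids[-1] != doc_id:   # drop duplicates
                doc_ids.append(doc_id)
        index[term] = doc_ids
    return index

pairs = [("b", 2), ("a", 1), ("a", 3), ("b", 1), ("c", 2), ("a", 1)]
print(bsbi_index(pairs, block_size=2))   # {'a': [1, 3], 'b': [1, 2], 'c': [2]}
```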
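SPIMI(single-pass in-memory indexing): token <- next(token_stream)
if the token's term is not yet in the dictionary: (1)AddToDictionary, else: (2)GetPostingsList
then AddToPostingsList; when memory is full, sort the terms and write the block to disk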
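The SPIMI loop can be sketched as a generator that emits one block each time memory "fills up". Memory is modelled by a posting count (`max_postings`, a hypothetical parameter), blocks are yielded rather than written to disk, and doc IDs are assumed to arrive in non-decreasing order.

```python
def spimi_invert(token_stream, max_postings):
    """Yield sorted blocks, each a list of (term, postings_list) pairs."""
    dictionary = {}
    count = 0
    for term, doc_id in token_stream:
        if term not in dictionary:
            postings = dictionary[term] = []    # AddToDictionary
        else:
            postings = dictionary[term]         # GetPostingsList
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)             # AddToPostingsList
            count += 1
        if count >= max_postings:               # "memory full": emit a block
            yield sorted(dictionary.items())
            dictionary, count = {}, 0
    if dictionary:
        yield sorted(dictionary.items())

stream = [("b", 1), ("a", 1), ("a", 2), ("c", 2), ("a", 3)]
for block in spimi_invert(stream, max_postings=3):
    print(block)
# [('a', [1, 2]), ('b', [1])]
# [('a', [3]), ('c', [2])]
```

Unlike BSBI, postings are appended directly to their term's list, so no global sort of (term, doc) pairs is needed; only the terms are sorted when a block is emitted.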
Dictionary as a string: (1)total string length is far shorter than with fixed-width term entries
(2)blocking: store pointers to only every kth term string
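A sketch of the blocked layout, assuming each term is stored with a one-character length prefix and a pointer is kept only for the first term of every block of k terms. All function names are hypothetical, and a real implementation would work on bytes rather than Python strings.

```python
def build_blocked_dictionary(terms, k=4):
    """Concatenate sorted terms into one string; point at every kth term."""
    parts, pointers, offset = [], [], 0
    for i, term in enumerate(sorted(terms)):
        if i % k == 0:
            pointers.append(offset)         # pointer to the start of a block
        entry = chr(len(term)) + term       # length prefix, then the term
        parts.append(entry)
        offset += len(entry)
    return "".join(parts), pointers

def read_block(string, pointers, b):
    """Decode all terms of block b by walking the length prefixes."""
    end = pointers[b + 1] if b + 1 < len(pointers) else len(string)
    pos, terms = pointers[b], []
    while pos < end:
        n = ord(string[pos])
        terms.append(string[pos + 1:pos + 1 + n])
        pos += 1 + n
    return terms

def lookup(string, pointers, term):
    """Binary search over block first terms, then a linear scan in the block."""
    if not pointers:
        return False
    lo, hi = 0, len(pointers) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        p = pointers[mid]
        first = string[p + 1:p + 1 + ord(string[p])]
        if first <= term:
            lo = mid
        else:
            hi = mid - 1
    return term in read_block(string, pointers, lo)

terms = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
string, pointers = build_blocked_dictionary(terms, k=4)
print(pointers)                              # [0, 34]
print(lookup(string, pointers, "syzygy"))    # True
```

Larger k saves more pointer space but makes the in-block linear scan longer.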
tf-idf weighting: w = (1 + log TF) * log(N / DF) if TF > 0, else w = 0 (N = number of docs, DF = number of docs containing the term)
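A worked instance of the weight, assuming base-10 logarithms (the notes do not state the base). The function name and the sample numbers are illustrative.

```python
import math

def tf_idf(tf, df, n_docs):
    """(1 + log10 tf) * log10(N / df); zero when the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# A term occurring 10 times in a doc, found in 1,000 of 1,000,000 docs:
# (1 + 1) * 3 = 6
print(round(tf_idf(10, 1000, 1_000_000), 6))   # 6.0
```

A term that appears in every document gets weight 0, since log(N / N) = 0.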
idf has no effect on ranking one term queries.
When comparing vectors in the vector space model, use the angle between them instead of the distance, because Euclidean distance is large for vectors of different lengths even when their term distributions are similar
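A small illustration: a document and the same document appended to itself have identical term distributions, so the angle between them is zero, yet their Euclidean distance is large. The vectors and function names below are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    """Straight-line distance between the two vector endpoints."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d = [3, 4, 0]          # term counts of a document
d_twice = [6, 8, 0]    # the same document appended to itself

print(round(cosine(d, d_twice), 6))   # 1.0
print(euclidean(d, d_twice))          # 5.0
```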
Index elimination:(1)only consider high idf terms (2)only consider docs containing many query terms
Champion list: for each term t, the r docs of highest weight in t's postings (the "high" list); the remaining docs form the "low" list
If the champion lists yield at least K docs
select the top K and stop
else
proceed to get docs from the low lists
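The high/low procedure can be sketched as follows, assuming each term's postings are given as a {doc: weight} mapping and a document's score is the sum of its weights over the query terms. `build_lists`, `top_k`, and the sample weights are hypothetical.

```python
import heapq

def build_lists(weighted_postings, r):
    """Split each term's postings into a champion (high) list and a low list."""
    high, low = {}, {}
    for term, postings in weighted_postings.items():
        ranked = sorted(postings.items(), key=lambda kv: (-kv[1], kv[0]))
        high[term] = dict(ranked[:r])     # the r highest-weight docs
        low[term] = dict(ranked[r:])      # everything else
    return high, low

def top_k(query_terms, high, low, k):
    """Score docs from the high lists; use the low lists only if needed."""
    scores = {}

    def accumulate(lists):
        for t in query_terms:
            for doc, w in lists.get(t, {}).items():
                scores[doc] = scores.get(doc, 0.0) + w

    accumulate(high)
    if len(scores) < k:                   # not enough candidates yet
        accumulate(low)
    return heapq.nsmallest(k, scores, key=lambda d: (-scores[d], d))

weighted = {"a": {1: 0.9, 2: 0.5, 3: 0.1}, "b": {2: 0.8, 4: 0.7, 1: 0.2}}
high, low = build_lists(weighted, r=1)
print(top_k(["a", "b"], high, low, k=2))   # [1, 2]
print(top_k(["a", "b"], high, low, k=3))   # [2, 1, 4]
```

With k=2 only the champion lists are consulted, so the ranking is approximate: doc 2 would outrank doc 1 if the low lists were also scored, as the k=3 call shows.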