Search Fundamentals: A Walkthrough of Building an Inverted Index

Source: Internet   Editor: 程序博客网   Date: 2024/05/22 00:39

Below are two documents that we want to index:

Doc1: He is a coder,and she is a coder too.
Doc2: Json is a doctor,but he was a coder.

Step 1: Extract the keywords

a. Tokenize (split on spaces; punctuation is dropped):

Doc1: [He] [is] [a] [coder] [and] [she] [is] [a] [coder] [too]
Doc2: [Json] [is] [a] [doctor] [but] [he] [was] [a] [coder]

b. Remove stopwords (words that carry no meaning for search; in this example: is, and, too, but, was):

Doc1: [He] [a] [coder] [she] [a] [coder]
Doc2: [Json] [a] [doctor] [he] [a] [coder]

c. Normalize (case, tense):

Doc1: [he] [a] [coder] [she] [a] [coder]
Doc2: [json] [a] [doctor] [he] [a] [coder]
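
The three sub-steps above can be sketched in Java roughly as follows. This is only a minimal sketch under a few assumptions: the class name KeywordExtractor and the method extract are made up for illustration, the stopword list is simply the set of words dropped in this example, and tense normalization (stemming) is left out. Lowercasing happens before the stopword check so that the check is case-insensitive; the resulting keyword lists are the same as in step c.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class KeywordExtractor {
    // Assumed stopword list: exactly the words removed in the example above.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("is", "and", "too", "but", "was"));

    // Splits on whitespace and punctuation, lowercases each token,
    // and keeps only the tokens that are not stopwords.
    static List<String> extract(String text) {
        List<String> keywords = new ArrayList<>();
        for (String token : text.split("[\\s\\p{Punct}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOPWORDS.contains(term)) {
                keywords.add(term);
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        // prints [he, a, coder, she, a, coder]
        System.out.println(extract("He is a coder,and she is a coder too."));
        // prints [json, a, doctor, he, a, coder]
        System.out.println(extract("Json is a doctor,but he was a coder."));
    }
}
```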

Step 2: Build the inverted index

// The documents in which each keyword appears

keyword     docs
[he]        1,2
[a]         1,2
[coder]     1,2
[she]       1
[json]      2
[doctor]    2
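
A minimal Java sketch of this simplest form of the index, mapping each keyword to the set of document ids that contain it. The names SimpleInvertedIndex, sampleDocs and build are hypothetical, and the input is assumed to be the normalized keyword lists from step 1.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class SimpleInvertedIndex {
    // The normalized keyword lists from step 1, keyed by document id.
    static Map<Integer, List<String>> sampleDocs() {
        Map<Integer, List<String>> docs = new LinkedHashMap<>();
        docs.put(1, Arrays.asList("he", "a", "coder", "she", "a", "coder"));
        docs.put(2, Arrays.asList("json", "a", "doctor", "he", "a", "coder"));
        return docs;
    }

    // Maps each keyword to the sorted set of document ids that contain it.
    static Map<String, TreeSet<Integer>> build(Map<Integer, List<String>> docs) {
        Map<String, TreeSet<Integer>> index = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<String>> doc : docs.entrySet()) {
            for (String term : doc.getValue()) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Prints one line per keyword followed by its document ids.
        build(sampleDocs()).forEach((term, ids) ->
                System.out.println("[" + term + "] " + ids));
    }
}
```

A set is used for the document ids so that a keyword occurring twice in the same document, such as [coder] in Doc1, is still listed only once.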

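// A better structure: for each keyword, record the documents it appears in, how often it appears in each (used to rank results), and where it appears (so matches can be located and highlighted quickly). Positions are 1-based indexes into the keyword lists from step 1c.

keyword     doc[times]     doc[positions]
[he]        1[1],2[1]      1[1],2[4]
[a]         1[2],2[2]      1[2,5],2[2,5]
[coder]     1[2],2[1]      1[3,6],2[6]
[she]       1[1]           1[4]
[json]      2[1]           2[1]
[doctor]    2[1]           2[3]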
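
One possible way to hold this richer structure in Java is a nested map: keyword, then document id, then the list of positions. The sketch below assumes exactly that layout; the names PositionalIndex, sampleDocs and build are hypothetical. The frequency is not stored separately because it equals the length of the position list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PositionalIndex {
    // The normalized keyword lists from step 1, keyed by document id.
    static Map<Integer, List<String>> sampleDocs() {
        Map<Integer, List<String>> docs = new LinkedHashMap<>();
        docs.put(1, Arrays.asList("he", "a", "coder", "she", "a", "coder"));
        docs.put(2, Arrays.asList("json", "a", "doctor", "he", "a", "coder"));
        return docs;
    }

    // keyword -> (document id -> 1-based positions in that document's keyword list).
    static Map<String, TreeMap<Integer, List<Integer>>> build(Map<Integer, List<String>> docs) {
        Map<String, TreeMap<Integer, List<Integer>>> index = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<String>> doc : docs.entrySet()) {
            List<String> terms = doc.getValue();
            for (int i = 0; i < terms.size(); i++) {
                index.computeIfAbsent(terms.get(i), k -> new TreeMap<>())
                     .computeIfAbsent(doc.getKey(), k -> new ArrayList<>())
                     .add(i + 1);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        build(sampleDocs()).forEach((term, postings) -> {
            StringBuilder times = new StringBuilder();
            StringBuilder positions = new StringBuilder();
            postings.forEach((docId, pos) -> {
                // The frequency is the number of recorded positions.
                times.append(docId).append('[').append(pos.size()).append("] ");
                positions.append(docId).append(pos).append(' ');
            });
            System.out.println("[" + term + "] " + times + "| " + positions);
        });
    }
}
```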

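Step 3: Search

a. Enter the query: doctor and coder
b. Extract the keywords from the query in (a): [doctor] [coder]
c. Look them up in the index: [coder] appears 2 times in Doc1 and 1 time in Doc2; [doctor] appears 1 time in Doc2.
d. Doc2 matches both keywords, so it is the more relevant document (when documents match the same number of keywords, they can be ranked by frequency instead). The results are therefore returned in the order: Doc2, Doc1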
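
The lookup and ranking in steps a to d might be sketched in Java as below. This is an illustration under stated assumptions, not a definitive scoring method: the names IndexSearch, sampleIndex, postings and search are hypothetical, the index is hard-coded from the frequency table in step 2, and ranking uses the number of matched keywords first and the summed frequency as the tie-breaker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class IndexSearch {
    // Assumed stopword list, the same words dropped when indexing.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("is", "and", "too", "but", "was"));

    // Builds one postings map from (doc id, frequency) pairs.
    static Map<Integer, Integer> postings(int... docAndCount) {
        Map<Integer, Integer> m = new TreeMap<>();
        for (int i = 0; i < docAndCount.length; i += 2) {
            m.put(docAndCount[i], docAndCount[i + 1]);
        }
        return m;
    }

    // keyword -> (doc id -> frequency), copied from the table in step 2.
    static Map<String, Map<Integer, Integer>> sampleIndex() {
        Map<String, Map<Integer, Integer>> index = new HashMap<>();
        index.put("he", postings(1, 1, 2, 1));
        index.put("a", postings(1, 2, 2, 2));
        index.put("coder", postings(1, 2, 2, 1));
        index.put("she", postings(1, 1));
        index.put("json", postings(2, 1));
        index.put("doctor", postings(2, 1));
        return index;
    }

    // Returns doc ids ordered by matched keyword count, then by total frequency.
    static List<Integer> search(Map<String, Map<Integer, Integer>> index, String query) {
        Map<Integer, Integer> matched = new TreeMap<>(); // doc id -> distinct keywords matched
        Map<Integer, Integer> total = new TreeMap<>();   // doc id -> summed frequency
        Set<String> seen = new HashSet<>();
        for (String term : query.toLowerCase(Locale.ROOT).split("[\\s\\p{Punct}]+")) {
            // Skip empty tokens, stopwords, and keywords already processed.
            if (term.isEmpty() || STOPWORDS.contains(term) || !seen.add(term)) {
                continue;
            }
            Map<Integer, Integer> hits = index.getOrDefault(term, Collections.emptyMap());
            for (Map.Entry<Integer, Integer> e : hits.entrySet()) {
                matched.merge(e.getKey(), 1, Integer::sum);
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        List<Integer> result = new ArrayList<>(matched.keySet());
        // Negated keys give descending order; remaining ties keep ascending doc id.
        result.sort(Comparator
                .comparingInt((Integer d) -> -matched.get(d))
                .thenComparingInt(d -> -total.get(d)));
        return result;
    }

    public static void main(String[] args) {
        // prints [2, 1]
        System.out.println(search(sampleIndex(), "doctor and coder"));
    }
}
```

With the query "coder" alone, both documents match one keyword, so the frequency decides and Doc1 comes first.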

Code implementation

Simulating the inverted index process in Java
