Xiao Cao on Technology: Index and Dictionary Structures

Source: Internet  Editor: 程序博客网  Time: 2024/04/30 09:42

Hash-table-based index structures: exact match

Fast lookups and simple to implement, but no support for partial matching.
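As a rough illustration of that trade-off, here is a minimal sketch using a Python dict as the hash table. The toy data and the helper names `exact_lookup` and `prefix_lookup` are made up for this example.

```python
# Hypothetical toy dictionary: term -> IDs of the documents containing it.
index = {
    "trie": [1, 4],
    "suffix": [2],
    "lucene": [3],
}


def exact_lookup(index, key):
    """Exact match: a single hash probe, O(1) on average."""
    return index.get(key, [])


def prefix_lookup(index, prefix):
    """Partial match: the hash gives no help, so every key must be scanned."""
    return sorted(k for k in index if k.startswith(prefix))


print(exact_lookup(index, "trie"))   # the full key is found directly
print(exact_lookup(index, "tri"))    # a partial key finds nothing
print(prefix_lookup(index, "tri"))   # works, but costs a full scan
```

The full scan in `prefix_lookup` is the cost that the tree-based structures in the next section avoid.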

Prefix-tree- and suffix-tree-based index structures: partial match

An implementation of a prefix tree:

http://whiteboxcomputing.com/java/prefix_tree/

In addition to the efficiency, trie also provides flexibility in searching for the closest path in case that the key is misspelled. For example, by skipping a certain character in the key while walking, we can fix the insertion kind of typo. By walking toward all the immediate children of one node without consuming a character from the key, we can fix the deletion typo, or even substitution typo if we just drop the key character that has no branch to go and descend to all the immediate children of the current node.
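The three typo fixes described above can be sketched on a plain dict-based trie. This is an illustrative sketch only, not the code of any of the linked libraries; the class and method names (`Trie`, `with_prefix`, `fuzzy`) and the `max_typos` parameter are assumptions made for the example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_key = False  # True if a stored key ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True

    def _walk(self, s):
        """Follow s from the root; return the node reached, or None."""
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_key

    def with_prefix(self, prefix):
        """Partial match: all stored keys that start with prefix."""
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, path = stack.pop()
            if node.is_key:
                found.append(path)
            for ch, child in node.children.items():
                stack.append((child, path + ch))
        return sorted(found)

    def fuzzy(self, key, max_typos=1):
        """Stored keys reachable from key with at most max_typos typos."""
        results = set()

        def walk(node, i, typos, path):
            if i == len(key) and node.is_key:
                results.add(path)
            if i < len(key) and key[i] in node.children:
                walk(node.children[key[i]], i + 1, typos, path + key[i])
            if typos == 0:
                return
            if i < len(key):
                # Insertion typo: skip one character of the key.
                walk(node, i + 1, typos - 1, path)
            for ch, child in node.children.items():
                # Deletion typo: descend without consuming a key character.
                walk(child, i, typos - 1, path + ch)
                if i < len(key) and ch != key[i]:
                    # Substitution typo: drop the key character and descend.
                    walk(child, i + 1, typos - 1, path + ch)

        walk(self.root, 0, max_typos, "")
        return sorted(results)


trie = Trie(["index", "inverted", "trie", "tree"])
print(trie.with_prefix("in"))  # ['index', 'inverted']
print(trie.fuzzy("indexx"))    # ['index']  (extra character in the key)
print(trie.fuzzy("indx"))      # ['index']  (missing character in the key)
```

The search branches at every node while typos remain, so the cost grows quickly with `max_typos`; one typo is usually a reasonable default for a sketch like this.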

A double-array trie implementation, datrie:

http://linux.thai.net/~thep/datrie/datrie.html

Static Double Array Trie (DASTrie): available on Windows!

http://www.chokkan.org/software/dastrie/

Suffix trees

http://www.allisons.org/ll/AlgDS/Tree/Suffix/

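Suffix trees can efficiently solve problems such as finding the longest common substring of several strings and the longest repeated substring of a single string.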

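The two structures index different things. A prefix tree is built in advance from the patterns to be matched, whereas a suffix tree is built from the text to be processed.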
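To make the "index the text, not the patterns" idea concrete, here is a minimal sketch of a naive suffix trie. It takes O(n^2) time and space, so it is not one of the fast construction algorithms and not a compressed suffix tree. The class name `SuffixTrie` and the terminator `END` are assumptions for this example, and the text is assumed not to contain `END`.

```python
END = "\0"  # assumed terminator; must not occur in the indexed text


class SuffixTrie:
    """Naive suffix trie: every suffix of text + END is inserted as a path."""

    def __init__(self, text):
        text += END
        self.root = {}
        for start in range(len(text)):
            node = self.root
            for ch in text[start:]:
                node = node.setdefault(ch, {})

    def contains(self, pattern):
        """True if pattern occurs in the text: walk len(pattern) edges."""
        node = self.root
        for ch in pattern:
            if ch not in node:
                return False
            node = node[ch]
        return True

    def longest_repeated_substring(self):
        """Path to the deepest branching node.

        A node with two or more children is a substring that is followed by
        at least two different characters, so it occurs at least twice.
        """
        best = ""
        stack = [(self.root, "")]
        while stack:
            node, path = stack.pop()
            if len(node) >= 2 and len(path) > len(best):
                best = path
            for ch, child in node.items():
                if ch != END:
                    stack.append((child, path + ch))
        return best


st = SuffixTrie("banana")
print(st.contains("nan"))               # True
print(st.contains("nab"))               # False
print(st.longest_repeated_substring())  # ana
```

Once the text is indexed, each pattern query costs time proportional to the pattern length, independent of the text length.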

A fast suffix tree construction algorithm

http://www.blogjava.net/Files/zellux/SuffixT1withFigs.rar

Suffix tree implementations:

http://sfxdisk.dead-inside.org/

http://mila.cs.technion.ac.il/~yona/suffix_tree/

Inverted Index

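An inverted index is an index structure that supports quickly finding which documents a term (or several terms) appears in. Lucene implements an inverted index.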

http://lucene.apache.org/
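The core idea can be sketched in a few lines. This is a toy illustration, not Lucene's API: the function names `build_inverted_index` and `search`, the whitespace tokenizer, and the sample documents are all assumptions for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs in which it appears."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer
            index[term].add(doc_id)
    return index


def search(index, *terms):
    """Return the sorted IDs of documents that contain every given term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample documents.
docs = {
    1: "hash table index",
    2: "suffix tree index",
    3: "inverted index in Lucene",
}
index = build_inverted_index(docs)
print(search(index, "index"))            # [1, 2, 3]
print(search(index, "suffix", "index"))  # [2]
```

A multi-term query is just an intersection of posting sets, which is why the structure handles "one term or several terms" equally well.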

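Index Compression

When an index grows too large, it needs to be compressed.

Common compression methods:

Text compression: Huffman coding

Prefix compression: merge all consecutive nodes on a non-branching path into a single node.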

Suffix Compression

[Aoe1989] also suggested a storage compression strategy, by splitting non-branching suffixes into single string storages, called tail, so that the rest non-branching steps are reduced into mere string comparison.

With the two separate data structures, double-array branches and suffix-spool tail, key insertion and deletion algorithms must be modified accordingly.
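The tail idea can be sketched on an ordinary dict-based trie. This is not a double-array implementation and not the [Aoe1989] algorithm itself, only an illustration of storing a non-branching suffix as one string and splitting it on insertion. The names `Node`, `TailTrie` and `END` are assumptions, keys are assumed not to contain `END`, and deletion is left out.

```python
END = "\0"  # assumed terminator, so that no key is a prefix of another


class Node:
    def __init__(self, tail=None):
        self.children = {}  # branching part: character -> Node
        self.tail = tail    # leaf only: the rest of the key, kept as a string


class TailTrie:
    def __init__(self):
        self.root = Node()

    def insert(self, key):
        key += END
        node, i = self.root, 0
        while True:
            if node.tail is not None:
                # Reached a leaf: compare the stored tail with the rest.
                old, rest = node.tail, key[i:]
                if old == rest:
                    return  # key is already stored
                # Split: the shared part of the tail becomes real branches.
                node.tail = None
                p = 0
                while old[p] == rest[p]:
                    node = node.children.setdefault(old[p], Node())
                    p += 1
                node.children[old[p]] = Node(tail=old[p + 1:])
                node.children[rest[p]] = Node(tail=rest[p + 1:])
                return
            child = node.children.get(key[i])
            if child is None:
                # No branch yet: store the whole remaining suffix as a tail.
                node.children[key[i]] = Node(tail=key[i + 1:])
                return
            node, i = child, i + 1

    def contains(self, key):
        key += END
        node, i = self.root, 0
        while node.tail is None:
            child = node.children.get(key[i]) if i < len(key) else None
            if child is None:
                return False
            node, i = child, i + 1
        # The remaining steps are a single string comparison.
        return node.tail == key[i:]


trie = TailTrie()
for word in ["index", "inverted", "indexes"]:
    trie.insert(word)
print(trie.contains("inverted"))  # True
print(trie.contains("invert"))    # False
```

The split inside `insert` is the kind of modification the quoted passage refers to: once suffixes live in a separate tail store, inserting a key that shares part of a tail has to move that shared part back into the branching structure.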

Compression of inverted indexes: