Lucene Analyzers


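Ever since I stopped studying Android, my motivation has not been what it used to be. I don't know whether that was the right call or not; either way, I'm still on a break...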

Tokenization

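Sample sentence: 尽管战斗圣皇很强，but I will beat him someday.

(The Chinese half means roughly "Although the Battle Saint Emperor is very strong".)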

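First, tokenize it with WhitespaceAnalyzer, which splits on whitespace and nothing else. The result is:

尽管战斗圣皇很强，but | I | will | beat | him | someday.

The whole Chinese clause stays glued to "but", because the full-width comma is not whitespace.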
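To make the whitespace rule concrete, here is a minimal plain-Java sketch that needs no Lucene on the classpath. It is not Lucene's implementation; the class name WhitespaceSketch and its Token type are names I made up for illustration. It only shows the idea: cut at whitespace, leave punctuation attached, and record where each token starts and ends.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of whitespace tokenization; not Lucene's implementation. */
class WhitespaceSketch {

    /** One token plus its character offsets in the input (endOffset is exclusive). */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return startOffset + " - " + endOffset + " : " + term;
        }
    }

    /** Splits text on whitespace only; punctuation stays attached to its token. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int start = -1; // -1 means "not inside a token"
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!boundary && start < 0) {
                start = i; // a token begins here
            } else if (boundary && start >= 0) {
                tokens.add(new Token(text.substring(start, i), start, i));
                start = -1; // the token ended just before i
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("尽管战斗圣皇很强，but I will beat him someday.")) {
            System.out.println(t);
        }
    }
}
```

Run on the sample sentence, the first line printed is "0 - 12 : 尽管战斗圣皇很强，but" and the last is "29 - 37 : someday.", six tokens in total.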

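Other analyzers may filter out punctuation, and some also filter out articles and prepositions.

StandardAnalyzer, on the other hand, turns every single Chinese character of the sentence into its own token, like this: 战|斗|圣|皇 ...

Clearly this is not a good approach.

For one thing, if it worked well, nobody would have needed to invent other analyzers; everyone would happily use this one.

For another, splitting this way means no word carries any emphasis: every character is treated the same.

The more pieces the text is cut into, the slower queries become, and once the meaning of the words is broken up, query precision suffers as well.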
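The per-character split described above can be imitated in a few lines of plain Java. This is only a rough sketch under my own simplifying assumptions: it treats every Han character as its own token and groups runs of other letters and digits into words. It is not Lucene's StandardTokenizer, and it leaves out the lowercasing and stop-word handling that StandardAnalyzer may apply. UnigramSketch is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;

/** Rough sketch of per-character splitting of Han text; not Lucene's StandardTokenizer. */
class UnigramSketch {

    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder(); // pending run of non-Han letters/digits
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isHan(cp)) {
                flush(word, tokens);
                tokens.add(new String(Character.toChars(cp))); // one token per Han character
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp);
            } else {
                flush(word, tokens); // whitespace and punctuation end the current word
            }
        }
        flush(word, tokens);
        return tokens;
    }

    /** Emits the pending word, if any, and clears the buffer. */
    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(String.join("|",
                tokenize("尽管战斗圣皇很强，but I will beat him someday.")));
    }
}
```

For the sample sentence this prints 尽|管|战|斗|圣|皇|很|强|but|I|will|beat|him|someday, that is, eight one-character tokens for what a reader sees as roughly four words.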

Tokenization demo

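import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
// import org.wltea.analyzer.lucene.IKAnalyzer; // only needed for the IKAnalyzer variant below

public class AnalyzerDemo { // the class name is arbitrary

    public static void main(String[] args) throws Exception {
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
        // Analyzer analyzer = new IKAnalyzer(true);
        TokenStream tokenStream = null;
        try {
            // "hehe" is just a field name; it does not affect the tokens produced here.
            tokenStream = analyzer.tokenStream("hehe",
                    new StringReader("尽管战斗圣皇很强，but I will beat him someday."));

            // Register the attributes we want to read for each token.
            OffsetAttribute offset = tokenStream.addAttribute(OffsetAttribute.class);
            CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute position = tokenStream.addAttribute(PositionIncrementAttribute.class);
            TypeAttribute type = tokenStream.addAttribute(TypeAttribute.class);

            // Reset the TokenStream; this is required before the first incrementToken().
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                // Prints: positionIncrement startOffset - endOffset : term | type
                System.out.println(position.getPositionIncrement() + " " + offset.startOffset() + " - "
                        + offset.endOffset() + " : " + term.toString() + " | " + type.type());
            }
            tokenStream.end();
        } finally {
            if (tokenStream != null) {
                tokenStream.close();
            }
            analyzer.close();
        }
    }
}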

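There are a handful of concepts in total:

position increment, offset, term, type, flags, and payload.

The first two are about position, the middle two need little explanation, and the last two I have not looked into yet.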
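The difference between the two position-related values is easiest to see when a token gets dropped. Offsets are character positions in the original text, while the position increment counts token slots since the previous emitted token, so a removed token leaves a gap of 2. The following is a plain-Java sketch of that bookkeeping, as far as I understand it; it does not use Lucene, and the class name PositionIncrementSketch and the stop word "will" are just my choices for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of how dropping a token shows up in the next token's position increment. */
class PositionIncrementSketch {

    static final class Token {
        final String term;
        final int positionIncrement; // token slots since the previous emitted token
        final int startOffset;       // character index where the term starts
        final int endOffset;         // character index just past the term

        Token(String term, int positionIncrement, int startOffset, int endOffset) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return positionIncrement + " " + startOffset + " - " + endOffset + " : " + term;
        }
    }

    /** Whitespace tokenization that skips stop words but remembers the gap they leave. */
    static List<Token> tokenize(String text, Set<String> stopWords) {
        List<Token> out = new ArrayList<>();
        int skipped = 0; // tokens dropped since the last emitted token
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i);
            if (stopWords.contains(term)) {
                skipped++; // leave a gap instead of emitting the token
            } else {
                out.add(new Token(term, 1 + skipped, start, i));
                skipped = 0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("will"));
        for (Token t : tokenize("but I will beat him someday", stopWords)) {
            System.out.println(t);
        }
    }
}
```

With "will" removed, the token "beat" is printed as "2 11 - 15 : beat": its offsets still point at the original text, and its position increment of 2 records that one token before it is missing. Every other token keeps an increment of 1.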

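Because IKAnalyzer keeps its dictionaries separate from the code, we are not limited to editing main2012.dic; we can also add extra dictionaries of our own.

IKAnalyzer's configuration file and the dictionaries are placed together in the src directory.

Besides the word dictionaries, there is also a stopword.dic, which holds the words you want filtered out, for example “很强” ("very strong").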
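To picture what a stop-word dictionary does, here is a small plain-Java sketch. It assumes the dictionary is a text file with one word per line, and it does not call IKAnalyzer at all: the token list in main is a hypothetical segmentation I wrote by hand, so the real analyzer's output may differ. StopwordDicSketch and its method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch of a one-word-per-line stop-word dictionary; does not use IKAnalyzer. */
class StopwordDicSketch {

    /** Parses dictionary text: one word per line, blank lines ignored. */
    static Set<String> parse(String dicContent) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : dicContent.split("\\R")) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Returns the tokens that are not listed as stop words, in their original order. */
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in for the contents of stopword.dic.
        Set<String> stopWords = parse("很强\n");
        // Hypothetical segmentation of the Chinese clause; real IKAnalyzer output may differ.
        List<String> tokens = Arrays.asList("尽管", "战斗", "圣皇", "很强");
        System.out.println(filter(tokens, stopWords));
    }
}
```

With “很强” in the dictionary, the sketch prints [尽管, 战斗, 圣皇]: the stop word simply never reaches the index.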

0 0