IKAnalyzer with Lucene and standalone: usage examples and a simple performance test

Source: Internet  Published by: linux卸载ibus  Editor: 程序博客网  Time: 2024/06/15 02:30

Contents: JARs used; Using IKAnalyzer with Lucene; Tokenizing directly through an Analyzer; Tokenizing with IKSegmenter; Performance test

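IKAnalyzer is an open-source, lightweight Chinese word-segmentation toolkit written in Java. It uses its own "forward-iteration finest-granularity segmentation algorithm" and supports two segmentation modes: fine-grained and smart. When I first used it, I found that it could not segment text that mixes Chinese characters and Latin letters, for example iPhone5s土豪金 (the "tycoon gold" iPhone 5s). I later found that, as of the 2012 release, the dictionary supports entries that mix Chinese, English, and digits, dictionary storage has been optimized to use less memory, and user-defined extension dictionaries are supported. For a better test, this article therefore uses the IKAnalyzer2012_u6 release.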
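To make the difference between the two modes more concrete, here is a minimal sketch of a dictionary-based segmenter. It is a toy written for this article and not IKAnalyzer's actual algorithm: the class ToySegmenter and its methods are hypothetical, the "fine-grained" mode simply emits every dictionary word found at every position, and the "coarse" mode is plain greedy forward longest match, whereas IK's smart mode also resolves ambiguities. To keep the source ASCII-only, the romanized string "tuhaojin" stands in for 土豪金.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy dictionary segmenter; illustrative only, not IKAnalyzer's implementation. */
class ToySegmenter {
    private final Set<String> dict;
    private final int maxLen; // length of the longest dictionary entry

    ToySegmenter(Set<String> dict) {
        this.dict = dict;
        int longest = 1;
        for (String word : dict) {
            longest = Math.max(longest, word.length());
        }
        this.maxLen = longest;
    }

    /** Fine-grained sketch: every dictionary word found at every start position. */
    List<String> fineGrained(String text) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            int limit = Math.min(text.length(), start + maxLen);
            for (int end = start + 1; end <= limit; end++) {
                String candidate = text.substring(start, end);
                if (dict.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    /** Coarse sketch: greedy forward longest match, one character as fallback. */
    List<String> longestMatch(String text) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int limit = Math.min(text.length(), start + maxLen);
            int bestEnd = start + 1; // fall back to a single character
            for (int end = limit; end > start + 1; end--) {
                if (dict.contains(text.substring(start, end))) {
                    bestEnd = end;
                    break;
                }
            }
            out.add(text.substring(start, bestEnd));
            start = bestEnd;
        }
        return out;
    }
}
```

With a dictionary containing "iphone5s", "tuhao", "tuhaojin", and the mixed entry "iphone5stuhaojin", the coarse mode returns the mixed entry as a single token, while the fine-grained mode lists all four words. This mirrors why the mixed entry has to be in the dictionary before a term such as iPhone5s土豪金 can come out as one token.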

JARs used:

IKAnalyzer2012_u6.jar
lucene-core-3.6.0.jar

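Put IKAnalyzer.cfg.xml, ext.dic (create it by hand if it is missing), and stopword.dic from the IKAnalyzer distribution in the root directory of the source code, so that they end up at the root of the classpath.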
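For orientation, IKAnalyzer.cfg.xml is a file in the standard Java XML properties format. The sketch below embeds what such a file typically looks like and reads it with java.util.Properties. The keys ext_dict and ext_stopwords are quoted from memory of the 2012 distribution and should be checked against the file shipped with your jar; the class IkConfigSketch and its paths helper are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Reads an IKAnalyzer.cfg.xml-style properties file; illustrative sketch only. */
class IkConfigSketch {
    /** Sample content; key names are assumed, verify them against your own file. */
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
      + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
      + "<properties>\n"
      + "  <comment>IK Analyzer extension configuration</comment>\n"
      + "  <entry key=\"ext_dict\">ext.dic;</entry>\n"
      + "  <entry key=\"ext_stopwords\">stopword.dic;</entry>\n"
      + "</properties>\n";

    /** Returns the semicolon-separated file names stored under the given key. */
    static List<String> paths(String xml, String key) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> result = new ArrayList<>();
        String value = props.getProperty(key);
        if (value != null) {
            for (String part : value.split(";")) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    result.add(trimmed);
                }
            }
        }
        return result;
    }
}
```

Several dictionaries can be listed under one key by separating the file names with semicolons, which is why the helper splits on that character.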

Using IKAnalyzer with Lucene

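The downloaded package includes an example of using IKAnalyzer with Lucene. The content to be searched is first written to a Lucene index; the search keyword is then parsed by a Lucene QueryParser and looked up. An IKAnalyzer is passed in when the QueryParser is constructed, so the keyword is segmented by IKAnalyzer: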

Analyzer analyzer = new IKAnalyzer(true);
QueryParser qp = new QueryParser(Version.LUCENE_34, fieldName, analyzer);
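The reason the same analyzer goes to both the index writer and the query parser can be shown with a toy inverted index. This is a sketch under simplifying assumptions, not how Lucene stores or scores anything: ToyIndex, searchAll, and simpleAnalyzer are hypothetical names, and the stand-in analyzer only lower-cases and splits on non-alphanumeric characters. The point is that a query term can only hit if it was produced by the same segmentation that produced the indexed terms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

/** Toy inverted index; illustrative only, not how Lucene stores data. */
class ToyIndex {
    private final Function<String, List<String>> analyzer;
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextId = 0;

    ToyIndex(Function<String, List<String>> analyzer) {
        this.analyzer = analyzer;
    }

    /** Analyzes the text and records which document each term occurs in. */
    int add(String text) {
        int id = nextId++;
        for (String term : analyzer.apply(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    /** AND query: a document matches only if it contains every analyzed query term. */
    Set<Integer> searchAll(String query) {
        Set<Integer> hits = null;
        for (String term : analyzer.apply(query)) {
            Set<Integer> ids = postings.getOrDefault(term, new TreeSet<>());
            if (hits == null) {
                hits = new TreeSet<>(ids);
            } else {
                hits.retainAll(ids);
            }
        }
        return hits == null ? new TreeSet<>() : hits;
    }

    /** Stand-in analyzer: lower-cases and splits on non-alphanumeric characters. */
    static List<String> simpleAnalyzer(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }
}
```

Because both add and searchAll run the text through the same function, a query for "iPhone5s GOLD" finds a document that contains "iPhone5s gold". If the query side used a different segmentation, the terms would not line up with the postings and the search would return nothing.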

The ext.dic dictionary contains:

iPhone5s土豪金
2014巴西世界杯

The complete code:

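// Tokenizing with Lucene (package names are those of Lucene 3.6 and IKAnalyzer 2012)
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IKLuceneDemo {
    public static void main(String[] args) {
        // Field name of the Lucene Document
        String fieldName = "text";
        // Content to be indexed and searched
        String text = "据说WWDC要推出iPhone6要出了？与iPhone5s土豪金相比怎样呢？";
        // Create the IKAnalyzer (true = smart mode)
        Analyzer analyzer = new IKAnalyzer(true);
        Directory directory = null;
        IndexWriter iwriter = null;
        IndexReader ireader = null;
        IndexSearcher isearcher = null;
        try {
            // Build an in-memory index
            directory = new RAMDirectory();
            // Configure the IndexWriter
            IndexWriterConfig iwConfig = new IndexWriterConfig(Version.LUCENE_34, analyzer);
            iwConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
            iwriter = new IndexWriter(directory, iwConfig);
            // Write the document to the index
            Document doc = new Document();
            doc.add(new Field("ID", "10000", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field(fieldName, text, Field.Store.YES, Field.Index.ANALYZED));
            iwriter.addDocument(doc);
            iwriter.close();

            // Search phase
            // Create the searcher
            ireader = IndexReader.open(directory);
            isearcher = new IndexSearcher(ireader);

            String keyword = "iPhone5s土豪金";
            // Build the Query with a QueryParser that uses the same analyzer
            QueryParser qp = new QueryParser(Version.LUCENE_34, fieldName, analyzer);
            qp.setDefaultOperator(QueryParser.AND_OPERATOR);
            Query query = qp.parse(keyword);
            System.out.println("Query = " + query);
            // Fetch the 5 highest-scoring documents
            TopDocs topDocs = isearcher.search(query, 5);
            System.out.println("命中：" + topDocs.totalHits); // "命中" = hits
            // Print the results; iterate over scoreDocs, because totalHits
            // can be larger than the number of documents actually returned
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            for (int i = 0; i < scoreDocs.length; i++) {
                Document targetDoc = isearcher.doc(scoreDocs[i].doc);
                System.out.println("内容：" + targetDoc.toString()); // "内容" = content
            }
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        } finally {
            if (ireader != null) {
                try {
                    ireader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (directory != null) {
                try {
                    directory.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}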

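Output:

加载扩展词典：ext.dic
加载扩展停止词典：stopword.dic
Query = text:iphone5s土豪金
命中：1
内容：Document<stored,indexed<ID:10000> stored,indexed,tokenized<text:据说WWDC要推出iPhone6要出了？与iPhone5s土豪金相比怎样呢？>>

(The first two lines are IKAnalyzer's own messages for loading the extension dictionary and the extension stop-word dictionary. 命中 means "hits" and 内容 means "content".)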

Tokenizing directly through an Analyzer

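If we do not need to build a Lucene index and only want to segment a piece of text, we can create an org.apache.lucene.analysis.Analyzer directly and iterate over the tokens it produces. The concrete class is org.wltea.analyzer.lucene.IKAnalyzer, the main IK analysis class, which is built on Lucene's Analyzer (an abstract class in Lucene 3.6).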

The following demonstrates this, and also adds some extra words to the dictionary before tokenizing. The ext.dic dictionary contains:

iPhone5s土豪金
2014巴西世界杯

The code:

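import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.cfg.DefaultConfig;
import org.wltea.analyzer.dic.Dictionary;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IKAnalyzerTokenDemo {
    public static void main(String[] args) {
        // Text to segment
        String text = "据说WWDC要推出iPhone6要出了？与iPhone5s土豪金相比怎样呢？@2014巴西世界杯 test中文";
        List<String> list = new ArrayList<String>();
        list.add("test中文");
        // The dictionary is normally initialized lazily, on the first tokenization.
        // To add extra words before tokenizing, initialize it by hand first.
        Dictionary.initial(DefaultConfig.getInstance());
        Dictionary.getSingleton().addWords(list);
        // Create the analyzer (true = smart mode)
        Analyzer analyzer = new IKAnalyzer(true);
        StringReader reader = new StringReader(text);
        TokenStream ts = analyzer.tokenStream("", reader);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        // Iterate over the tokens
        try {
            ts.reset(); // part of the TokenStream contract before incrementToken()
            while (ts.incrementToken()) {
                System.out.print(term.toString() + "|");
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            reader.close();
        }
    }
}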

Output:

加载扩展词典：ext.dic
加载扩展停止词典：stopword.dic
据说|wwdc|要|推出|iphone6|要|出了|与|iphone5s土豪金|相比|怎样|呢|2014巴西世界杯|test中文|
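The example above adds its extra word from a hard-coded list. If the extra words live in a file instead, in the same one-entry-per-line layout as ext.dic, they can be read into a list first. The sketch below is one way to do that with the standard library only; DicFileSketch, parseWords, and loadWords are hypothetical helpers, and UTF-8 is assumed as the file encoding.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Reads dictionary entries, one per line; illustrative helper only. */
class DicFileSketch {
    /** Splits the content into lines, trims each one, and skips blank lines. */
    static List<String> parseWords(String content) {
        List<String> words = new ArrayList<>();
        for (String line : content.split("\\R")) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Reads a UTF-8 encoded file (encoding is an assumption) and parses its entries. */
    static List<String> loadWords(Path file) {
        try {
            byte[] bytes = Files.readAllBytes(file);
            return parseWords(new String(bytes, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resulting list can then be handed to Dictionary.getSingleton().addWords(...) in place of the hand-built list, after Dictionary.initial(...) has been called as in the example.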

Tokenizing with IKSegmenter

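Alternatively, to use IKAnalyzer on its own, without Lucene (that is, without lucene-core-3.6.0.jar), call IKSegmenter directly. It is the core class of the IK tokenizer and the one that actually performs the segmentation. The code: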

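// Standalone use, without Lucene
import java.io.IOException;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IKSegmenterDemo {
    public static void main(String[] args) {
        // Text to segment
        String text = "据说WWDC要推出iPhone6要出了？与iPhone5s相比怎样呢？@2014巴西世界杯";
        // Create the segmenter
        StringReader reader = new StringReader(text);
        // true enables smart mode (coarse-grained, maximum-word-length segmentation);
        // false gives fine-grained output
        IKSegmenter ik = new IKSegmenter(reader, true);
        Lexeme lexeme = null;
        try {
            while ((lexeme = ik.next()) != null) {
                System.out.println(lexeme.getLexemeText());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            reader.close();
        }
    }
}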

Performance test:

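Goal: using IKAnalyzer on its own, add as many entries as possible to the extension dictionary and measure how efficiently a text of well over 100,000 characters is segmented.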

A Sogou lexicon was added as the extension dictionary:

http://pinyin.sogou.com/dict/cell.php?id=11640

Entries    Size       Characters in test text
392790     13737KB    158453

The code:

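import java.io.IOException;
import java.io.StringReader;

import org.wltea.analyzer.cfg.DefaultConfig;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
import org.wltea.analyzer.dic.Dictionary;

public class IKSegmenterBenchmark {
    // The caller supplies the test text (158453 characters in this test)
    public static void run(String text) {
        // Measure the dictionary loading time
        long startLoadDict = System.currentTimeMillis();
        Dictionary.initial(DefaultConfig.getInstance());
        long endLoadDict = System.currentTimeMillis();
        // Create the segmenter
        StringReader reader = new StringReader(text);
        Lexeme lexeme = null;
        int hintTimes = 0; // number of lexemes returned
        // true enables smart mode (coarse-grained, maximum-word-length segmentation)
        IKSegmenter ik = new IKSegmenter(reader, true);
        long start = System.currentTimeMillis();
        try {
            while ((lexeme = ik.next()) != null) {
                hintTimes++;
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            reader.close();
        }
        long end = System.currentTimeMillis();
        // Labels, in order: dictionary load time, characters processed, lexemes returned,
        // total run time, characters per second, lexemes per second
        System.out.println("载入字典时间：" + (endLoadDict - startLoadDict) / 1000.0);
        System.out.println("处理文本字数：" + text.length());
        System.out.println("获取词元次数：" + hintTimes);
        System.out.println("执行总时间：" + (end - start) / 1000.0 + "s");
        System.out.println("处理速度：" + text.length() / ((end - start) / 1000.0) + "字/秒");
        System.out.println("本次获取词元速度：" + hintTimes / ((end - start) / 1000.0) + "词/秒");
    }
}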

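IKAnalyzer ships with a built-in dictionary of about 270,000 entries. Together with the extension dictionary, it is held in memory in an optimized form.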

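Result:

加载扩展词典：ext.dic
加载扩展停止词典：stopword.dic
载入字典时间：2.51
处理文本字数：158453
获取词元次数：54424
执行总时间：0.46s
处理速度：344463.04347826086字/秒
本次获取词元速度：118313.04347826086词/秒

(In order, after IKAnalyzer's two dictionary-loading messages: dictionary load time 2.51 s; 158453 characters processed; 54424 lexemes returned; total run time 0.46 s; about 344463 characters per second; about 118313 lexemes per second.)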
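The two rates in the result are simply a count divided by the elapsed time in seconds. The sketch below reproduces that arithmetic and adds an optional helper for repeating a measurement, since a single run on a JVM also includes class loading and JIT warm-up. ThroughputSketch and both of its methods are hypothetical names, and the warm-up idea is a general benchmarking habit rather than something the test above did.

```java
/** Throughput arithmetic and a simple repeat-and-take-best timer; illustrative only. */
class ThroughputSketch {
    /** Items processed per second, given a count and an elapsed time in seconds. */
    static double perSecond(long count, double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsed time must be positive");
        }
        return count / elapsedSeconds;
    }

    /** Runs the task unmeasured a few times, then returns the best timed run in seconds. */
    static double bestSeconds(Runnable task, int warmups, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long bestNanos = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            task.run();
            bestNanos = Math.min(bestNanos, System.nanoTime() - t0);
        }
        return bestNanos / 1_000_000_000.0;
    }
}
```

Plugging in the reported figures, perSecond(158453, 0.46) gives roughly 344463 characters per second and perSecond(54424, 0.46) gives roughly 118313 lexemes per second, matching the result above.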

======================================================================

Original article: http://www.itzhai.com/ikanalyzer-lucene-demo-performance-test.html

0 0

IKAnalyzer结合Lucene使用和单独使用例子 简单性能测试

IKAnalyzer结合Lucene使用和单独使用例子简单性能测试