Java tinkering notes: Lucene 3 - single-character tokenization, bigram tokenization, and stop words


These notes play with two of Lucene's built-in analyzers, StandardAnalyzer (single-character tokenization: each Chinese character becomes its own token) and CJKAnalyzer (bigram tokenization: adjacent characters are paired into two-character tokens), and with how each handles stop words (stop words are common words the analyzer filters out of the token stream instead of indexing).

Here's the code:

package cn;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class fenci {

    public static void main(String[] args) throws IOException {
        String txt = "我是中国人,爱和平爱团结";

        // Single-character tokenization: each Chinese character becomes its own token
        // 1. Create the analyzer
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        // 2. Get the token stream for the analyzed text
        TokenStream ts = analyzer.tokenStream("content", new StringReader(txt));
        // 3. Iterate over the tokens one by one
        ts.reset(); // since Lucene 4.0 the stream must be reset before incrementToken()
        while (ts.incrementToken()) {
            CharTermAttribute cta = ts.getAttribute(CharTermAttribute.class);
            System.out.println(cta.toString()); // one character per line; English input would print one word per line
        }
        ts.close();
        analyzer.close();
        /* Console output:
        我
        是
        中
        国
        人
        爱
        和
        平
        爱
        团
        结
        */

        // Bigram tokenization: adjacent characters are paired into two-character tokens,
        // e.g. "我是中国人" becomes 我是, 是中, 中国, 国人
        Analyzer analyzer2 = new CJKAnalyzer(Version.LUCENE_40);
        TokenStream ts2 = analyzer2.tokenStream("content", new StringReader(txt));
        ts2.reset();
        while (ts2.incrementToken()) {
            CharTermAttribute cta = ts2.getAttribute(CharTermAttribute.class);
            System.out.println(cta.toString());
        }
        ts2.close();
        analyzer2.close();
        /* Console output:
        我是
        是中
        中国
        国人
        爱和
        和平
        平爱
        爱团
        团结
        */

        // Stop words: common words the analyzer drops instead of indexing
        CharArraySet cas = StandardAnalyzer.STOP_WORDS_SET;
        System.out.println("StandardAnalyzer:" + cas); // all English -- StandardAnalyzer's default stop list targets English
        cas = CJKAnalyzer.getDefaultStopSet();
        System.out.println("CJKAnalyzer:" + cas);
        /* Console output:
        StandardAnalyzer:[but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of]
        CJKAnalyzer:[but, be, with, such, if, for, no, will, not, are, and, their, then, this, on, into, a, or, there, in, that, they, was, is, it, at, the, as, s, t, these, by, to, of, www]
        */

        // Custom stop words: you can define your own set, e.g. for profanity or other sensitive words
        CharArraySet casDiy = new CharArraySet(Version.LUCENE_40, 0, true);
        casDiy.add("fuck"); // added so the effect is easy to see in the console
        casDiy.add("shit");
        casDiy.addAll(StandardAnalyzer.STOP_WORDS_SET); // keep the default English stop words too

        // Verify the custom set: first without it
        Analyzer analyzer3 = new StandardAnalyzer(Version.LUCENE_40);
        String t3 = "apple fuck android, wp shit "; // the grammar is nonsense, it's just test input
        TokenStream ts3 = analyzer3.tokenStream("content", new StringReader(t3));
        ts3.reset();
        while (ts3.incrementToken()) {
            CharTermAttribute cta = ts3.getAttribute(CharTermAttribute.class);
            System.out.println(cta.toString());
        }
        ts3.close();
        analyzer3.close();
        /* Console output:
        apple
        fuck
        android
        wp
        shit
        */

        // Then with the custom stop words
        Analyzer analyzer5 = new StandardAnalyzer(Version.LUCENE_40, casDiy);
        String t5 = "apple fuck android, wp shit ";
        TokenStream ts5 = analyzer5.tokenStream("content", new StringReader(t5));
        ts5.reset();
        while (ts5.incrementToken()) {
            CharTermAttribute cta = ts5.getAttribute(CharTermAttribute.class);
            System.out.println(cta.toString());
        }
        ts5.close();
        analyzer5.close();
        /* Console output:
        apple
        android
        wp
        */
        // The custom stop words fuck and shit were filtered out
    }
}
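The print loop above is repeated four times, so it is a natural candidate for a small helper method. Below is a minimal sketch using the same Lucene 4.0 APIs (TokenPrinter and printTokens are hypothetical names of my own, not part of Lucene); it also spells out the full TokenStream lifecycle that Lucene 4.0 expects: reset(), then incrementToken() in a loop, then end() and close().

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenPrinter {

    // Prints every token the analyzer produces for the given text, one per line.
    public static void printTokens(Analyzer analyzer, String text) throws IOException {
        TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
        // addAttribute returns the stream's CharTermAttribute, creating it if absent
        CharTermAttribute cta = ts.addAttribute(CharTermAttribute.class);
        ts.reset();  // required before the first incrementToken()
        while (ts.incrementToken()) {
            System.out.println(cta.toString());
        }
        ts.end();    // finalizes state after the last token
        ts.close();  // releases the stream's resources
    }
}

Calling printTokens(new CJKAnalyzer(Version.LUCENE_40), "我是中国人,爱和平爱团结") reproduces the bigram output shown above.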

