Lucene N-gram: splitting text into tokens

Source: Internet  Editor: 程序博客网  Time: 2024/04/27 19:55

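I have recently been working on a text-mining project that needs the N-gram model together with the corresponding technique of measuring similarity by matching vectors.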
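To make the "vector matching" idea concrete, here is a minimal sketch in plain Java (no Lucene needed). It counts character bigrams in two strings and compares the two count vectors with cosine similarity. The post does not say which similarity measure the project uses, so cosine is an assumption here, and the names `NGramCosine`, `profile` and `cosine` are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class NGramCosine {

    // Hypothetical helper: count every character n-gram of length n in text.
    static Map<String, Integer> profile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity of two sparse count vectors; returns 0.0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Two strings that share most of their bigrams score close to 1.
        Map<String, Integer> a = profile("iphone6", 2);
        Map<String, Integer> b = profile("iphone5", 2);
        System.out.println(cosine(a, b));
    }
}
```

A sparse map is used instead of a dense array because the set of possible n-grams is huge for Chinese text, while each string contains only a few of them.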

An N-gram tokenization program

A reader asked me about this, so I decided to write it up here. The required jars are easy to find: a Baidu search for the Lucene jars turns them up.

    package edu.fjnu.huanghong;

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    /*
     * import org.apache.lucene.analysis.ngram.Lucene43EdgeNGramTokenizer;
     * import org.apache.lucene.analysis.ngram.Lucene43NGramTokenizer;
     */
    public class Ngram {

        public static void main(String[] args) {
            String s = "捡 白色 iphone6 手机 壳 透明 失主 方式 15659119418  ";
            // Drop the spaces by joining the space-separated pieces back together.
            String[] str = s.split(" ");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < str.length; i++) {
                sb.append(str[i]);
            }
            System.out.println(sb.toString());
            StringReader sr = new StringReader(sb.toString());
            // N-gram tokenizer; this constructor uses the default gram sizes (1 and 2).
            Tokenizer tokenizer = new NGramTokenizer(Version.LUCENE_45, sr);
            testtokenizer(tokenizer);
        }

        private static void testtokenizer(Tokenizer tokenizer) {
            try {
                // Fetch the term attribute once, before consuming the stream.
                CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
                tokenizer.reset();
                while (tokenizer.incrementToken()) {
                    System.out.print(charTermAttribute.toString() + "|");
                }
                tokenizer.end();
                tokenizer.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
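For readers who do not have the Lucene jars at hand, the following plain-Java sketch gives a rough idea of what a character N-gram tokenizer emits. It is not Lucene's implementation: the class `NGramSketch` and the method `charNGrams` are hypothetical names, the method works on Java `char`s rather than Unicode code points, and it assumes `minGram >= 1`. The ordering (by start offset, then by length) is meant to mirror how the Lucene 4.4+ tokenizer is documented to behave.

```java
import java.util.ArrayList;
import java.util.List;

class NGramSketch {

    // Emit every substring whose length lies between minGram and maxGram,
    // ordered by start offset and then by length. Assumes minGram >= 1.
    static List<String> charNGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Same "|"-separated output style as the Lucene program above.
        for (String gram : charNGrams("iphone6", 1, 2)) {
            System.out.print(gram + "|");
        }
        System.out.println();
        // prints: i|ip|p|ph|h|ho|o|on|n|ne|e|e6|6|
    }
}
```

With gram sizes 1 and 2, every character shows up once on its own and once as the start of a two-character gram, except the last character, which has no successor.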

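Does anyone more experienced know about q-grams? I could not find anything even when searching from outside the firewall. If you do, please send me a private message; I would be very grateful.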
