Introduction to the Jcseg Word Segmenter
Source: Internet  Editor: 程序博客网  Time: 2024/05/17 09:18
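Today I'll introduce the Jcseg word segmenter. First I'll get you running a working program; after that you can dig deeper at your own pace. The steps are as follows:
1. Unzip the archive jcseg-1.9.4-src-jar-dict.zip. Download link: http://download.csdn.net/detail/u010310183/8041677
2. Create a project of your own.
1) First create a config folder, and inside it create a file named jcseg.properties with the following content: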
# jcseg properties file.
# bug report chenxin <chenxin619315@gmail.com>

# Jcseg function
#maximum match length. (5-7)
jcseg.maxlen=5
#recognized the chinese name.(1 to open and 0 to close it)
jcseg.icnname=1
#maximum chinese word number of english chinese mixed word.
jcseg.mixcnlen=2
#maximum length for pair punctuation text.
jcseg.pptmaxlen=15
#maximum length for chinese last name andron.
jcseg.cnmaxlnadron=1
#Wether to clear the stopwords.(set 1 to clear stopwords and 0 to close it)
jcseg.clearstopword=0
#Wether to convert the chinese numeric to arabic number. (set to 1 open it and 0 to close it)
# like '\u4E09\u4E07' to 30000.
jcseg.cnnumtoarabic=1
#Wether to convert the chinese fraction to arabic fraction.
jcseg.cnfratoarabic=1
#Wether to keep the unrecognized word. (set 1 to keep unrecognized word and 0 to clear it)
jcseg.keepunregword=1
#Wether to start the secondary segmentation for the complex english words.
jcseg.ensencondseg = 1
#min length of the secondary simple token. (better larger than 1)
jcseg.stokenminlen = 2
#thrshold for chinese name recognize.
# better not change it before you know what you are doing.
jcseg.nsthreshold=1000000
#The punctuations that will be keep in an token.(Not the end of the token).
jcseg.keeppunctuations=@%.&+

####about the lexicon
#prefix of lexicon file.
lexicon.prefix=lex
#suffix of lexicon file.
lexicon.suffix=lex
#abusolte path of the lexicon file.
#Multiple path support from jcseg 1.9.2, use ';' to split different path.
#example: lexicon.path = /home/chenxin/lex1;/home/chenxin/lex2 (Linux)
#: lexicon.path = D:/jcseg/lexicon/1;D:/jcseg/lexicon/2 (WinNT)
lexicon.path=D:/jcseg/lexicon/jcseg-1.9.4-src-jar-dict/jcseg-1.9.4/lexicon
#Wether to load the modified lexicon file auto.
lexicon.autoload=1
#Poll time for auto load. (seconds)
lexicon.polltime=120

####lexicon load
#Wether to load the part of speech of the entry.
jcseg.loadpos=1
#Wether to load the pinyin of the entry.
jcseg.loadpinyin=0
#Wether to load the synoyms words of the entry.
jcseg.loadsyn=1
2) Then create a class with the following content:
package jcseg;

import java.io.IOException;
import java.io.StringReader;

import org.lionsoul.jcseg.ASegment;
import org.lionsoul.jcseg.core.ADictionary;
import org.lionsoul.jcseg.core.DictionaryFactory;
import org.lionsoul.jcseg.core.IWord;
import org.lionsoul.jcseg.core.JcsegException;
import org.lionsoul.jcseg.core.JcsegTaskConfig;
import org.lionsoul.jcseg.core.SegmentFactory;

public class test {

    public static void main(String[] args) throws IOException, JcsegException {
        // Create the JcsegTaskConfig segmentation task instance,
        // i.e. the configuration initialized from the jcseg.properties file.
        JcsegTaskConfig config = new JcsegTaskConfig("config/jcseg.properties");
        //config.setAppendCJKPinyin(true);

        // Create the default dictionary (i.e. a com.webssky.jcseg.Dictionary object)
        // and let it load the lexicon on its own, based on the given JcsegTaskConfig.
        ADictionary dic = DictionaryFactory.createDefaultDictionary(config, true);

        // This path points into the unpacked jcseg-1.9.4-src-jar-dict.zip;
        // find lex-main.lex in its lexicon folder and adjust the path for your machine.
        dic.loadFromLexiconFile("D:/jcseg/lexicon/jcseg-1.9.4-src-jar-dict/jcseg-1.9.4/lexicon/lex-main.lex");
        //dic.loadFromLexiconDirectory(config, config.getLexiconPath());

        // Create an ISegment from the given ADictionary and JcsegTaskConfig.
        // SegmentFactory#createJcseg is the usual way to create an ISegment object:
        // pass config and dic to it as an Object array.
        // JcsegTaskConfig.COMPLEX_MODE creates a ComplexSeg (complex mode) segmenter.
        // JcsegTaskConfig.SIMPLE_MODE creates a SimpleSeg (simple mode) segmenter.
        ASegment seg = (ASegment) SegmentFactory.createJcseg(
                JcsegTaskConfig.COMPLEX_MODE, new Object[]{config, dic});

        // Set the text to segment.
        String str = "研究";
        seg.reset(new StringReader(str));

        // Fetch the segmentation results one word at a time.
        IWord word = null;
        while ((word = seg.next()) != null) {
            System.out.println(word.getValue());
        }
    }
}

3. Run the project. The point of this project is simply to get you up and running. When the input is 研究, the program prints the words similar to 研究 that it finds in the lexicon (synonym loading is switched on by jcseg.loadsyn=1 in the config above).
4. To learn more about what Jcseg can do, download the documentation here: http://download.csdn.net/detail/u010310183/8041725
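
If the class from step 2 prints nothing or fails on startup, a likely cause is that lexicon.path in jcseg.properties does not match where you unpacked the archive. The sketch below uses only the JDK to read the properties file and count the lexicon files in each configured directory. JcsegConfigCheck and its two methods are hypothetical helpers written for this post, not part of Jcseg, and the sketch assumes lexicon files are named prefix + anything + "." + suffix, the way lex-main.lex is.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of Jcseg): sanity-checks jcseg.properties with the JDK only.
final class JcsegConfigCheck {

    // Splits a lexicon.path value on ';', the multi-path syntax described in the
    // properties file comments, dropping blank entries.
    static List<String> splitLexiconPaths(String value) {
        List<String> paths = new ArrayList<>();
        if (value == null) {
            return paths;
        }
        for (String part : value.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paths.add(trimmed);
            }
        }
        return paths;
    }

    // Counts files directly inside dir whose names match <prefix>*.<suffix>.
    // Assumption: this naming pattern is how lexicon files are recognized.
    static int countLexiconFiles(Path dir, String prefix, String suffix) throws IOException {
        if (!Files.isDirectory(dir)) {
            return 0;
        }
        int count = 0;
        try (DirectoryStream<Path> stream =
                     Files.newDirectoryStream(dir, prefix + "*." + suffix)) {
            for (Path ignored : stream) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        String file = args.length > 0 ? args[0] : "config/jcseg.properties";
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8)) {
            props.load(in);
        }
        String prefix = props.getProperty("lexicon.prefix", "lex");
        String suffix = props.getProperty("lexicon.suffix", "lex");
        for (String p : splitLexiconPaths(props.getProperty("lexicon.path"))) {
            int n = countLexiconFiles(Paths.get(p), prefix, suffix);
            System.out.println(p + " -> " + n + " lexicon file(s)");
        }
    }
}
```

Run it from the project root so that the relative path config/jcseg.properties resolves. A directory reported with 0 lexicon files most likely means the path needs to be corrected before the segmenter will return anything useful.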