Stanford Word Segmenter in Practice

Source: Internet. Editor: 程序博客网. Date: 2024/06/14 00:16

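Introduction

English text already marks word boundaries with spaces. Chinese text does not: its smallest unit is the character, so it has to be segmented into words before it can be used.
Home page of the open-source Stanford Word Segmenter: https://nlp.stanford.edu/software/segmenter.shtml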

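Tokenization of raw text is a standard preprocessing step for many NLP tasks. For English, tokenization usually means splitting off punctuation and separating some affixes. Other languages need more extensive token preprocessing, which is usually called segmentation.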
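To make the difference concrete, here is a minimal sketch that splits text on whitespace only. It is not the Stanford Tokenizer, and the class and method names are made up for this illustration. The English sentence falls apart into usable tokens (with the final period still attached, which is the punctuation splitting a real tokenizer would handle), while the Chinese sentence comes back as a single token.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustration only: whitespace splitting as a naive tokenizer. */
final class WhitespaceSplitDemo {

    /** Splits on runs of whitespace. This is NOT the Stanford Tokenizer. */
    static List<String> splitOnWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // English: spaces already mark most word boundaries.
        System.out.println(splitOnWhitespace("I live in the United States."));
        // Chinese: no spaces, so the whole sentence is returned as one token.
        System.out.println(splitOnWhitespace("我住在美国。"));
    }
}
```

This is why a dedicated segmenter is needed for Chinese: there is nothing in the raw text to split on.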

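The Stanford Word Segmenter currently supports Arabic and Chinese. The Stanford Tokenizer can be used for English, French, and Spanish.
It requires JDK 1.8 or later.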

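Chinese word segmentation with the Stanford tool:
Chinese text needs to be segmented into words. This tool is a Java implementation of a CRF-based Chinese word segmenter.
The implementation is based on the paper: A Conditional Random Field Word Segmenter
Paper: https://nlp.stanford.edu/pubs/sighan2005.pdf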
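A CRF segmenter of this kind works by labeling each character and then reading the word boundaries off the labels. The sketch below shows only that last decoding step, assuming a binary scheme in which each character is marked as either starting a new word or continuing the current one. The class name, the method name, and the hand-written labels are all hypothetical; no CRF model is involved, and this is not the library's internal code.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: turning per-character boundary labels into words. */
final class LabelsToWords {

    /**
     * Rebuilds words from per-character labels.
     * startsWord[i] == true means character i begins a new word.
     * Assumes one Java char per character (no surrogate pairs).
     */
    static List<String> decode(String text, boolean[] startsWord) {
        if (text.length() != startsWord.length) {
            throw new IllegalArgumentException("need exactly one label per character");
        }
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            // A "start" label closes the word collected so far.
            if (startsWord[i] && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(text.charAt(i));
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Hand-written labels for illustration; a trained model would predict them.
        boolean[] labels = {true, true, false, true, false, true};
        System.out.println(decode("我住在美国。", labels)); // prints [我, 住在, 美国, 。]
    }
}
```

The hard part, which the CRF model handles, is predicting the labels; once they exist, recovering the words is a single pass over the text.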

This release includes two separate segmentation models: one follows the Chinese Penn Treebank standard, the other the Peking University standard.

A later release added the ability to use external lexicon features, which makes the segmentation more accurate. It is based on the paper:
Optimizing Chinese Word Segmentation for Machine Translation Performance
Paper: https://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf

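The package contains components for command-line use as well as a Java API. The download includes the model files, the compiled code, and the source files, so once you unpack the archive you should have everything you need. Simple scripts for invoking the segmenter are included.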

Download the latest version: https://nlp.stanford.edu/software/stanford-segmenter-2017-06-09.zip

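Hands-on

Copy the data directory from the unpacked archive into your project, then add the three jars stanford-segmenter-3.8.0.jar, stanford-segmenter-3.8.0-javadoc.jar, and stanford-segmenter-3.8.0-sources.jar to lib. Copy the SegDemo class shipped with the archive into the project and run it directly. Pay attention to where the files are placed: the version of SegDemo shown here looks for the models under data/pos_model by default, so if loading fails, adjust the paths until they match your layout.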
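Since most problems at this step come down to misplaced files, a quick pre-flight check can save a slow failed model load. The sketch below is a hypothetical helper, not part of the Stanford package. The file names are the ones that appear in SegDemo and in its log output in this post; the data/pos_model default is this post's layout and is an assumption about yours.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: checks that the model files SegDemo loads are in place. */
final class ModelFileCheck {

    // File names taken from SegDemo and its log output in this post.
    static final String[] REQUIRED = {
        "ctb.gz", "dict-chris6.ser.gz", "dict/character_list", "dict/in.ctb"
    };

    /** Returns the required files that do not exist under baseDir. */
    static List<Path> missingFiles(Path baseDir) {
        List<Path> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            Path candidate = baseDir.resolve(name);
            if (!Files.isRegularFile(candidate)) {
                missing.add(candidate);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed default; pass your own directory as the first argument.
        Path baseDir = Paths.get(args.length > 0 ? args[0] : "data/pos_model");
        List<Path> missing = missingFiles(baseDir);
        if (missing.isEmpty()) {
            System.out.println("All model files found under " + baseDir.toAbsolutePath());
        } else {
            missing.forEach(p -> System.out.println("Missing: " + p.toAbsolutePath()));
        }
    }
}
```

Printing absolute paths makes it obvious which working directory the IDE is actually using, which is the usual cause of a path mismatch.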

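You can also pass the input as an argument: in the run configuration, enter the file path under Program arguments, for example src\test.txt.
Running the program then prints the segmentation result.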
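SegDemo sets inputEncoding to UTF-8, so the input file should be saved in that encoding; a file saved in another encoding such as GBK would be read as garbage. A minimal sketch for creating such a file follows. The class name is hypothetical and the default path src/test.txt simply mirrors the example above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: writes a UTF-8 input file for SegDemo. */
final class WriteTestInput {

    /** Writes one sentence per line, encoded as UTF-8, creating parent directories. */
    static Path writeUtf8(Path target, List<String> sentences) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(target, sentences, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumed default path, matching the example argument in this post.
        Path target = Paths.get(args.length > 0 ? args[0] : "src/test.txt");
        writeUtf8(target, Arrays.asList("我住在美国。"));
        System.out.println("Wrote " + target.toAbsolutePath());
    }
}
```

Passing the charset explicitly avoids depending on the platform default, which differs between operating systems and JDK versions.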

SegDemo code:

package WordSegmenter;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

import java.io.*;
import java.util.List;
import java.util.Properties;

/**
 * Segments the text files given as program arguments and prints the result.
 */
public class SegDemo {

  // Directory holding the model files; override with -DSegDemo=<dir>
  private static final String basedir = System.getProperty("SegDemo", "data/pos_model");

  public static void main(String[] args) throws Exception {
    System.setOut(new PrintStream(System.out, true, "utf-8"));

    // Set the segmenter properties
    Properties props = new Properties();
    props.setProperty("sighanCorporaDict", basedir);
    // props.setProperty("NormalizationTable", "data/norm.simp.utf8");
    // props.setProperty("normTableEncoding", "UTF-8");
    // below is needed because CTBSegDocumentIteratorFactory accesses it
    props.setProperty("serDictionary", basedir + "/dict-chris6.ser.gz");
    if (args.length > 0) {
      props.setProperty("testFile", args[0]);
    }
    props.setProperty("inputEncoding", "UTF-8");
    props.setProperty("sighanPostProcessing", "true");

    CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
    segmenter.loadClassifierNoExceptions(basedir + "/ctb.gz", props);

    // Segment every file passed as an argument
    for (String filename : args) {
      segmenter.classifyAndWriteAnswers(filename);
    }

    // Segment a single string through the API
    String sample = "我住在美国。";
    List<String> segmented = segmenter.segmentString(sample);
    System.out.println(segmented);
  }
}

Output:
src\test.txt is the file to segment, passed as the program argument.

testFile=src\test.txt
serDictionary=data/pos_model/dict-chris6.ser.gz
sighanCorporaDict=data/pos_model
inputEncoding=UTF-8
sighanPostProcessing=true
Loading Chinese dictionaries from 1 file:
  data/pos_model/dict-chris6.ser.gz
Done. Unique words in ChineseDictionary is: 423200.
Loading classifier from data/pos_model/ctb.gz ... done [10.6 sec].
Loading character dictionary file from data/pos_model/dict/character_list [done].
Loading affix dictionary from data/pos_model/dict/in.ctb [done].
我的 是 你 的 嘛
CRFClassifier tagged 6 words in 1 documents at 81.08 words per second.
[我, 住在, 美国, 。]

Reference: https://nlp.stanford.edu/software/segmenter.shtml
