Using the Stanford CoreNLP Java API for English and Chinese Text
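Stanford CoreNLP is a powerful natural language processing toolkit. A number of its models are trained with deep learning methods, and its accuracy is considerably higher than that of many earlier tools. In recent years many open-source projects have also adopted some of its methods.
I recently picked this tool up again for some semantic analysis work, but found that Chinese-language material on it is scarce, which makes getting started difficult. I have therefore written up how I use it, in the hope that it helps others who need it.
This article mainly explains how to call the Stanford CoreNLP API from a Java project.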
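1. Environment setup
You need Eclipse or IntelliJ IDEA, JDK 1.8, and Apache Maven. (Note: CoreNLP 3.5 and later require Java 8 to run. If you do not want to run on Java 8, use an earlier CoreNLP release.)
Create a new Maven project and add the following to its pom file: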
<properties>
    <corenlp.version>3.6.0</corenlp.version>
</properties>

<dependencies>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
    </dependency>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
        <classifier>models</classifier>
    </dependency>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
        <classifier>models-chinese</classifier>
    </dependency>
</dependencies>
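The three dependencies are the CoreNLP code library, the English models, and the Chinese models. Because Maven's default repository is hosted outside China and the Stanford NLP model files are large, the download places high demands on the network, and on a slow connection it can easily time out and fail. The fix is to find a mirror inside China that carries the Stanford NLP dependencies and configure it in the mirrors section of Maven's settings.xml.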
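Since a failed or partial download is easy to miss, a quick classpath check can confirm that the model jars actually arrived. The following is a minimal sketch using only the JDK; the class name ModelClasspathCheck is made up for illustration, and the two resource paths are the ones that appear in the Chinese configuration file shown later in this article.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper (hypothetical name): reports whether resources are visible on the classpath. */
final class ModelClasspathCheck {

    /** Returns true if the class loader can locate the named resource. */
    static boolean isOnClasspath(String resourcePath) {
        ClassLoader loader = ModelClasspathCheck.class.getClassLoader();
        return loader.getResource(resourcePath) != null;
    }

    /** Checks several resources and keeps them in the order given. */
    static Map<String, Boolean> check(String... resourcePaths) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String path : resourcePaths) {
            result.put(path, isOnClasspath(path));
        }
        return result;
    }

    public static void main(String[] args) {
        // Paths taken from the default Chinese properties file in the models-chinese jar.
        Map<String, Boolean> result = check(
                "StanfordCoreNLP-chinese.properties",
                "edu/stanford/nlp/models/segmenter/chinese/ctb.gz");
        result.forEach((path, found) ->
                System.out.println((found ? "found    " : "MISSING  ") + path));
    }
}
```

If either entry is reported as missing, the models-chinese dependency was probably not downloaded completely, and re-running the Maven build against a faster mirror is the first thing to try.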
2. Processing English text
package edu.zju.cst.krselee.examples.english;

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

import java.util.List;
import java.util.Map;
import java.util.Properties;

/**
 * Created by KrseLee on 2016/11/5.
 */
public class StanfordEnglishNlpExample {

    public static void main(String[] args) {
        StanfordEnglishNlpExample example = new StanfordEnglishNlpExample();
        example.runAllAnnotators();
    }

    public void runAllAnnotators() {
        // creates a StanfordCoreNLP object, with POS tagging, lemmatization,
        // NER, parsing, and coreference resolution
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // read some text in the text variable
        String text = "this is a simple text"; // Add your text here!

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);

        // run all Annotators on this text
        pipeline.annotate(document);
        parserOutput(document);
    }

    public void parserOutput(Annotation document) {
        // these are all the sentences in this document
        // a CoreMap is essentially a Map that uses class objects as keys
        // and has values with custom types
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // this is the text of the token
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                // this is the POS tag of the token
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                // this is the NER label of the token
                String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                // print the token with its tags, one token per line
                System.out.println(word + "\t" + pos + "\t" + ne);
            }

            // this is the parse tree of the current sentence
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            System.out.println("Parse tree:");
            System.out.println(tree.toString());

            // this is the Stanford dependency graph of the current sentence
            SemanticGraph dependencies = sentence.get(
                    SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println("Dependencies:");
            System.out.println(dependencies.toString());
        }

        // This is the coreference link graph
        // Each chain stores a set of mentions that link to each other,
        // along with a method for getting the most representative mention
        // Both sentence and token offsets start at 1!
        Map<Integer, CorefChain> graph = document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
    }
}
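The comment in the code above describes a CoreMap as a map that uses class objects as keys and has values with custom types. The idea can be illustrated with a small JDK-only sketch. This is not CoreNLP's implementation; every name in it (TypedMapSketch, Key, TextKey, TokenCountKey, TypedMap) is invented for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of a map keyed by class objects; not CoreNLP code. */
final class TypedMapSketch {

    /** A key type declares, through its type parameter, the type of the value it maps to. */
    interface Key<V> { }

    /** Hypothetical key whose value is a String. */
    static final class TextKey implements Key<String> { }

    /** Hypothetical key whose value is an Integer. */
    static final class TokenCountKey implements Key<Integer> { }

    static final class TypedMap {
        private final Map<Class<?>, Object> values = new HashMap<>();

        /** Stores a value; the compiler checks that it matches the key's declared type. */
        <V> void set(Class<? extends Key<V>> key, V value) {
            values.put(key, value);
        }

        /** Returns the stored value, or null if nothing was stored under this key. */
        @SuppressWarnings("unchecked")
        <V> V get(Class<? extends Key<V>> key) {
            return (V) values.get(key);
        }
    }

    public static void main(String[] args) {
        TypedMap doc = new TypedMap();
        doc.set(TextKey.class, "this is a simple text");
        doc.set(TokenCountKey.class, 5);

        // No casts are needed at the call site: the key class fixes the value type.
        String text = doc.get(TextKey.class);
        Integer count = doc.get(TokenCountKey.class);
        System.out.println(text + " -> " + count + " tokens");
    }
}
```

This is why calls such as token.get(CoreAnnotations.TextAnnotation.class) in the example return a String without a cast: the key class carries the value type.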
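Note that Stanford CoreNLP works as a pipeline: the user gets a single interface for setting parameters, and everything after that is encapsulated, which makes it very convenient to use. All results are stored in an Annotation object and can be retrieved from it as needed. The paper The Stanford CoreNLP Natural Language Processing Toolkit (http://nlp.stanford.edu/pubs/StanfordCoreNlp2014.pdf) describes the pipeline design in detail, so I will not repeat it here.
One point worth mentioning is that in the annotators parameter, later annotators often depend on earlier ones. Intuitively, POS tagging (pos) depends on tokenization (tokenize), parsing (parse) depends on tagging, and so on.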
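The ordering rule can be sketched as a small check that runs before the pipeline is built. The code below uses only the JDK. The prerequisite table is a simplified assumption covering five annotators, not CoreNLP's own requirement table, and the class and method names are invented for illustration. Annotators that are not in the table are accepted without any check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative order check for an annotators string; not part of CoreNLP. */
final class AnnotatorOrderCheck {

    // Simplified prerequisite table (an assumption for this sketch).
    static final Map<String, List<String>> PREREQUISITES = new HashMap<>();
    static {
        PREREQUISITES.put("ssplit", Arrays.asList("tokenize"));
        PREREQUISITES.put("pos", Arrays.asList("tokenize", "ssplit"));
        PREREQUISITES.put("lemma", Arrays.asList("tokenize", "ssplit", "pos"));
        PREREQUISITES.put("ner", Arrays.asList("tokenize", "ssplit", "pos", "lemma"));
    }

    /** Returns one message per violated prerequisite; an empty list means no problem was found. */
    static List<String> findOrderProblems(String annotators) {
        List<String> problems = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String raw : annotators.split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue;
            }
            List<String> needed = PREREQUISITES.getOrDefault(name, Collections.emptyList());
            for (String prerequisite : needed) {
                if (!seen.contains(prerequisite)) {
                    problems.add(name + " is listed before its prerequisite " + prerequisite);
                }
            }
            seen.add(name);
        }
        return problems;
    }

    public static void main(String[] args) {
        // The order used in the English example above.
        System.out.println(findOrderProblems("tokenize, ssplit, pos, lemma, ner, parse, dcoref"));
        // A deliberately wrong order: pos comes before tokenize and ssplit.
        System.out.println(findOrderProblems("pos, tokenize, ssplit"));
    }
}
```

A check like this only catches ordering mistakes in the property string; the pipeline itself remains the authority on what each annotator needs.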
3. Processing Chinese text
Compared with English, processing Chinese text is slightly more involved, and the main difference is a configuration file. The Chinese models package ships a default configuration file, StanfordCoreNLP-chinese.properties, with the following content:

# Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)
annotators = segment, ssplit, pos, lemma, ner, parse, mention, coref

# segment
customAnnotatorClass.segment = edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true

# sentence split
ssplit.boundaryTokenRegex = [.]|[!?]+|[。]|[!?]+

# pos
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger

# ner
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.applyNumericClassifiers = false
ner.useSUTime = false

# parse
parse.model = edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz

# coref
coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
coref.input.type = raw
coref.postprocessing = true
coref.calculateFeatureImportance = false
coref.useConstituencyTree = true
coref.useSemantics = false
coref.md.type = RULE
coref.mode = hybrid
coref.path.word2vec =
coref.language = zh
coref.print.md.log = false
coref.defaultPronounAgreement = true
coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz

The file mainly specifies the steps of the pipeline and the locations of the corresponding model files. In practice we may not need every step, or may want to use different models, so we can write a custom configuration file and load it in the code. The main Java code is as follows:
package edu.zju.cst.krselee.examples.chinese;

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

import java.util.List;
import java.util.Map;

/**
 * Created by KrseLee on 2016/11/4.
 */
public class StanfordChineseNlpExample {

    public static void main(String[] args) {
        StanfordChineseNlpExample example = new StanfordChineseNlpExample();
        example.runChineseAnnotators();
    }

    public void runChineseAnnotators() {
        String text = "克林顿说,华盛顿将逐步落实对韩国的经济援助。"
                + "金大中对克林顿的讲话报以掌声:克林顿总统在会谈中重申,他坚定地支持韩国摆脱经济危机。";
        Annotation document = new Annotation(text);

        // load the pipeline settings from the properties file on the classpath
        StanfordCoreNLP corenlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
        corenlp.annotate(document);
        parserOutput(document);
    }

    public void parserOutput(Annotation document) {
        // these are all the sentences in this document
        // a CoreMap is essentially a Map that uses class objects as keys
        // and has values with custom types
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // this is the text of the token
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                // this is the POS tag of the token
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                // this is the NER label of the token
                String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                System.out.println(word + "\t" + pos + "\t" + ne);
            }

            // this is the parse tree of the current sentence
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            System.out.println("Parse tree:");
            System.out.println(tree.toString());

            // this is the Stanford dependency graph of the current sentence
            SemanticGraph dependencies = sentence.get(
                    SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println("Dependencies:");
            System.out.println(dependencies.toString());
        }

        // This is the coreference link graph
        // Each chain stores a set of mentions that link to each other,
        // along with a method for getting the most representative mention
        // Both sentence and token offsets start at 1!
        Map<Integer, CorefChain> graph = document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
    }
}
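As mentioned before the listing, not every step is always needed. One way to customize the configuration without writing a second file is to load the default settings and override the annotators key in code. The sketch below uses only java.util.Properties. The class and method names are invented for illustration, and the inline settings are a shortened copy of the default file shown earlier; in a real project the reader would more likely come from the classpath resource StanfordCoreNLP-chinese.properties.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative helper (hypothetical name) for building a reduced pipeline configuration. */
final class CustomChinesePropsSketch {

    /** Loads base settings from the reader, then replaces the annotators list. */
    static Properties loadWithAnnotators(Reader base, String annotators) {
        Properties props = new Properties();
        try {
            props.load(base);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        props.setProperty("annotators", annotators);
        return props;
    }

    public static void main(String[] args) {
        // Shortened copy of the default Chinese settings, used here as the base.
        String base = String.join("\n",
                "annotators = segment, ssplit, pos, lemma, ner, parse, mention, coref",
                "customAnnotatorClass.segment = edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator",
                "segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz",
                "pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger");

        // Keep only segmentation, sentence splitting, and POS tagging.
        Properties props = loadWithAnnotators(new StringReader(base), "segment, ssplit, pos");
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```

The resulting Properties object can then be passed to new StanfordCoreNLP(props), the same constructor the English example uses. Dropping the later annotators also avoids loading their models, which should shorten startup.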
References:
[1] http://stanfordnlp.github.io/CoreNLP/index.html
[2] https://blog.sectong.com/blog/corenlp_segment.html