1. Getting started with Stanford CoreNLP

The core CoreNLP package is built around two classes: Annotation and Annotator.

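Annotations are the data structures that hold the results of annotators; an Annotation is essentially a map. Annotators are more like functions, except that they operate on Annotations rather than on Objects.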
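To make the map-versus-function split concrete, here is a small plain-Java sketch of the idea. It does not use CoreNLP: TypedMap, Key, TextKey, TokenCountKey and countTokens are hypothetical names made up for illustration, and the whitespace tokenizer is a deliberate oversimplification.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an Annotation: a map whose keys are Class objects,
// where each key class declares the type of the value stored under it.
public class TypedMapSketch {

    /** A key type; V is the type of the value stored under this key. */
    interface Key<V> {}

    // Illustrative keys (not CoreNLP classes).
    static final class TextKey implements Key<String> {}
    static final class TokenCountKey implements Key<Integer> {}

    /** Minimal class-keyed map, playing the role of an Annotation. */
    static final class TypedMap {
        private final Map<Class<?>, Object> values = new HashMap<>();

        <V> void set(Class<? extends Key<V>> key, V value) {
            values.put(key, value);
        }

        @SuppressWarnings("unchecked")
        <V> V get(Class<? extends Key<V>> key) {
            return (V) values.get(key);
        }
    }

    /** Plays the role of an Annotator: reads from the map and writes its result back. */
    static void countTokens(TypedMap doc) {
        String text = doc.get(TextKey.class).trim();
        int count = text.isEmpty() ? 0 : text.split("\\s+").length;
        doc.set(TokenCountKey.class, count);
    }

    public static void main(String[] args) {
        TypedMap doc = new TypedMap();
        doc.set(TextKey.class, "Shanghai was a fishing village");
        countTokens(doc);
        System.out.println(doc.get(TokenCountKey.class)); // prints 5
    }
}
```

Because the key class carries the value type, get(TextKey.class) returns a String and get(TokenCountKey.class) returns an Integer without a cast at the call site, which is the same shape as calls like token.get(TextAnnotation.class) in the CoreNLP example.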

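Annotators can perform tasks such as tokenization, parsing, named entity recognition (NER), and part-of-speech (POS) tagging. Annotators and Annotations are tied together by AnnotationPipelines. StanfordCoreNLP extends the AnnotationPipeline class and is customized with NLP Annotators. The output of the Annotators is read back through CoreMap and CoreLabel.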
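The pipeline idea can be sketched without CoreNLP as a list of functions applied in order to one shared map. Everything below is a toy model under stated assumptions: PipelineSketch, run, tokenize and ssplit are hypothetical names, the string keys replace CoreNLP's class keys, and the splitting rules are far cruder than the real annotators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical, simplified model of an annotation pipeline; not CoreNLP code.
public class PipelineSketch {

    /** Runs each annotator in order over the same shared annotation map. */
    static Map<String, Object> run(String text, List<Consumer<Map<String, Object>>> annotators) {
        Map<String, Object> annotation = new HashMap<>();
        annotation.put("text", text);
        for (Consumer<Map<String, Object>> annotator : annotators) {
            annotator.accept(annotation);
        }
        return annotation;
    }

    /** Toy tokenizer: splits the raw text on whitespace. */
    static void tokenize(Map<String, Object> annotation) {
        String text = (String) annotation.get("text");
        annotation.put("tokens", Arrays.asList(text.trim().split("\\s+")));
    }

    /** Toy sentence splitter: depends on the "tokens" entry written by tokenize. */
    @SuppressWarnings("unchecked")
    static void ssplit(Map<String, Object> annotation) {
        List<String> tokens = (List<String>) annotation.get("tokens");
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (token.matches(".*[.!?]")) { // token ends a sentence
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        annotation.put("sentences", sentences);
    }

    public static void main(String[] args) {
        List<Consumer<Map<String, Object>>> annotators =
                Arrays.asList(PipelineSketch::tokenize, PipelineSketch::ssplit);
        Map<String, Object> result = run("Shanghai grew. It had 200 banks.", annotators);
        System.out.println(result.get("tokens"));
        System.out.println(result.get("sentences"));
    }
}
```

The order of the list matters: ssplit reads what tokenize wrote, which is why annotator names in a CoreNLP "annotators" property are listed with prerequisites first.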

1. Create a StanfordCoreNLP object with StanfordCoreNLP(Properties props).

2. Parse arbitrary text with annotate(Annotation document).

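import java.util.List;
import java.util.Map;
import java.util.Properties;

// The coref package names depend on the CoreNLP version: older releases use
// edu.stanford.nlp.dcoref (as here); newer ones use edu.stanford.nlp.coref
// and edu.stanford.nlp.coref.data. Adjust these two imports to your version.
import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation;
import edu.stanford.nlp.util.CoreMap;

public class WordSeg {
    public static void main(String[] args) {
        // Configure a pipeline with POS tagging, lemmatization, NER,
        // parsing, and coreference resolution
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,dcoref");
        // Create the StanfordCoreNLP object
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String text = "Until the 19th century and the first opium war, Shanghai was considered to be essentially a fishing village. However, in 1914, Shanghai had 200 banks dealing with 80% of its foreign investments in China.";
        // Create an empty Annotation holding just the text
        Annotation document = new Annotation(text);
        System.out.println("Empty Annotation:" + document);
        // Run all the annotators configured above on the text
        pipeline.annotate(document);
        // These are all the sentences in the text.
        // A CoreMap is a map with class objects as keys and values of custom types
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            System.out.println("sentence:" + sentence);
            // A CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                System.out.println("token:" + token);
                // The text of the token (the word)
                String word = token.get(TextAnnotation.class);
                System.out.println("word:" + word);
                // The POS tag of the token
                String pos = token.get(PartOfSpeechAnnotation.class);
                System.out.println("pos:" + pos);
                // The NER label of the token
                String ne = token.get(NamedEntityTagAnnotation.class);
                System.out.println("ne:" + ne);
            }
            // The parse tree of the sentence
            Tree tree = sentence.get(TreeAnnotation.class);
            System.out.println(tree);
            // The dependency graph of the sentence
            SemanticGraph dependencies = sentence.get(CollapsedDependenciesAnnotation.class);
            System.out.println(dependencies);
        }
        // The coreference chains of the document
        Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
        System.out.println(graph);
    }
}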
Output for the first sentence (token listing truncated):

sentence:Until the 19th century and the first opium war, Shanghai was considered to be essentially a fishing village.
token:Until-1
word:Until
pos:IN
ne:O
token:the-2
word:the
pos:DT
ne:DATE
token:19th-3
word:19th
pos:JJ
ne:DATE
token:century-4
word:century
pos:NN
ne:DATE
token:and-5
word:and
pos:CC
ne:O
// Parse tree
(ROOT (S (PP (IN Until) (NP (NP (DT the) (JJ 19th) (NN century)) (CC and) (NP (DT the) (JJ first) (NN opium) (NN war)))) (, ,) (NP (NNP Shanghai)) (VP (VBD was) (VP (VBN considered) (S (VP (TO to) (VP (VB be) (NP (RB essentially) (DT a) (NN fishing) (NN village))))))) (. .)))
// Dependency graph
-> considered/VBN (root)
  -> century/NN (nmod:until)
    -> Until/IN (case)
    -> the/DT (det)
    -> 19th/JJ (amod)
    -> and/CC (cc)
    -> war/NN (conj:and)
      -> the/DT (det)
      -> first/JJ (amod)
      -> opium/NN (compound)
  -> ,/, (punct)
  -> Shanghai/NNP (nsubjpass)
  -> was/VBD (auxpass)
  -> village/NN (xcomp)
    -> to/TO (mark)
    -> be/VB (cop)
    -> essentially/RB (advmod)
    -> a/DT (det)
    -> fishing/NN (compound)
  -> ./. (punct)
// Coreference chains
{1=CHAIN1-["first" in sentence 1], 2=CHAIN2-["Shanghai" in sentence 1, "Shanghai" in sentence 2], 3=CHAIN3-["the 19th century and the first opium war" in sentence 1], 4=CHAIN4-["the 19th century" in sentence 1], 5=CHAIN5-["the first opium war" in sentence 1], 6=CHAIN6-["essentially a fishing village" in sentence 1], 8=CHAIN8-["200" in sentence 2], 9=CHAIN9-["China" in sentence 2], 10=CHAIN10-["1914" in sentence 2, "its" in sentence 2], 11=CHAIN11-["200 banks dealing with 80 % of its foreign investments in China" in sentence 2], 12=CHAIN12-["its foreign investments" in sentence 2]}
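
The bracketed parse tree in the output can be hard to read at a glance. As a reading aid, the sketch below pulls the word/tag leaves out of such a string with a regular expression. It works only on the printed text and is not how CoreNLP exposes trees (the Tree object gives structured access); TreeLeavesSketch and leaves are hypothetical names, and the sketch assumes leaves never contain raw parentheses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper, not part of CoreNLP: reads tag/word pairs out of a
// bracketed parse tree string.
public class TreeLeavesSketch {

    // A leaf looks like "(TAG word)": two space-separated items, no nested parentheses.
    private static final Pattern LEAF = Pattern.compile("\\(([^\\s()]+) ([^\\s()]+)\\)");

    /** Returns the leaves in order, each formatted as "word/TAG". */
    static List<String> leaves(String tree) {
        List<String> result = new ArrayList<>();
        Matcher m = LEAF.matcher(tree);
        while (m.find()) {
            result.add(m.group(2) + "/" + m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Subtree copied from the parse tree output above.
        String subtree = "(NP (NP (DT the) (JJ 19th) (NN century)) (CC and) "
                + "(NP (DT the) (JJ first) (NN opium) (NN war)))";
        // prints [the/DT, 19th/JJ, century/NN, and/CC, the/DT, first/JJ, opium/NN, war/NN]
        System.out.println(leaves(subtree));
    }
}
```

Inner nodes such as NP or VP are skipped because their second item is another bracket rather than a word, so only the innermost (TAG word) pairs match.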


If no SLF4J binding is on the classpath, the following warning is printed at startup:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.


Adding an SLF4J binding such as slf4j-simple to the Maven dependencies removes the warning:

<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.7.12</version>
</dependency>

