Using a Parser to Extract Relationships


Parsing

Parsing is the process of creating a parse tree for a textual unit.

A parse tree is a hierarchical data structure that represents the syntactic structure of a sentence.

Parsing is used for many tasks, including:

  • Machine translation of languages
  • Synthesizing speech from text
  • Speech recognition
  • Grammar checking
  • Information extraction

Coreference resolution is the task of identifying when two or more expressions in a text refer to the same individual or thing. For example, in "He took his cash", the words "he" and "his" refer to the same person.

Relationship types

An interesting site that contains a multitude of relationships is Freebase (https://www.freebase.com/). It is a database of people, places, and things organized by categories. The WordNet thesaurus (http://wordnet.princeton.edu/) also contains a number of relationships.

Common relationship types and examples:

  • Personal: father-of, sister-of, girlfriend-of
  • Organizational: subsidiary-of, subcommittee-of
  • Spatial: near-to, northeast-of, under
  • Physical: part-of, composed-of
  • Interactions: bonds-with, associates-with, reacts-with
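
Whatever the domain, an extracted relationship usually boils down to two entities plus a relation label. The class below is a minimal illustrative sketch of such a container (it is not part of OpenNLP or Stanford NLP, and the field names and example values are my own):

// A minimal, illustrative container for an extracted relationship.
public class Relationship {
    private final String subject;   // e.g., "Mary"
    private final String relation;  // e.g., "sister-of"
    private final String object;    // e.g., "John"

    public Relationship(String subject, String relation, String object) {
        this.subject = subject;
        this.relation = relation;
        this.object = object;
    }

    @Override
    public String toString() {
        return relation + "(" + subject + ", " + object + ")";
    }
}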

Two types of parsing:

  • Dependency: This focuses on the relationship between words
  • Phrase structure: This deals with phrases and their recursive structure

Dependency labels such as subject, determiner, and preposition can be used to find relationships.

Parsing techniques include shift-reduce, spanning tree, and cascaded chunking.

Understanding parse trees

Parse trees represent hierarchical relationships between the elements of a text. A dependency tree, for example, shows the relationships between the grammatical elements of a sentence, while the phrase structure tree below shows how the sentence "The cow jumped over the moon" breaks down into constituents:

(ROOT
  (S
    (NP (DT The) (NN cow))
    (VP (VBD jumped)
      (PP (IN over)
        (NP (DT the) (NN moon))))
    (. .)))
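
If a bracketed parse like this is already available as a string, it can be read back into a tree object for inspection. The fragment below is a small sketch using the Stanford parser's Tree class (assuming the Stanford parser library is on the classpath):

import edu.stanford.nlp.trees.Tree;

// Read a Penn Treebank-style bracketed string back into a Tree
// and pretty-print it with indentation.
Tree tree = Tree.valueOf(
    "(ROOT (S (NP (DT The) (NN cow)) (VP (VBD jumped) "
    + "(PP (IN over) (NP (DT the) (NN moon)))) (. .)))");
tree.pennPrint();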

Using extracted relationships

Think about the knowledge graphs that academia and industry currently focus on: their underlying principle should be a model of the relationships within language, and between language and the background knowledge it is connected to.

Relationships extracted can be used for a number of purposes including:

  • Building knowledge bases
  • Creating directories
  • Product searches
  • Patent analysis
  • Stock analysis
  • Intelligence analysis

There are a number of formats and databases for relationships and information extracted from sources such as Wikipedia:
- Resource Description Framework (RDF): This uses triples such as Yosemite-location-California, where location is the relation. It can be found at http://www.w3.org/RDF/.
- DBPedia: This holds over one billion triples and is an example of a knowledge base created from Wikipedia. It can be found at http://dbpedia.org/About.
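
As a sketch of what such a triple looks like in code, the fragment below builds the Yosemite-location-California triple with Apache Jena 3; Jena is not used elsewhere in this text, and the example.org URIs are placeholders, not real vocabulary terms:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

// Build a single RDF triple: Yosemite -location-> California.
Model model = ModelFactory.createDefaultModel();
Resource yosemite = model.createResource("http://example.org/Yosemite");
Property location = model.createProperty("http://example.org/location");
Resource california = model.createResource("http://example.org/California");
model.add(yosemite, location, california);

// Serialize the triple as Turtle to standard output.
model.write(System.out, "TURTLE");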

Extracting relationships

There are a number of techniques available to extract relationships. These can be grouped as follows:

  • Hand-built patterns
  • Supervised methods
  • Semi-supervised or unsupervised methods
    • Bootstrapping methods
    • Distant supervision methods
    • Unsupervised methods

Hand-built patterns are used when we have no training data.
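
A minimal sketch of a hand-built pattern, using a regular expression of my own invention to pick out a "sister-of" style relationship (the pattern and the sentence are illustrative only):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-built pattern: "<X> is the <relation> of <Y>".
Pattern pattern = Pattern.compile("(\\w+) is the (\\w+) of (\\w+)");
Matcher matcher = pattern.matcher("Mary is the sister of John");
if (matcher.find()) {
    // Prints: sister-of(Mary, John)
    System.out.println(matcher.group(2) + "-of("
        + matcher.group(1) + ", " + matcher.group(3) + ")");
}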

If only a little training data is available, then the Naive Bayes classifier is a good choice. When more data is available, techniques such as SVM, regularized logistic regression, and random forests can be used.
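
As a sketch of the small-data case, the fragment below trains an OpenNLP document categorizer with the naive Bayes algorithm and then classifies a sentence. The training file name (en-relation.train), its labels, and the test sentence are hypothetical, and a recent OpenNLP version (1.8+) is assumed:

import java.io.File;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.ml.naivebayes.NaiveBayesTrainer;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

// Hypothetical training file: one "category text..." sample per line.
ObjectStream<String> lineStream = new PlainTextByLineStream(
    new MarkableFileInputStreamFactory(new File("en-relation.train")), "UTF-8");
ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

// Request the naive Bayes trainer instead of the default maximum entropy one.
TrainingParameters params = TrainingParameters.defaultParams();
params.put(TrainingParameters.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, new DoccatFactory());
DocumentCategorizerME categorizer = new DocumentCategorizerME(model);

// Classify a new (hypothetical) sentence and print the most likely category.
double[] outcomes = categorizer.categorize(
    new String[] {"Mary", "is", "the", "sister", "of", "John"});
System.out.println("Predicted relation: " + categorizer.getBestCategory(outcomes));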

// OpenNLP
String fileLocation = getModelDir() + "/en-parser-chunking.bin";
try (InputStream modelInputStream = new FileInputStream(fileLocation)) {
    ParserModel model = new ParserModel(modelInputStream);
    Parser parser = ParserFactory.create(model);
    String sentence = "The cow jumped over the moon";
    // Return the top three parses
    Parse[] parses = ParserTool.parseLine(sentence, parser, 3);
    for (Parse parse : parses) {
        parse.show();
        parse.showCodeTree();
        System.out.println("Probability: " + parse.getProb());
        Parse[] children = parse.getChildren();
        for (Parse parseElement : children) {
            System.out.println(parseElement.getText());
            System.out.println(parseElement.getType());
            Parse[] tags = parseElement.getTagNodes();
            System.out.println("Tags");
            for (Parse tag : tags) {
                System.out.println("[" + tag + "]" + " type: " + tag.getType()
                    + " Probability: " + tag.getProb() + " Label: " + tag.getLabel());
            }
        }
    }
} catch (IOException ex) {
    // Handle exceptions
}
// Stanford NLP
String parserModel = ".../models/lexparser/englishPCFG.ser.gz";
LexicalizedParser lexicalizedParser = LexicalizedParser.loadModel(parserModel);
String[] sentenceArray = {"The", "cow", "jumped", "over", "the", "moon", "."};
List<CoreLabel> words = Sentence.toCoreLabelList(sentenceArray);
Tree parseTree = lexicalizedParser.apply(words);
parseTree.pennPrint();

// Print the collapsed typed dependencies for the same parse tree
TreePrint treePrint = new TreePrint("typedDependenciesCollapsed");
treePrint.printTree(parseTree);

Finding word dependencies using the GrammaticalStructure class

// Stanford NLP
String sentence = "The cow jumped over the moon.";
TokenizerFactory<CoreLabel> tokenizerFactory =
    PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tokenizer =
    tokenizerFactory.getTokenizer(new StringReader(sentence));
List<CoreLabel> wordList = tokenizer.tokenize();
// lexicalizedParser and parseTree are reused from the previous example
parseTree = lexicalizedParser.apply(wordList);
TreebankLanguagePack tlp = lexicalizedParser.treebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parseTree);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
System.out.println(tdl);

// The same information can be extracted with the gov, reln, and dep methods,
// which return the governor word, the relationship, and the dependent element, respectively.
for (TypedDependency dependency : tdl) {
    System.out.println("Governor Word: [" + dependency.gov()
        + "] Relation: [" + dependency.reln().getLongName()
        + "] Dependent Word: [" + dependency.dep() + "]");
}

Finding coreference resolution entities

// Stanford NLP
String sentence = "He took his cash and she took her change "
    + "and together they bought their lunch.";
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation annotation = new Annotation(sentence);
pipeline.annotate(annotation);

Map<Integer, CorefChain> corefChainMap = annotation.get(CorefChainAnnotation.class);
Set<Integer> set = corefChainMap.keySet();
Iterator<Integer> setIterator = set.iterator();
while (setIterator.hasNext()) {
    CorefChain corefChain = corefChainMap.get(setIterator.next());
    System.out.println("CorefChain: " + corefChain);
    System.out.print("ClusterId: " + corefChain.getChainID());

    CorefMention mention = corefChain.getRepresentativeMention();
    System.out.println(" CorefMention: " + mention
        + " Span: [" + mention.mentionSpan + "]");

    List<CorefMention> mentionList = corefChain.getMentionsInTextualOrder();
    Iterator<CorefMention> mentionIterator = mentionList.iterator();
    while (mentionIterator.hasNext()) {
        CorefMention cfm = mentionIterator.next();
        System.out.println("\tMention: " + cfm + " Span: [" + cfm.mentionSpan + "]");
        System.out.print("\tMention Type: " + cfm.mentionType + " Gender: " + cfm.gender);
        System.out.println(" Start: " + cfm.startIndex + " End: " + cfm.endIndex);
    }
    System.out.println();
}

Extracting relationships for a question-answer system

This process consists of several steps:
1. Finding word dependencies
2. Identifying the type of question
3. Extracting the relevant components
4. Searching for the answer
5. Presenting the answer

// Stanford NLP
String question = "Who is the 32nd president of the United States?";
String parserModel = ".../englishPCFG.ser.gz";
LexicalizedParser lexicalizedParser = LexicalizedParser.loadModel(parserModel);
TokenizerFactory<CoreLabel> tokenizerFactory =
    PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tokenizer =
    tokenizerFactory.getTokenizer(new StringReader(question));
List<CoreLabel> wordList = tokenizer.tokenize();
Tree parseTree = lexicalizedParser.apply(wordList);
TreebankLanguagePack tlp = lexicalizedParser.treebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parseTree);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
System.out.println(tdl);
for (TypedDependency dependency : tdl) {
    System.out.println("Governor Word: [" + dependency.gov()
        + "] Relation: [" + dependency.reln().getLongName()
        + "] Dependent Word: [" + dependency.dep() + "]");
}

// Determining the question type
for (TypedDependency dependency : tdl) {
    if ("nominal subject".equals(dependency.reln().getLongName())
            && "who".equalsIgnoreCase(dependency.gov().originalText())) {
        processWhoQuestion(tdl);
    }
}
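
The processWhoQuestion method called above is not shown in this snippet. A minimal illustrative sketch is given below; it only reports the words joined by the nominal-subject relation, whereas a real implementation would go on to look the answer up in a knowledge source:

import java.util.List;
import edu.stanford.nlp.trees.TypedDependency;

// Illustrative only: report what the "who" question is asking about.
public static void processWhoQuestion(List<TypedDependency> tdl) {
    for (TypedDependency dependency : tdl) {
        if ("nominal subject".equals(dependency.reln().getLongName())) {
            System.out.println("Question focus: "
                + dependency.gov().originalText()
                + " / " + dependency.dep().originalText());
        }
    }
}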