Using Parser to Extract
Parsing
Parsing is the process of creating a parse tree for a textual unit.
A parse tree is a hierarchical data structure that represents the syntactic structure of a sentence.
Parsing is used for many tasks, including:
- Machine translation of languages
- Synthesizing speech from text
- Speech recognition
- Grammar checking
- Information extraction
Coreference is the condition where two or more expressions in a text refer to the same individual or thing; coreference resolution is the task of finding those expressions.
Relationship types
An interesting site that contains a multitude of relationships is Freebase (https://www.freebase.com/), a database of people, places, and things organized by category. The WordNet thesaurus (http://wordnet.princeton.edu/) also contains a number of relationships.
Two types of parsing:
- Dependency: This focuses on the relationship between words
- Phrase structure: This deals with phrases and their recursive structure
Dependencies can use labels such as subject, determiner, and preposition to find relationships.
Parsing techniques include shift-reduce, spanning tree, and cascaded chunking.
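To make the shift-reduce idea concrete, here is a toy illustration that parses the POS tags of "The cow jumped over the moon" with a hand-picked grammar. The rules and class names are illustrative assumptions, not how OpenNLP or Stanford NLP implement shift-reduce; the sketch only shows the shift/reduce cycle itself.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Toy shift-reduce parser over POS tags; grammar rules are assumptions.
public class ShiftReduceToy {
    public static String parse(List<String> tags) {
        Deque<String> stack = new ArrayDeque<>();
        for (String tag : tags) {
            stack.push(tag);      // shift: move the next symbol onto the stack
            reduceAll(stack);     // reduce: apply grammar rules while possible
        }
        return String.join(" ", stack);
    }

    private static void reduceAll(Deque<String> stack) {
        boolean reduced = true;
        while (reduced) {
            reduced = false;
            // Each rule checks the top two stack symbols (top is rightmost)
            if (topIs(stack, "NN", "DT"))       { pop2(stack); stack.push("NP"); reduced = true; }
            else if (topIs(stack, "NP", "IN"))  { pop2(stack); stack.push("PP"); reduced = true; }
            else if (topIs(stack, "PP", "VBD")) { pop2(stack); stack.push("VP"); reduced = true; }
            else if (topIs(stack, "VP", "NP"))  { pop2(stack); stack.push("S");  reduced = true; }
        }
    }

    private static boolean topIs(Deque<String> stack, String top, String below) {
        if (stack.size() < 2) return false;
        Iterator<String> it = stack.iterator();
        return it.next().equals(top) && it.next().equals(below);
    }

    private static void pop2(Deque<String> stack) { stack.pop(); stack.pop(); }

    public static void main(String[] args) {
        // "The cow jumped over the moon" as POS tags
        System.out.println(parse(Arrays.asList("DT", "NN", "VBD", "IN", "DT", "NN")));
        // S
    }
}
```

A real shift-reduce parser also has to decide *when* to shift versus reduce when both are possible, typically with a learned classifier rather than a fixed rule order.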
Understanding parse trees
Parse trees represent hierarchical relationships between elements of text. For example, a dependency tree shows the relationship between the grammatical elements of a sentence.
(ROOT (S (NP (DT The) (NN cow)) (VP (VBD jumped) (PP (IN over) (NP (DT the) (NN moon)))) (. .)))
Using extracted relationships
Think about the knowledge graphs that academia and industry are currently focused on: their underlying principle should be a model of the relationships within language, and between language and the background knowledge it connects to.
Relationships extracted can be used for a number of purposes including:
- Building knowledge bases
- Creating directories
- Product searches
- Patent analysis
- Stock analysis
- Intelligence analysis
There are many databases built using Wikipedia that extract relationships and information, such as:
- Resource Description Framework (RDF): This uses triples such as Yosemite-location-California, where the location is the relation. This can be found at http://www.w3.org/RDF/.
- DBPedia: This holds over one billion triples and is an example of a knowledge base created from Wikipedia. This can be found at http://dbpedia.org/About.
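The triple structure behind RDF and DBpedia can be sketched in plain Java. The `Triple` record and `parseTriple` helper below are illustrative assumptions, not part of any RDF library; they simply model the Yosemite-location-California example from the text.

```java
// Minimal sketch of an RDF-style subject-predicate-object triple.
public class TripleExample {
    public record Triple(String subject, String predicate, String object) {
        @Override
        public String toString() {
            return subject + "-" + predicate + "-" + object;
        }
    }

    // Parses the hyphen-separated form used in the text, e.g.
    // "Yosemite-location-California"
    public static Triple parseTriple(String text) {
        String[] parts = text.split("-");
        return new Triple(parts[0], parts[1], parts[2]);
    }

    public static void main(String[] args) {
        Triple t = parseTriple("Yosemite-location-California");
        System.out.println(t.subject() + " has relation '" + t.predicate()
            + "' to " + t.object());
    }
}
```

Production systems would use a library such as Apache Jena instead, where triples carry full URIs rather than bare strings.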
Extracting relationships
There are a number of techniques available to extract relationships. These can be grouped as follows:
- Hand-built patterns
- Supervised methods
- Semi-supervised or unsupervised methods
  - Bootstrapping methods
  - Distant supervision methods
  - Unsupervised methods
Hand-built patterns are used when we have no training data.
If only a little training data is available, then the Naive Bayes classifier is a good choice. When more data is available, techniques such as SVMs, regularized logistic regression, and random forests can be used.
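As a concrete example of a hand-built pattern, the sketch below applies a Hearst-style "X such as Y" rule with a regular expression. The class and rule are assumptions for illustration, not part of OpenNLP or Stanford NLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-built pattern for relation extraction: "X such as Y" implies
// that Y is an instance of X (a Hearst pattern).
public class PatternExtractor {
    private static final Pattern SUCH_AS =
        Pattern.compile("(\\w+) such as (\\w+)");

    public static List<String> extract(String text) {
        List<String> relations = new ArrayList<>();
        Matcher m = SUCH_AS.matcher(text);
        while (m.find()) {
            // group(1) is the general class, group(2) the specific term
            relations.add(m.group(2) + " is-a " + m.group(1));
        }
        return relations;
    }

    public static void main(String[] args) {
        System.out.println(
            extract("We saw animals such as deer near rivers such as Merced."));
    }
}
```

Real hand-built systems use many such patterns, usually over POS tags or parse trees rather than raw text, which makes them more robust to word order and modifiers.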
//OpenNLP
String fileLocation = getModelDir() + "/en-parser-chunking.bin";
try (InputStream modelInputStream = new FileInputStream(fileLocation)) {
    ParserModel model = new ParserModel(modelInputStream);
    Parser parser = ParserFactory.create(model);
    String sentence = "The cow jumped over the moon";
    // Return the top three parses
    Parse[] parses = ParserTool.parseLine(sentence, parser, 3);
    for (Parse parse : parses) {
        parse.show();
        parse.showCodeTree();
        System.out.println("Probability: " + parse.getProb());
        Parse[] children = parse.getChildren();
        for (Parse parseElement : children) {
            System.out.println(parseElement.getText());
            System.out.println(parseElement.getType());
            Parse[] tags = parseElement.getTagNodes();
            System.out.println("Tags");
            for (Parse tag : tags) {
                System.out.println("[" + tag + "]"
                    + " type: " + tag.getType()
                    + " Probability: " + tag.getProb()
                    + " Label: " + tag.getLabel());
            }
        }
    }
} catch (IOException ex) {
    // Handle exceptions
}
//StanfordNLP
String parserModel = ".../models/lexparser/englishPCFG.ser.gz";
LexicalizedParser lexicalizedParser = LexicalizedParser.loadModel(parserModel);
String[] sentenceArray = {"The", "cow", "jumped", "over", "the", "moon", "."};
List<CoreLabel> words = Sentence.toCoreLabelList(sentenceArray);
Tree parseTree = lexicalizedParser.apply(words);
parseTree.pennPrint();
// Display the typed dependencies for the same parse tree
TreePrint treePrint = new TreePrint("typedDependenciesCollapsed");
treePrint.printTree(parseTree);
Finding word dependencies using the GrammaticalStructure class
//StanfordNLP
String sentence = "The cow jumped over the moon.";
TokenizerFactory<CoreLabel> tokenizerFactory =
    PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tokenizer =
    tokenizerFactory.getTokenizer(new StringReader(sentence));
List<CoreLabel> wordList = tokenizer.tokenize();
parseTree = lexicalizedParser.apply(wordList);
TreebankLanguagePack tlp = lexicalizedParser.treebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parseTree);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
System.out.println(tdl);
// This information can also be extracted using the gov, reln, and dep methods,
// which return the governor word, the relationship, and the dependent element, respectively
for (TypedDependency dependency : tdl) {
    System.out.println("Governor Word: [" + dependency.gov()
        + "] Relation: [" + dependency.reln().getLongName()
        + "] Dependent Word: [" + dependency.dep() + "]");
}
Finding coreference resolution entities
//StanfordNLP
String sentence = "He took his cash and she took her change "
    + "and together they bought their lunch.";
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation annotation = new Annotation(sentence);
pipeline.annotate(annotation);
Map<Integer, CorefChain> corefChainMap = annotation.get(CorefChainAnnotation.class);
Set<Integer> set = corefChainMap.keySet();
Iterator<Integer> setIterator = set.iterator();
while (setIterator.hasNext()) {
    CorefChain corefChain = corefChainMap.get(setIterator.next());
    System.out.println("CorefChain: " + corefChain);
    System.out.print("ClusterId: " + corefChain.getChainID());
    CorefMention mention = corefChain.getRepresentativeMention();
    System.out.println(" CorefMention: " + mention
        + " Span: [" + mention.mentionSpan + "]");
    List<CorefMention> mentionList = corefChain.getMentionsInTextualOrder();
    Iterator<CorefMention> mentionIterator = mentionList.iterator();
    while (mentionIterator.hasNext()) {
        CorefMention cfm = mentionIterator.next();
        System.out.println("\tMention: " + cfm
            + " Span: [" + cfm.mentionSpan + "]");
        System.out.print("\tMention Type: " + cfm.mentionType
            + " Gender: " + cfm.gender);
        System.out.println(" Start: " + cfm.startIndex + " End: " + cfm.endIndex);
    }
    System.out.println();
}
Extracting relationships for a question-answer system
This process consists of several steps:
1. Finding word dependencies
2. Identifying the type of questions
3. Extracting its relevant components
4. Searching for the answer
5. Presenting the answer
//StanfordNLP
String question = "Who is the 32nd president of the United States?";
String parserModel = ".../englishPCFG.ser.gz";
LexicalizedParser lexicalizedParser = LexicalizedParser.loadModel(parserModel);
TokenizerFactory<CoreLabel> tokenizerFactory =
    PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tokenizer =
    tokenizerFactory.getTokenizer(new StringReader(question));
List<CoreLabel> wordList = tokenizer.tokenize();
Tree parseTree = lexicalizedParser.apply(wordList);
TreebankLanguagePack tlp = lexicalizedParser.treebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parseTree);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
System.out.println(tdl);
for (TypedDependency dependency : tdl) {
    System.out.println("Governor Word: [" + dependency.gov()
        + "] Relation: [" + dependency.reln().getLongName()
        + "] Dependent Word: [" + dependency.dep() + "]");
}
// Determining the question type
for (TypedDependency dependency : tdl) {
    if ("nominal subject".equals(dependency.reln().getLongName())
            && "who".equalsIgnoreCase(dependency.gov().originalText())) {
        processWhoQuestion(tdl);
    }
}
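The processWhoQuestion method invoked above is not shown in the listing. The sketch below illustrates the kind of logic it might apply; it works on the raw question text rather than the TypedDependency list so that it runs without the Stanford models, and its class name, regex, and query format are all assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of "who" question processing: pull out the ordinal and topic,
// producing a lookup key a knowledge-base query could then answer.
public class WhoQuestion {
    private static final Pattern WHO_IS = Pattern.compile(
        "Who is the (\\d+)(?:st|nd|rd|th) president of ([^?]+)\\?");

    public static String extractQuery(String question) {
        Matcher m = WHO_IS.matcher(question);
        if (m.find()) {
            // e.g. "32" and "the United States" -> "president 32 United States"
            return "president " + m.group(1) + " "
                + m.group(2).replace("the ", "");
        }
        return null; // not a question shape this sketch handles
    }

    public static void main(String[] args) {
        System.out.println(
            extractQuery("Who is the 32nd president of the United States?"));
        // president 32 United States
    }
}
```

In the real pipeline the same ordinal and topic would be read off the dependency relations (for example, the numeric modifier and prepositional object of the nominal subject), which generalizes far better than a fixed regex.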