Information Extration: Quality Matter. A paper review

来源：互联网发布：mac mysql dmg 编辑：程序博客网时间：2024/04/30 10:32

Paper Reivew
Xu Chu
January 30, 2011
1
1.1
A Quality-Aware Optimizer for Information
Basics
An IE system takes input a document, and output a set of tuples. IE has to
define a extration pattern first. IE system has a few decision threshold, or we
call ”knob”. For instance, the number of rules, minimum confidence level of the
extracted tuples, etc.. We generally use θ to denote the parameters.
Document Retrieval strategies also affect the output quality of the IE system
since the documents retrieved by the strategy is processed later by the IE system.
There are three kinds of document retrievel strategies. The first is scan, the
second is filtered scan, the third is Automatic Query Evaluation(QXtract).
We use output quality, which means the number of good tuples and bad tuples,
as well as execution time to model the quality of an IE system coupled with a
document retrievel strategy.
1.2
Characterize Output Quality
One traditional method is to use precison-recall curves. Precision characterize
perfectness, while recall characterize completeness. However, precision depends
heavily on the distribution of good and bad documents in the test set.
So, we use ROC curve instead, which graphically summarize the trade-off be-
tween different types of errors. True positive rate is the fraction of positives
correctly classified as positives, and false positive rate is fraction of negatives
incorrectly classfied as positives.
The next step is to develop an algorithm to generate ROC curve for a stand-
alone IE system. There are three problems we need to deal with. The first is
that the computation of fp rate needs to know the total number of bad tuples,
which is hard to count since in principle it is infinite. The solution is to use
a pooling-based approach. The second problem is that after computing each
point in the curve, i.e. < tp(θ), f p(θ) >, we don’t have a confidence bound for
the point. The solution is to use ten-fold cross valiation. The third problem is
that while we use θ to characterize the IE settings, the fact is that in practice
there are mutiple ”knobs”. The solution is that we only choose pareto-optimal
settings.
1
Finally, we need to incorporate document retrieval strategies into our ROC
curve analysis. We now define the quality curve(E,R,P). In which E is the IE
system, R is document retrieval strategy, and P is the time point.
1.3
1.3.1
Estimating Output Quality
Statistics Backgrounds
hypergeometric distribution. http://en.wikipedia.org/wiki/Hypergeometric_
distribution
power-law distribution. http://en.wikipedia.org/wiki/Power_law
binomial distribution. http://en.wikipedia.org/wiki/Binomial_distribution
1.4
Estimating Model Parameters
Retrieval-strategy-specific parameters can be typically estimated using a simple
testing phase.
Database-specific parameters relies on the general principle of Maximum-
Likelihoond Estimation(MLE).
1.4.1
Assuming that we know a observed tuple is good or not
First, we estimate |Dg |, |Db |, |De |, using |Dr |, |Dgp |, |Dbp |, |Dep |.
Second, we estimate tuple related parameters. Estimaing power-law distribution
parameters for gd(t), bd(t). Finally, we estimate |Tgood |, |Tbad |
1.4.2
Don’t know a tuple is good or not
Technique1:Rejection-Sampling-Based MLE-Partitioning Approach, similar to
Monte-Carlo.
Technique2:Uncertainty-Preserving Approach.
2 Join Optimization of IE output: Quality Mat-
ters
3 An Algebraic Approach to Rule-Based Infor-
mation Extraction
Rule-based information extraction systems suffer from scalability problems.
This paper proposes a algebraic approach to address such problems.
2
3.1
Data Model and Algebraic Operators
span is the basic data struct, a finite sequence of spans constitute a tuple, and
a set of tuples of the same width constitute a relation.
In addition to the traditional relational operators, this paper introduces another
two kinds of operators, i.e. Span extraction operators and Span aggragation op-
erators.
• εre :Standard Regular Expression Matcher.
• εd :Dictionary Matcher
• Ωo :Overlap Consolidation is used to produce new spans by merging over-
lapping spans
• Ωc :Containment Consolidation is used to discard annotation spans that
are wholly contained within other annotation spans.
• β:Block identifies a large span of text enclosing a set of input spans that
no two successive spans are more than a certain distance apart.
3.2
Optimization
In addition to regular relational operator optimization techniques such as select
pushing down, this paper proposes three optimzation techniques.
Shared Dictionary Matching(SDM) The goal is to avoid multiple scan of
the same dictionary. Furthermore, a single scan over a document can also
match multiple dictionaries simultaneously.
Conditional Evalution(CE) The goal is avoid some subqueries if we already
know that this subquery wouldn’t produce any result because of some
restriction imposed earlier.
Restricted Span Extraction(RSE) The idea is to avoid scan the entire text,
when taking advantage of some previous known facts. There are some
minor complications we need to deal with. Please refer to the original
paper.
3