entity resolution


Stanford Entity Resolution Framework



News

  • Jan. 2012: Our paper on Pay-As-You-Go ER [11] has been accepted to the IEEE Transactions on Knowledge and Data Engineering.

Overview

The goal of the SERF project is to develop a generic infrastructure for Entity Resolution (ER). ER (also known as deduplication, or record linkage) is an important information integration problem: The same "real-world entities" (e.g., customers, or products) are referred to in different ways in multiple data records. For instance, two records on the same person may provide different name spellings, and addresses may differ. The goal of ER is to "resolve" entities, by identifying the records that represent the same entity and reconciling them to obtain one record per entity.

In our approach, the functions that "match" records (i.e., decide whether they represent the same entity) and "merge" them are viewed as black boxes, which permits generic, extensible ER solutions. This generic setting makes ER resemble a database join operation (of the initial set of records with itself), but there are two main differences: (a) in general, we have no knowledge about which records may match, so all pairs of records need to be compared using the match function, and (b) merged records may lead us to discover new matches, so a "feedback loop" must compare them against the rest of the data set.
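The feedback loop described above can be sketched in a few lines of Python. This is a minimal illustration in the spirit of the R-Swoosh algorithm of [1], not the SERF implementation: the names `resolve`, `shares_value`, and `union` are hypothetical, records are modeled as sets of (attribute, value) pairs, and the toy match and merge functions are assumptions chosen only to keep the example small.

```python
from typing import Callable, FrozenSet, List, Tuple

Record = FrozenSet[Tuple[str, str]]  # assumed record model: a set of (attribute, value) pairs


def resolve(records: List[Record],
            match: Callable[[Record, Record], bool],
            merge: Callable[[Record, Record], Record]) -> List[Record]:
    """Resolve records using black-box match and merge functions."""
    pending = list(records)       # records still to be compared
    resolved: List[Record] = []   # records that match nothing else in this list
    while pending:
        current = pending.pop()
        # Look for any already-resolved record that matches the current one.
        partner_index = next(
            (i for i, other in enumerate(resolved) if match(current, other)),
            None,
        )
        if partner_index is None:
            resolved.append(current)
        else:
            # Feedback loop: the merged record goes back into the pending
            # list so that it is compared against everything else again.
            partner = resolved.pop(partner_index)
            pending.append(merge(current, partner))
    return resolved


def shares_value(a: Record, b: Record) -> bool:
    """Toy match function (assumption): records match if they share any pair."""
    return bool(a & b)


def union(a: Record, b: Record) -> Record:
    """Toy merge function (assumption): keep every pair from both records."""
    return a | b


records = [
    frozenset({("name", "J. Doe"), ("phone", "555-1234")}),
    frozenset({("name", "John Doe"), ("email", "jd@example.com")}),
    frozenset({("name", "J. Doe"), ("email", "jd@example.com")}),
    frozenset({("name", "Ann Lee"), ("phone", "555-9999")}),
]

for record in resolve(records, shares_value, union):
    print(sorted(record))
```

The first two input records share no attribute value with each other; they end up in the same output record only because each of them matches a record that carries the third record's values. Real match and merge functions would be domain-specific, and the termination of such a loop depends on their properties.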

Some of the challenges we are addressing in the SERF project include:

  • Performance: Entity resolution algorithms must perform a very large number of comparisons. We identified simple and reasonable properties of the match and merge functions that enable efficient processing, and developed optimal algorithms (see [1]).
  • Distribution: As ER is a compute-intensive process, we develop algorithms for distributing the ER workload across multiple processors. When available, we exploit domain knowledge in the distribution of ER (see [2]).
  • Secondary storage: We are developing optimizations for performing ER when the dataset resolved by one processor does not fit into main memory, so that records are fetched from and written to disk as efficiently as possible (see [3]).
  • Numerical confidences: We consider numerical confidences associated with data records, and extend our framework to manipulate and combine these confidences as records are matched and merged. New algorithms are needed to perform ER efficiently when confidences are involved (see [4]).
  • Negative information: ER can be viewed as a non-monotonic incremental process where previous match or merge decisions may be reconsidered as further records are processed. Maintaining the history of record derivations is key to managing these revisions consistently and efficiently (see [5]).
  • Blocking: We are developing iterative blocking techniques to significantly enhance the ER performance as well as accuracy. When processing a block, we exploit the ER results of previously processed blocks (see [7]).
  • ER Measures: We explore a configurable ER measure (inspired by edit distance) that can accurately evaluate ER results (see [8]).
  • Trio-ER: The Trio-ER system is a new variant of the Trio system tailored specifically as a workbench for entity resolution (see [9]).
  • Evolving Rules: When writing ER applications, the rule for comparing records may change frequently with better understanding of the data, schema, and application. We investigate how to efficiently update an ER result given a new rule for comparing records (see [10]).
  • Pay-As-You-Go ER: Many ER applications need to resolve large data sets efficiently, but do not require the ER result to be exact. We investigate techniques for maximizing the ER quality with minimal work (see [11]).
  • Joint ER: We are developing scalable ER techniques that resolve multiple domains at the same time (see [12]).
  • Disinformation: We are developing disinformation techniques to "dilute" sensitive information that has already been leaked to the public and cannot be deleted (see [14]).
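To make the blocking idea in the list above more concrete, here is a minimal Python sketch of plain (non-iterative) blocking: records are grouped by a blocking key, and only records within the same block are compared. It does not implement the iterative blocking of [7], which additionally propagates the results of one block to others. The function name `candidate_pairs`, the record fields, and the choice of ZIP code as the blocking key are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Dict, Hashable, List, Set, Tuple


def candidate_pairs(records: List[dict],
                    blocking_key: Callable[[dict], Hashable]) -> Set[Tuple[int, int]]:
    """Return index pairs of records that fall into the same block."""
    blocks: Dict[Hashable, List[int]] = defaultdict(list)
    for index, record in enumerate(records):
        blocks[blocking_key(record)].append(index)
    pairs: Set[Tuple[int, int]] = set()
    for members in blocks.values():
        # Only records inside one block are ever compared with each other.
        pairs.update(combinations(members, 2))
    return pairs


people = [
    {"name": "John Doe", "zip": "94305"},
    {"name": "J. Doe", "zip": "94305"},
    {"name": "Ann Lee", "zip": "10001"},
    {"name": "Anne Lee", "zip": "10001"},
    {"name": "Bob Ray", "zip": "60601"},
]

blocked = candidate_pairs(people, lambda record: record["zip"])
all_pairs = len(people) * (len(people) - 1) // 2
print(f"{len(blocked)} of {all_pairs} pairs need a match call")
```

The trade-off is that two records describing the same entity are never compared if they land in different blocks, which is one motivation for exploiting results across blocks.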

Papers

[1] Swoosh: A Generic Approach to Entity Resolution
Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, Jennifer Widom. The VLDB Journal, vol. 18, no. 1, pp. 255-276, Jan. 2009. (available here)

[2] D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution
Omar Benjelloun, Hector Garcia-Molina, Heng Gong, Hideki Kawai, Tait Larson, David Menestrina, Sutthipong Thavisomboon. In 27th IEEE International Conference on Distributed Computing Systems (ICDCS), June 2007. (available here)

[3] Bufoosh: Buffering Algorithms for Generic Entity Resolution
Hideki Kawai, Hector Garcia-Molina, Omar Benjelloun, Tait Larson, David Menestrina, Sutthipong Thavisomboon. Technical Report, 2006. (available here)

[4] Generic Entity Resolution with Data Confidences
David Menestrina, Omar Benjelloun, Hector Garcia-Molina. In First International VLDB Workshop on Clean Databases, Seoul, Korea, September 2006. (available here)

[5] Generic Entity Resolution with Negative Rules
Steven Euijong Whang, Omar Benjelloun, Hector Garcia-Molina. The VLDB Journal, vol. 18, no. 6, pp. 1261-1277, Feb. 2009. (available here)

[6] Generic Entity Resolution in the SERF Project
Omar Benjelloun, Hector Garcia-Molina, Hideki Kawai, Tait Eliott Larson, David Menestrina, Qi Su, Sutthipong Thavisomboon, Jennifer Widom. IEEE Data Engineering Bulletin, vol. 29, no. 2, pp. 13-20, June 2006. (available here)

[7] Entity Resolution with Iterative Blocking
Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, Hector Garcia-Molina. In Proc. 2009 ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), pp. 219-232, Providence, Rhode Island, June 2009. (available here)

[8] Evaluating Entity Resolution Results
David Menestrina, Steven Euijong Whang, Hector Garcia-Molina. In Proc. 36th Int'l Conf. on Very Large Data Bases (PVLDB), pp. 208-219, Singapore, Sept. 2010. (available here)

[9] Trio-ER: The Trio System as a Workbench for Entity-Resolution
Parag Agrawal, Robert Ikeda, Hyunjung Park, Jennifer Widom. Technical Report, 2009. (available here)

[10] Entity Resolution with Evolving Rules
Steven Euijong Whang, Hector Garcia-Molina. In Proc. 36th Int'l Conf. on Very Large Data Bases (PVLDB), pp. 1326-1337, Singapore, Sept. 2010. (available here)

[11] Pay-As-You-Go ER
Steven Euijong Whang, David Marmaros, Hector Garcia-Molina. To appear in IEEE Transactions on Knowledge and Data Engineering, 2012. (available here)

[12] Joint Entity Resolution
Steven Euijong Whang, Hector Garcia-Molina. To appear in Proc. 28th IEEE International Conference on Data Engineering (ICDE), Washington, DC, Apr. 2012. (available here)

[13] Developments in Generic Entity Resolution
Steven Euijong Whang, Hector Garcia-Molina. IEEE Data Engineering Bulletin, vol. 34, no. 3, pp. 51-59, Sept. 2011. (available here)

[14] Disinformation Techniques for Entity Resolution
Steven Euijong Whang, Hector Garcia-Molina. Technical Report, 2011. (available here)


Software

Our first release of the SERF software can be downloaded here.

This package provides an implementation of the R-Swoosh algorithm described in reference [1]. The algorithm takes as input a dataset of records (in XML) and a "MatcherMerger" class that implements functions to match and merge pairs of records, and returns a dataset of resolved records.

A sample dataset of product records is provided as an example, along with a simple MatcherMerger implementation. Products are matched based on the similarity of their titles and prices.
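As a rough illustration of what matching on title and price similarity could look like, the following Python sketch defines a pair of match and merge functions for product records. It is not the MatcherMerger shipped with the package and does not reflect its interface: the record fields, the thresholds `TITLE_THRESHOLD` and `PRICE_TOLERANCE`, and the merge policy are all assumptions. Title similarity is computed with `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

TITLE_THRESHOLD = 0.8   # assumed minimum title similarity, in [0, 1]
PRICE_TOLERANCE = 0.10  # assumed maximum relative price difference


def match(a: dict, b: dict) -> bool:
    """Two products match if their titles are similar and their prices are close."""
    title_similarity = SequenceMatcher(
        None, a["title"].lower(), b["title"].lower()
    ).ratio()
    cheaper, pricier = sorted((a["price"], b["price"]))
    prices_close = pricier - cheaper <= PRICE_TOLERANCE * pricier
    return title_similarity >= TITLE_THRESHOLD and prices_close


def merge(a: dict, b: dict) -> dict:
    """Assumed merge policy: keep the longer title and the lower price."""
    return {
        "title": max(a["title"], b["title"], key=len),
        "price": min(a["price"], b["price"]),
    }


ipod_a = {"title": "Apple iPod nano 8GB", "price": 149.0}
ipod_b = {"title": "Apple iPod Nano 8 GB", "price": 145.0}
walkman = {"title": "Sony Walkman 8GB", "price": 79.0}

print(match(ipod_a, ipod_b), match(ipod_a, walkman))
print(merge(ipod_a, ipod_b))
```

Other merge policies are equally plausible, for example keeping all observed titles and prices; the right choice depends on the application.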

The source code is also included, and is released under the BSD license.


People

Faculty

  • Hector Garcia-Molina
  • Jennifer Widom

Students

  • Steven Whang

Alumni

  • Omar Benjelloun
  • Georgia Koutrika
  • David Menestrina
  • Tyson Condie
  • Johnson (Heng) Gong
  • Hideki Kawai
  • Tait E. Larson
  • Nicolas Pombourcq
  • Qi Su
  • Makoto Tachibana
  • Sutthipong Thavisomboon
  • Martin Theobald
