This research paper addresses the problem of matching records from different unreliable data sources. It introduces new concepts of matching dependencies (MDs) and relative candidate keys (RCKs), and presents the calculus of MDs. The calculus consists of formalization of MDs, a reasoning mechanism for deriving MDs from other MDs, and algorithms for MDs. The first algorithm is used to determine if an MD can be derived from given MDs; the second algorithm is used to deduce RCKs from MDs.
The first section introduces the problem and provides a list of contributions. It also shows some of the applications in which the work can be applied. The second section concerns related works and points out references for works on record matching. The third section formalizes MDs, the semantics of MDs, and the RCKs. The fourth section provides an inference system for MDs and rules for capturing dynamic semantics of MDs. The fifth section provides the MD deduction analysis algorithm with its complexity analysis. The sixth section covers the RCKs deduction algorithm. A detailed discussion on the experimental evaluation of the algorithms is given in the next section. The concluding section is followed by an appendix, which contains proofs related to the MDs inference system.
Knowledge of logic and complexity theory is required to understand the theory given in the paper.