Computing Reviews
A hit-miss model for duplicate detection in the WHO drug safety database
Norén G., Orre R., Bate A. Knowledge discovery in data mining (Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, IL, USA, Aug 21-24, 2005), 459-468, 2005. Type: Proceedings
Date Reviewed: May 4 2006

Data cleaning is an essential first step in the knowledge discovery in databases (KDD) process. Apart from the removal of noise, another critical preprocessing task is the removal of duplicate records from the databases in question. The application of interest to the authors is drug safety, although the techniques they describe have wider applicability.

Norén and coauthors use Copas and Hilton’s hit-miss model [1] for statistical record linkage within the World Health Organization’s (WHO’s) drug safety database. They note in passing that most of the parameters needed for this model are determined by the entire data set, which reduces the risk of overfitting. Moreover, they found that adding the following features improved the performance of the standard hit-miss model: modeling errors in numerical record fields, and incorporating a computationally efficient method of handling correlated record fields.
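To give a rough sense of the underlying idea (this is not the authors' implementation), the following Python sketch scores a pair of records under a simplified hit-miss model in the spirit of [1]: each field of a duplicate record is treated as either a "hit" (copied correctly) or a "miss" (an independent redraw from the background value distribution), and a record pair is scored by the summed per-field log-likelihood ratio of "same case" versus "unrelated cases." The field names, parameters, and toy records are hypothetical, and the sketch omits the authors' extensions for numerical-field errors and correlated fields.

import math

# Hypothetical per-field parameters: the probability that a duplicate
# "misses" on this field, and the background frequencies of its values.
FIELDS = {
    "sex":     {"p_miss": 0.05, "freq": {"M": 0.5, "F": 0.5}},
    "country": {"p_miss": 0.02, "freq": {"SE": 0.1, "US": 0.4, "UK": 0.2, "DE": 0.3}},
    "drug":    {"p_miss": 0.10, "freq": {"drugA": 0.3, "drugB": 0.7}},
}

def field_log_lr(field, x1, x2):
    """Log-likelihood ratio for one field: P(x1, x2 | duplicate) / P(x1, x2 | unrelated)."""
    p = FIELDS[field]["p_miss"]
    freq = FIELDS[field]["freq"]
    f1 = freq.get(x1, 1e-6)
    f2 = freq.get(x2, 1e-6)
    # P(x1, x2 | duplicate): marginalize over the latent true value t of the case;
    # each record emits t with probability (1 - p) ("hit") or a random draw
    # from the background distribution with probability p ("miss").
    p_dup = 0.0
    for t, ft in freq.items():
        e1 = (1 - p) * (t == x1) + p * f1
        e2 = (1 - p) * (t == x2) + p * f2
        p_dup += ft * e1 * e2
    # P(x1, x2 | unrelated) = f1 * f2 (independent records).
    return math.log(p_dup / (f1 * f2))

def pair_score(rec1, rec2):
    """Summed log-likelihood ratio across fields (fields assumed independent)."""
    return sum(field_log_lr(f, rec1[f], rec2[f]) for f in FIELDS)

# Toy usage: an exact agreement scores high; disagreement scores low.
a = {"sex": "F", "country": "SE", "drug": "drugA"}
b = {"sex": "F", "country": "SE", "drug": "drugA"}
c = {"sex": "M", "country": "US", "drug": "drugB"}
print(round(pair_score(a, b), 2), round(pair_score(a, c), 2))

Because the per-field parameters are estimated from the whole database rather than from labeled pairs, a scheme of this kind needs relatively little training data, which is the overfitting advantage the authors note.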

A total of 38 groups of duplicate records had been previously (manually) identified in the WHO drug safety database. The authors’ modified hit-miss model was applied retrospectively to this database. This led, first, to the identification of the most likely duplicates for a given record (with 94.7 percent accuracy) and, second, to the discrimination of duplicates from random matches (with 63 percent recall and 71 percent precision). In short, the authors claim to be able to detect a “significant proportion of duplicates without generating many false leads.” They plan a prospective study, using their modified hit-miss model to highlight suspected duplicates in an unlabeled data subset and following up the results with a manual review.

This paper will appeal to researchers with an interest in KDD, especially in preprocessing in general, and in duplicate record elimination in particular.

Reviewer:  John Fulcher Review #: CR132739 (0703-0287)
[1] Copas, J.; Hilton, F. Record linkage: statistical models for matching computer records. Journal of the Royal Statistical Society, Series A 153, 3 (1990), 287–320.
Categories: Miscellaneous (H.2.m); Data Mining (H.2.8); Health (J.3); Record Classification (H.3.2); Statistical Computing (G.3); Database Administration (H.2.7)
Other reviews under "Miscellaneous":

Data management support for database management. Bayer R., Schlichtiger P. Acta Informatica 21(1): 1-28, 1984. Type: Article. Reviewed: Mar 1 1985

Extracting the extended entity-relationship model from a legacy relational database. Alhajj R. Information Systems 28(6): 597-618, 2003. Type: Article. Reviewed: Oct 23 2003

Static analysis techniques for predicting the behavior of active database rules. Aiken A., Hellerstein J., Widom J. ACM Transactions on Database Systems 20(1): 3-41, 1995. Type: Article. Reviewed: Jan 1 1996
