Computing Reviews, the leading online review service for computing literature.


Automatic discovery of abnormal values in large textual databases
Christen P., Gayler R., Tran K., Fisher J., Vatsalan D. Journal of Data and Information Quality 7(1-2): 1-31, 2016. Type: Article

Date Reviewed: Sep 8 2016

Imagine a world where all databases contain only predefined entry values, where everything works with enumerations alone. You don’t need to think about how to understand what the data is saying, and you don’t need to care about how to correct grammatical errors or inconsistent values that actually mean the same thing. However, we live in the real world, where data is messy, inconsistent, and made up of a lot of unstructured text that we need to process to get a sense of what it is about. This is where data cleaning research comes into play: we need to discover abnormal values first, and then try to find ways to correct them so that they make more sense.
Christen et al. offer different techniques to automatically discover abnormal values in textual databases by reducing the problem to an outlier detection problem. They assume that abnormal values appear less frequently than values considered normal. They start by adopting a q-gram based model (basic set intersection, BSI), a very basic technique for discovering which substrings of length q appear less frequently than the others. Because BSI is so simple, the authors offer a second technique that builds on its output to create a probabilistic language model (PLM), which calculates how likely a value is to appear frequently in the database. Finally, they offer a third technique that uses a support vector machine (SVM) classifier to analyze and identify distinguishing features, namely outliers in the morphological word structures. The experiments apply all of the techniques to four large real-world datasets: the North Carolina Voter Registration database for two different dates, the 2013 KDD Cup data, and the Memetracker dataset from the Stanford Network Analysis Project. They provide a detailed comparative analysis of the techniques on these datasets even though the datasets lack ground truth information.
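To make the q-gram intuition concrete, here is a minimal sketch of frequency-based flagging. It is not the authors' BSI implementation: the function names, the padding characters, and the `min_count` threshold are all assumptions chosen for illustration. It only shows how values built from rarely seen substrings of length q can be surfaced as outlier candidates.

```python
from collections import Counter


def qgrams(value, q=2):
    """Return the overlapping substrings of length q of a padded, lowercased value."""
    # '^' and '$' are hypothetical padding markers for the start and end of a value.
    padded = "^" * (q - 1) + value.lower() + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def qgram_counts(values, q=2):
    """Count, for every q-gram, how many values contain it at least once."""
    counts = Counter()
    for value in values:
        counts.update(set(qgrams(value, q)))
    return counts


def rarity_score(value, counts, q=2, min_count=2):
    """Fraction of the value's distinct q-grams seen in fewer than min_count values."""
    grams = set(qgrams(value, q))
    if not grams:
        return 0.0
    rare = [g for g in grams if counts[g] < min_count]
    return len(rare) / len(grams)


# Toy data (invented for this sketch): three common surnames and one odd entry.
values = ["smith"] * 50 + ["smyth"] * 30 + ["jones"] * 40 + ["x#9q"]
counts = qgram_counts(values)
scores = {v: rarity_score(v, counts) for v in set(values)}
```

In this toy data, every q-gram of the odd entry occurs in only one value, so it receives the highest score, while the common surnames score zero. A real dataset would need a threshold tuned to its size, and frequent-but-wrong values would escape this kind of check, which is the limitation of the frequency assumption noted in the review.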
Because no ground truth is available to calculate the accuracy of the methods, the authors employ a comparative manual evaluation of randomly selected values classified as normal and abnormal by all three techniques. The results suggest that PLM performs better than BSI, as expected, while the success of SVM depends on the chosen features, which are customizable. All in all, the paper provides an extensive survey of the existing methods, and a detailed description and discussion of the methods proposed. I recommend this paper for the following reasons: (1) the baseline assumption that abnormal values should appear less frequently may not always be true, but it is usually the case; (2) researchers who have doubts about the quality of their datasets can easily perform a quick evaluation using a tool that implements the proposed techniques, which the authors have made available online; and (3) data cleaning is a rigorous and complicated task, and the analysis provided in the paper will help researchers understand many aspects of the most current methodologies available.

Reviewer: Gökhan Kul	Review #: CR144742 (1612-0903)

General (H.2.0)


