Computing Reviews
Selecting the right correlation measure for binary data
Duan L., Street W., Liu Y., Xu S., Wu B. ACM Transactions on Knowledge Discovery from Data 9(2): 1-28, 2014. Type: Article
Date Reviewed: May 1 2015

Intelligent data mining algorithms call for reliable indicators of relationships in massive datasets. How should correlation measures be selected for the precise analysis of binary data from different problem areas? Duan et al. critique the strengths and weaknesses of numerous correlation statistics. They offer properties that guarantee the existence of association patterns, make strongly related item sets stand out in binary data investigations, validate the accurate estimation of negative correlations, and keep computed correlations trustworthy as the sample size grows. Computational statisticians ought to read this astounding paper.

Correlation support is the percentage of item coincidences. Let SDFES be the squared deviation of the fixed support from the expected support. Simplified chi-square, an overall gauge of association among items, is the absolute value of the fixed item set size times SDFES divided by the expected support. The authors present two correlation test statistics. The first is the product of the simplified chi-square and the fixed support. The second is the product of the fixed item set size and SDFES, divided by the sum of the expected support and a continuity correction that diminishes the statistic's instability. They provide formulas for the exact upper and lower bounds of 18 correlation measures and illustrate the bounds graphically for various support and data sizes.
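The verbal definitions above can be sketched in a few lines of Python. This is a minimal illustration of the formulas as this review words them, not the authors' code. The names n, tp, ep, and cc are assumptions introduced here: n stands for what the review calls the fixed item set size, tp for the fixed support, ep for the expected support, and cc for the continuity correction, whose value the review does not state. Computing the expected support as the product of the individual item supports (that is, under independence) is likewise an assumption.

```python
from math import prod


def expected_support(item_supports):
    """Expected support of an item set, assuming the items occur independently."""
    return prod(item_supports)


def sdfes(tp, ep):
    """Squared deviation of the fixed support (tp) from the expected support (ep)."""
    return (tp - ep) ** 2


def simplified_chi_square(n, tp, ep):
    """abs(n * SDFES / ep), following the review's wording."""
    return abs(n * sdfes(tp, ep) / ep)


def simplified_chi_square_with_support(n, tp, ep):
    """First test statistic: simplified chi-square times the fixed support."""
    return simplified_chi_square(n, tp, ep) * tp


def simplified_chi_square_with_cc(n, tp, ep, cc):
    """Second test statistic: n * SDFES / (ep + cc).

    cc is a hypothetical parameter here; the caller must supply a value.
    """
    return n * sdfes(tp, ep) / (ep + cc)


if __name__ == "__main__":
    n = 1000                            # hypothetical data size
    ep = expected_support([0.2, 0.2])   # two items, each with support 0.2
    tp = 0.1                            # hypothetical support of the pair
    print(simplified_chi_square(n, tp, ep))
    print(simplified_chi_square_with_support(n, tp, ep))
    print(simplified_chi_square_with_cc(n, tp, ep, cc=0.01))
```

With these made-up inputs, the correction term enlarges the denominator, so the second statistic comes out smaller than the uncorrected simplified chi-square. That is consistent with the review's remark that the correction is there to reduce instability.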

Correlated pair and item set search experiments were performed with synthetic patient and Facebook datasets. The average correlation support of the topmost pairs retrieved and the mean average precision were computed. The two proposed test statistics produced reliable search results. However, in correlated item set searches with simulated and Netflix datasets of items, movies, and transactions, the simplified chi-square with fixed support was less accurate in retrieval performance, due to the uncertainty of the data patterns. The authors credibly articulate the relevant properties and correlation measures for probing different binary datasets.
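For readers unfamiliar with the retrieval metric mentioned above, the following is a minimal sketch of mean average precision in its common textbook form. The review does not say which variant the authors used, so the normalization (dividing by the number of relevant results actually retrieved) and the function names are assumptions made for illustration.

```python
def average_precision(relevance):
    """Average precision of one ranked result list.

    relevance: iterable of truthy/falsy flags in rank order,
    truthy meaning the result at that rank is truly correlated.
    Normalizes by the number of relevant results retrieved (an assumption).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over several ranked lists."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r) for r in runs) / len(runs)


if __name__ == "__main__":
    # Two hypothetical searches: relevant results at ranks 1 and 3, then at rank 2.
    print(mean_average_precision([[1, 0, 1], [0, 1]]))
```

A measure that ranks the truly correlated pairs nearer the top of the list scores higher, which is the sense in which the review calls the search results reliable.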

Reviewer:  Amos Olagunju Review #: CR143406 (1507-0610)
Data Mining (H.2.8 ... )
 
Other reviews under "Data Mining": Date
Feature selection and effective classifiers
Deogun J. (ed), Choubey S., Raghavan V. (ed), Sever H. (ed) Journal of the American Society for Information Science 49(5): 423-434, 1998. Type: Article
May 1 1999
Rule induction with extension matrices
Wu X. (ed) Journal of the American Society for Information Science 49(5): 435-454, 1998. Type: Article
Jul 1 1998
Predictive data mining
Weiss S., Indurkhya N., Morgan Kaufmann Publishers Inc., San Francisco, CA, 1998. Type: Book (9781558604032)
Feb 1 1999

Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®