Computing Reviews, the leading online review service for computing literature.

Selecting the right correlation measure for binary data
Duan L., Street W., Liu Y., Xu S., Wu B. ACM Transactions on Knowledge Discovery from Data 9(2): 1-28, 2014. Type: Article

Date Reviewed: May 1 2015

Intelligent data mining algorithms call for reliable indicators of relationships in massive datasets. How should correlation measures be selected for the precise analysis of binary data from different problem areas? Duan et al. critique the strengths and weaknesses of numerous correlation statistics. They offer properties that guarantee the existence of association patterns beyond any doubt and make extremely related item sets noticeable in binary data investigations; validate the accurate estimation of negative correlations; and provide confidence about computed correlations, irrespective of any increase in sample size. Computational statisticians ought to read this astounding paper.
Correlation support is the percentage of item coincidences. Let SDFES be the squared deviation of the fixed support from the expected support. Simplified chi-square, an overall gauge of association among items, is the absolute value of the fixed item set size times SDFES divided by the expected support. The authors present two correlation test statistics: the product of the simplified chi-square and the fixed support, and the product of the fixed item set size and SDFES divided by the sum of the expected support and a continuity correction that diminishes its instability. They provide formulas for the exact upper and lower bounds of 18 correlation measures and graphically illuminate the bounds for various support and data sizes.
Correlated pair and item set search experiments were performed with synthetic patient and Facebook datasets. The average correlation support of the topmost pairs retrieved and the mean average precision were computed. The two proposed test statistics produced reliable search results. However, in correlated item set searches with simulated and Netflix datasets of items, movies, and transactions, the simplified chi-square with fixed support was less accurate in retrieval performance, due to the uncertainty of the data patterns.
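The statistics described above can be sketched in a few lines of Python. This is a minimal illustration that follows the wording of this review, not the authors' implementation. All function and parameter names are hypothetical. The `size` argument stands for what the review calls the "fixed item set size" and is passed through without further interpretation. Taking the expected support as the product of the individual item supports (independence) is an assumption not stated in the review.

```python
from math import prod
from typing import Iterable


def expected_support(item_supports: Iterable[float]) -> float:
    # Assumption (not stated in the review): expected support is the
    # product of the individual item supports, i.e. items are independent.
    return prod(item_supports)


def sdfes(support: float, expected: float) -> float:
    # Squared deviation of the fixed (observed) support from the expected support.
    return (support - expected) ** 2


def simplified_chi_square(size: float, support: float, expected: float) -> float:
    # Absolute value of size * SDFES / expected support, as worded in the review.
    return abs(size * sdfes(support, expected) / expected)


def chi_square_with_support(size: float, support: float, expected: float) -> float:
    # First test statistic: simplified chi-square times the fixed support.
    return simplified_chi_square(size, support, expected) * support


def chi_square_with_continuity(
    size: float, support: float, expected: float, correction: float
) -> float:
    # Second test statistic: size * SDFES / (expected support + continuity correction).
    # The value of `correction` is left to the caller; the review does not give one.
    return size * sdfes(support, expected) / (expected + correction)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    exp = expected_support([0.5, 0.2])
    print(simplified_chi_square(1000, 0.2, exp))
    print(chi_square_with_support(1000, 0.2, exp))
    print(chi_square_with_continuity(1000, 0.2, exp, 0.1))
```

In this sketch, a larger continuity correction shrinks the second statistic most when the expected support is small, which is where the uncorrected ratio is least stable.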
The authors credibly articulate the germane properties and correlation measures for skillfully probing different binary datasets.

Reviewer: Amos Olagunju	Review #: CR143406 (1507-0610)

Data Mining (H.2.8 ... )

Other reviews under "Data Mining":	Date

Feature selection and effective classifiers Deogun J. (ed), Choubey S., Raghavan V. (ed), Sever H. (ed) Journal of the American Society for Information Science 49(5): 423-434, 1998. Type: Article	May 1 1999

Rule induction with extension matrices Wu X. (ed) Journal of the American Society for Information Science 49(5): 435-454, 1998. Type: Article	Jul 1 1998

Predictive data mining Weiss S., Indurkhya N., Morgan Kaufmann Publishers Inc., San Francisco, CA, 1998. Type: Book (9781558604032)	Feb 1 1999

Reproduction in whole or in part without permission is prohibited. Copyright 1999-2024 ThinkLoud®