Data mining algorithms need reliable indicators of relationships in massive datasets. How should a correlation measure be selected for the analysis of binary data from different problem areas? Duan et al. critique the strengths and weaknesses of numerous correlation statistics. They propose properties that: establish that association patterns genuinely exist and make highly correlated item sets stand out in binary data investigations; validate the accurate estimation of negative correlations; and provide confidence in the computed correlations, regardless of increases in sample size. Computational statisticians ought to read this astounding paper.
Correlation support is the percentage of item coincidences. Let SDFES be the squared deviation of the fixed support from the expected support. Simplified chi-square, an overall gauge of association among items, is the absolute value of the fixed item set size times SDFES divided by the expected support. The authors present two correlation test statistics. The first is the product of the simplified chi-square and the fixed support. The second is the fixed item set size times SDFES, divided by the sum of the expected support and a continuity correction that reduces the statistic's instability. They provide formulas for the exact upper and lower bounds of 18 correlation measures and graphically illustrate the bounds for various support and data sizes.
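The definitions above can be sketched in a few lines of Python. This is only an illustration of the verbal description in this review, not the authors' implementation. It rests on several assumptions: "fixed support" is read as the observed support of the item set, "fixed item set size" as the number of transactions, and the expected support as the product of the individual item supports under independence. The function name, the dictionary keys, and the `correction` parameter are hypothetical, and no value for the continuity correction is given here.

```python
from math import prod


def correlation_statistics(transactions, itemset, correction):
    """Compute the statistics described above for one item set.

    transactions: list of sets of items (binary presence/absence data).
    itemset: the items whose joint correlation is measured.
    correction: continuity correction added to the expected support
        (hypothetical parameter; the review gives no value).
    """
    n = len(transactions)
    items = set(itemset)
    if n == 0 or not items:
        raise ValueError("need at least one transaction and one item")

    # Observed support: fraction of transactions containing every item.
    observed = sum(1 for t in transactions if items <= set(t)) / n

    # Expected support under independence (an assumption of this sketch):
    # the product of the individual item supports.
    expected = prod(sum(1 for t in transactions if i in t) / n for i in items)
    if expected == 0:
        raise ValueError("an item never occurs, so expected support is zero")

    # SDFES: squared deviation of the observed from the expected support.
    sdfes = (observed - expected) ** 2
    simplified_chi2 = abs(n * sdfes / expected)

    return {
        "observed_support": observed,
        "expected_support": expected,
        "sdfes": sdfes,
        "simplified_chi2": simplified_chi2,
        # First test statistic: simplified chi-square times support.
        "chi2_with_support": simplified_chi2 * observed,
        # Second test statistic: continuity correction in the denominator.
        "chi2_with_correction": n * sdfes / (expected + correction),
    }
```

Because the deviation is squared, these quantities do not by themselves show whether a correlation is positive or negative. A caller who needs the direction would have to compare the observed and expected supports separately.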
Correlated pair and item set search experiments were performed with synthetic patient and Facebook datasets. The average correlation support of the topmost pairs retrieved and the mean average precision were computed. The two proposed test statistics produced reliable search results. However, in correlated item set searches with simulated and Netflix datasets of items, movies, and transactions, the simplified chi-square with fixed support retrieved less accurately, owing to the uncertainty of the data patterns. The authors credibly articulate the germane properties and correlation measures for skillfully probing different binary datasets.
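For readers unfamiliar with the retrieval metric mentioned above, the following is a minimal sketch of mean average precision in its standard textbook form. How the authors built their ranked lists and ground truth is not described in this review, so the function names and the input layout (a ranked list paired with a set of truly correlated items) are assumptions made for illustration.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list.

    ranked: retrieved items, best first (assumed to contain no duplicates).
    relevant: the items that count as correct for this search.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    # Normalize by the total number of relevant items, so relevant items
    # that were never retrieved lower the score.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over (ranked, relevant) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

A measure that places the truly correlated pairs near the top of its ranking scores close to 1. One that scatters them down the list scores lower.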