A Bayesian network (BN) is a directed probabilistic graphical model used to represent dependency relationships among variables. Over 50 learning algorithms exist for BNs. This paper proposes a BN learning algorithm aimed at big data: the parallel ensemble-based Bayesian network learning algorithm (PENBays). PENBays combines a new arc score, which assesses the quality of a dataset before it is modeled with a BN, with ensemble learning and a distributed computing model. Datasets are deemed suitable for BN learning when their arc score is larger than -0.5. Five large datasets (more than 500 million rows) are used to show that PENBays outperforms three other BN learning algorithms: max-min hill-climbing (MMHC), the three-phase dependency analysis algorithm (TPDA), and REC. The structural Hamming distance (SHD) metric is used for comparison. SHD is the number of edge insertions, deletions, or flips needed to transform one graph into another.
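To make the SHD definition concrete, here is a minimal Python sketch. It is an illustration only, not the paper's implementation: it assumes both graphs are fully directed, have no self-loops, and are given as collections of (parent, child) pairs. The function name `shd` and the example graphs are hypothetical.

```python
def shd(edges_a, edges_b):
    """Structural Hamming distance between two directed graphs.

    Each graph is a collection of (parent, child) tuples. Every node pair
    whose edge status differs (missing, extra, or reversed) counts as 1.
    Assumes no self-loops, as in a DAG.
    """
    a, b = set(edges_a), set(edges_b)
    # Collect every unordered node pair that carries an edge in either graph.
    pairs = {frozenset(edge) for edge in a | b}
    distance = 0
    for pair in pairs:
        u, v = tuple(pair)
        # Edge status of this pair in each graph: (u->v present, v->u present).
        state_a = ((u, v) in a, (v, u) in a)
        state_b = ((u, v) in b, (v, u) in b)
        if state_a != state_b:
            distance += 1
    return distance


# Hypothetical example: a reference structure and a learned structure.
true_edges = [("A", "B"), ("B", "C"), ("C", "D")]
learned_edges = [("A", "B"), ("C", "B"), ("A", "D")]

# B-C is flipped, C-D is missing, A-D is extra, so the distance is 3.
print(shd(true_edges, learned_edges))
```

Counting a flipped edge as a single operation follows the definition quoted above. Some SHD variants compare partially directed graphs instead, in which case the edge states would also need to cover undirected edges.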
A comparison of execution time and predictive accuracy is missing and would have been a useful addition to this study. In addition, examining how a low arc score affects the algorithm's performance would help determine how important the score is to the overall performance. The key points are introduced succinctly in the paper before being used.
I noticed typographical errors in one passage: “In contrary, data set which yields had BN structure has very low Formula, generally smaller than -0.5, or sometimes -1 or even -2. Hence, Formula could be used as an suitable measure for data set quality.” This should read: “On the contrary, the data set that yields bad BN structure has very low Formula, generally smaller than -0.5, or sometimes -1 or even -2. Hence, Formula could be used as a suitable measure for data set quality.”