Cluster computing is a major trend in scientific high-performance computing (HPC), and its recent evolution requires revisiting the models and methods used to evaluate operational performance. The paper’s main contribution is its investigation of a taxonomy of analytical communication performance models based on communication cost in clusters.
The introduction reviews new modeling methodologies aimed at prediction accuracy, most of them based on message passing interface (MPI) point-to-point (P2P) and collective operations. Next, under the theme of network node communication, the paper discusses the progression from the simple postal model to LogGP, optimized scheduling algorithms in runtime libraries, derived models that address the problem of accurate MPI cost prediction, channel contention and multicore node problems in the network hierarchy, and platform heterogeneity. The progress of the different communication models and their parameters (network delay, overhead, gap per message, and number of processors in the cluster) is illustrated pictorially, and a topological discussion considers the network hierarchy and related message transfer policies such as inter-cluster and intra-cluster communication. Section 3.3, “Communication Contention,” covers node contention, link contention, and controller performance bottlenecks in distributed shared memory (DSM) machines. Platform heterogeneity, middleware costs, scalability, and domain generality versus specificity are also discussed.
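For readers unfamiliar with these parameters, the following minimal sketch shows how they combine in the standard textbook LogP and LogGP point-to-point cost estimates. It is not taken from the paper under review; the class name, function names, and numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogGPParams:
    """Illustrative LogGP parameter set (all values here are made up)."""
    L: float  # network latency (delay), seconds
    o: float  # CPU overhead per send or receive, seconds
    g: float  # minimum gap between consecutive messages, seconds
    G: float  # gap per byte for long messages, seconds per byte
    P: int    # number of processors (carried for completeness, unused below)


def logp_p2p_time(p: LogGPParams) -> float:
    """LogP estimate for one small message: send overhead + latency + receive overhead."""
    return p.o + p.L + p.o


def loggp_p2p_time(p: LogGPParams, nbytes: int) -> float:
    """LogGP estimate for one message of nbytes bytes.

    Adds (nbytes - 1) * G to the LogP cost to account for message length.
    """
    if nbytes < 1:
        raise ValueError("nbytes must be at least 1")
    return p.o + (nbytes - 1) * p.G + p.L + p.o


# Hypothetical parameter values, not measurements from any real cluster.
params = LogGPParams(L=5e-6, o=1e-6, g=2e-6, G=1e-9, P=16)
small = logp_p2p_time(params)
large = loggp_p2p_time(params, 1001)
```

With these assumed values, a one-byte LogGP message costs the same as the LogP estimate, and each additional byte adds G seconds, which is the sense in which LogGP extends LogP to long messages.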
The paper then discusses how to bridge the gap between a model's analytical description and its experimental description with empirical parameters, and reviews the literature on related measurement methods. It also introduces a framework for evaluating model-building methods, along with best practices for evaluating measurement methods. A further section is devoted to improving the estimation accuracy of a communication performance model and discusses the factors that influence model performance. The paper ends with a conclusion and directions for future research.
The paper provides a comprehensive view of the evolution, estimation, and analysis of cluster performance modeling in the HPC ecosystem.