Hardware accelerators are widely used in high-performance computing (HPC) and dominate the clusters on the TOP500 supercomputer list (http://www.top500.org/). A heterogeneous cluster consisting of different accelerators can provide better performance, as each task can be mapped to the accelerator best suited to it, yielding a cumulative application speedup. However, this approach faces challenges in communication efficiency, workload balancing, and application portability.
The communication inefficiency is due to the slow, high-latency bus connecting the hardware accelerators to the main processor. Tsoi and Luk introduce Axel, a heterogeneous cluster consisting of field-programmable gate arrays (FPGAs) and graphics processing units (GPUs). Within each node, Axel uses the peripheral component interconnect express (PCIe) system bus to communicate between the FPGAs and GPUs. Between nodes, there are two levels of communication: at the system level, the nodes communicate over a gigabit Ethernet interface, while the FPGAs form a second internode network over high-bandwidth InfiniBand links.
The authors partition the application dataset into subsets that are processed by the different nodes; it is not clear whether this is done automatically. The application is rewritten using a MapReduce framework and partitioned into tasks targeted at the FPGAs and GPUs. Tsoi and Luk introduce a hardware abstraction model that captures the computation, local memory, and communication requirements of each processing element (PE); if the model does not meet the application's performance constraints, the data and task partitioning steps are repeated. A message passing interface (MPI) layer handles data initialization and communication.

The authors demonstrate the efficiency and performance of the heterogeneous cluster with an N-body simulation, implemented in several configurations: a single-threaded central processing unit (CPU), a multithreaded CPU (using OpenMP), a GPU, an FPGA, and collaborative execution on single and multiple nodes. During collaborative execution, the bottleneck is the latency of the slowest PE and the synchronization overhead between the processes mapped to the FPGA and the GPU. The pipelined FPGA implementation of the N-body simulation uses only three percent of the logic resources, so ten copies of the pipeline fit on a single FPGA, yielding 10-way parallelism. This 10-core FPGA implementation provides the fastest individual acceleration for the N-body simulation, but it required over a month of development time.
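To make the MapReduce-style decomposition concrete, the sketch below partitions a toy one-dimensional N-body dataset into subsets, computes partial forces per subset in "map" tasks (each standing in for work assigned to an FPGA or GPU PE), and merges them in a "reduce" step. This is a minimal illustration in plain Python, not the authors' framework; all function names and the 1-D force model are hypothetical, and the actual system runs over MPI on FPGA/GPU nodes.

```python
# Hypothetical sketch of the data/task partitioning described in the review.
# Each map task computes forces for its subset of bodies against the full
# dataset; the reduce step merges the per-subset results.

def partition(bodies, n_workers):
    """Split the dataset into roughly equal subsets, one per PE."""
    size = (len(bodies) + n_workers - 1) // n_workers
    return [bodies[i:i + size] for i in range(0, len(bodies), size)]

def map_forces(subset, all_bodies, g=1.0):
    """Map task: net 1-D gravitational force on each body in this subset."""
    forces = {}
    for i, (xi, mi) in subset:
        f = 0.0
        for j, (xj, mj) in all_bodies:
            if i != j:
                r = xj - xi
                # Signed inverse-square attraction along the 1-D axis.
                f += g * mi * mj * (1.0 if r > 0 else -1.0) / (r * r)
        forces[i] = f
    return forces

def reduce_forces(partials):
    """Reduce step: merge the per-subset force tables into one result."""
    merged = {}
    for p in partials:
        merged.update(p)
    return merged

# Toy system: (index, (position, mass)) pairs, split across two workers.
bodies = list(enumerate([(0.0, 1.0), (1.0, 1.0), (3.0, 2.0)]))
partials = [map_forces(s, bodies) for s in partition(bodies, 2)]
forces = reduce_forces(partials)
```

Because the map tasks are independent, each subset could run on a different PE; only the reduce step needs synchronization, which mirrors the bottleneck the review notes in collaborative execution.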
Overall, this is a good starting point for researchers who are looking to accelerate HPC applications using heterogeneous clusters. It provides a clear picture of parallelism in FPGAs and GPUs.