Computing Reviews
A hybrid parallel Barnes-Hut algorithm for GPU and multicore architectures
Hannak H., Hochstetter H., Blochinger W. Euro-Par 2013 (Proceedings of the 19th International Conference on Parallel Processing, Aachen, Germany, Aug 26-30, 2013), 559-570, 2013. Type: Proceedings
Date Reviewed: Jan 8 2014

Modularization helps to identify data structures that work efficiently in heterogeneous models where both central processing units (CPUs) and graphics processing units (GPUs) are used. This paper describes a modularized parallelization of the Barnes-Hut algorithm: the authors divide the algorithm into a sequence of distinct steps, each of which depends on the outcome of the preceding ones.
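The kind of step pipeline described above can be illustrated with a minimal sequential sketch: build the spatial tree, summarize each cell's mass and centroid in a bottom-up pass, then evaluate forces top-down with the usual opening-angle criterion. This is my own 2D quadtree simplification under assumed names (`Node`, `summarize`, `force`); the paper's actual decomposition and data structures differ.

```python
import math

THETA = 0.5  # opening-angle threshold for the far-field approximation

class Node:
    """A square region of space holding either one body or four children."""
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # center and half-width
        self.mass = 0.0
        self.mx = self.my = 0.0   # mass-weighted position sums
        self.body = None          # (x, y, m) if a leaf holding one body
        self.children = None      # list of 4 Nodes once subdivided

    def quadrant(self, x, y):
        return (x >= self.cx) + 2 * (y >= self.cy)

    def subdivide(self):
        h = self.half / 2
        self.children = [Node(self.cx + dx * h, self.cy + dy * h, h)
                         for dy in (-1, 1) for dx in (-1, 1)]

    def insert(self, x, y, m):
        if self.children is None and self.body is None:
            self.body = (x, y, m)            # empty leaf: store the body
        else:
            if self.children is None:        # occupied leaf: push body down
                self.subdivide()
                bx, by, bm = self.body
                self.body = None
                self.children[self.quadrant(bx, by)].insert(bx, by, bm)
            self.children[self.quadrant(x, y)].insert(x, y, m)

def summarize(node):
    """Bottom-up pass: accumulate total mass and centroid sums per cell."""
    if node.body is not None:
        x, y, m = node.body
        node.mass, node.mx, node.my = m, m * x, m * y
    elif node.children is not None:
        for c in node.children:
            summarize(c)
            node.mass += c.mass
            node.mx += c.mx
            node.my += c.my

def force(node, x, y):
    """Top-down traversal: far cells are approximated by their centroid (G = 1)."""
    if node.mass == 0.0:
        return (0.0, 0.0)
    if node.body is not None:                # leaf: exact pairwise term
        bx, by, bm = node.body
        if (bx, by) == (x, y):
            return (0.0, 0.0)                # skip self-interaction
        dx, dy = bx - x, by - y
        d = math.hypot(dx, dy)
        return (bm * dx / d**3, bm * dy / d**3)
    px, py = node.mx / node.mass, node.my / node.mass
    dx, dy = px - x, py - y
    dist = math.hypot(dx, dy)
    if dist > 0.0 and 2 * node.half / dist < THETA:
        f = node.mass / dist**3              # monopole approximation
        return (f * dx, f * dy)
    fx = fy = 0.0                            # cell too close: open it
    for c in node.children:
        cfx, cfy = force(c, x, y)
        fx += cfx
        fy += cfy
    return (fx, fy)

# Pipeline: build tree -> summarize cells -> per-body force evaluation
bodies = [(0.1, 0.2, 1.0), (0.8, 0.7, 2.0), (0.5, 0.9, 1.5)]
root = Node(0.5, 0.5, 0.5)
for x, y, m in bodies:
    root.insert(x, y, m)
summarize(root)
forces = [force(root, x, y) for x, y, _ in bodies]
```

Each stage touches the tree with a different access pattern (irregular insertion, bottom-up reduction, top-down traversal), which is exactly what makes the per-step CPU-vs-GPU placement decision in the paper interesting.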

The paper has several strengths. The background of the problem is clear and well written. Data access patterns, computations, and the decisions about CPU or GPU usage for each step are described in detail. Three systems with different configurations are evaluated, and the results show promising performance improvements over purely sequential and parallel CPU implementations, especially on System II. The common technique of overlapping CPU and GPU execution is used to hide the latency of memory transfers to and from the GPU.
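The overlap technique mentioned above can be sketched in plain Python as a double-buffered pipeline: while one chunk is consumed, the next is staged concurrently. Here `host_prepare` and `device_compute` are hypothetical stand-ins for the host-to-device transfer and the kernel launch; a real implementation would use CUDA streams and asynchronous copies rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

def host_prepare(chunk):
    # Stand-in for CPU-side preparation plus the host-to-device copy.
    return [x * 2 for x in chunk]

def device_compute(staged):
    # Stand-in for the GPU kernel consuming already-transferred data.
    return sum(staged)

def pipelined(chunks):
    """Double-buffered pipeline: stage chunk k+1 while chunk k 'computes'."""
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        staged = pool.submit(host_prepare, chunks[0])
        for k in range(len(chunks)):
            current = staged.result()            # wait for chunk k's transfer
            if k + 1 < len(chunks):
                staged = pool.submit(host_prepare, chunks[k + 1])  # overlap
            results.append(device_compute(current))
    return results
```

The point of the structure is that the transfer of chunk k+1 proceeds while chunk k is being computed, so transfer latency is hidden behind useful work whenever the two stages take comparable time.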

The main concern I have with the paper is that the evaluation section is shorter than expected. There is very little analysis of the reported results, and the authors provide almost no evidence or references to back up their proposed explanations. For example, System II achieves speedups of almost 60 times in the best case, while the other two systems improve by at most seven to eight times. The authors attribute this to more powerful CPU and GPU hardware, but provide no numerical evidence to support that claim. Furthermore, the GPU in System III is described as highly capable, which makes it difficult to understand why that system does not show better performance.

Another significant drawback is that the parallelization of the pure CPU version is not described. N-body computations are highly parallelizable in the sense that streaming SIMD extensions (SSE) vectorization can easily be applied for superior parallel CPU performance. The reported results show speedups of less than half the thread count on the CPUs, which makes the parallelization effort questionable for the pure CPU case.

For clarification and correction, the last paragraph of Section 3.2 states that GPUs use relatively "slow memory." This statement is unclear and contradicts the CUDA programming guide. I would have appreciated it if the authors had provided references and explained this point in more depth.
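To illustrate why the direct N-body force computation vectorizes so readily, here is a sketch using NumPy array arithmetic as a stand-in for explicit SSE/AVX intrinsics: the inner pairwise loop becomes bulk array operations with no data-dependent branches. The function name and softening parameter are my own illustration, not the paper's code.

```python
import numpy as np

def direct_forces(pos, mass, eps=1e-3):
    """All-pairs gravitational accelerations (G = 1), fully vectorized.

    pos: (N, 2) positions; mass: (N,) masses; eps softens close encounters.
    Every pairwise term is computed by uniform array arithmetic, which maps
    directly onto SIMD lanes -- the same structure an SSE port would exploit.
    """
    diff = pos[None, :, :] - pos[:, None, :]   # (N, N, 2) separations j - i
    dist2 = (diff ** 2).sum(-1) + eps ** 2     # softened squared distances
    inv_d3 = dist2 ** -1.5                     # 1 / |r|^3 per pair
    np.fill_diagonal(inv_d3, 0.0)              # zero out self-interaction
    # Sum contributions of all bodies j on each body i.
    return (diff * (mass[None, :, None] * inv_d3[:, :, None])).sum(axis=1)
```

A tree code complicates this picture (irregular traversals vectorize poorly), but the per-leaf force kernels retain exactly this regular structure, which is why an undescribed CPU baseline leaves the reported sub-linear CPU scaling hard to interpret.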

Weighing these contributions against the drawbacks, I give this paper a weak recommendation.

Reviewer: Adnan Ozsoy. Review #: CR141874 (1404-0271)
Parallel Processors (C.1.2)
Graphics Processors (I.3.1)
Multiprocessing/Multiprogramming/Multitasking (D.4.1)
Parallel Algorithms (G.1.0)
Parallelism And Concurrency (F.1.2)
Other reviews under "Parallel Processors":

Spending your free time. Gelernter D. (ed), Philbin J. BYTE 15(5): 213-ff, 1990. Type: Article. Reviewed: Apr 1 1992
Higher speed transputer communication using shared memory. Boianov L., Knowles A. Microprocessors & Microsystems 15(2): 67-72, 1991. Type: Article. Reviewed: Jun 1 1992
On stability and performance of parallel processing systems. Bambos N., Walrand J. (ed) Journal of the ACM 38(2): 429-452, 1991. Type: Article. Reviewed: Sep 1 1992
