Computing Reviews
Exploiting hierarchical locality in deep parallel architectures
Anbar A., Serres O., Kayraklioglu E., Badawy A., El-Ghazawi T. ACM Transactions on Architecture and Code Optimization 13(2): 1-25, 2016. Type: Article
Date Reviewed: Aug 26 2016

Locality awareness in programs can be used to improve their execution performance on parallel computers. Modern parallel computers offer many levels of parallelism: many cores on a chip and many chips in a node, for example. Locality awareness is the affinity of parallel threads to the distributed data they access. The paper describes a runtime system that takes locality information expressed in a program and maps it onto a multilevel parallel computer to improve the program's execution time. It describes the internals of the system and presents performance results from various experiments.

The paper describes the parallel hierarchical locality abstraction model (PHLAME). The first section presents bandwidth graphs for a modern parallel computer, showing the bandwidths available at various levels of its organization. The motivation for the work is to reduce communication among threads executing on different cores and processors. The second section surveys similar projects from the past few years.

In the third section, the internals of the PHLAME runtime system are described. The PHLAME implementation model consists of a locality-aware programming model, a mapping evaluation mechanism, mapping strategies, descriptive models, and a runtime system mapping. The formalism covers a machine description, application profiling, the fitness of integrating threads, the partitioning of thread interaction graphs, and a PHLAME adaptive selection test algorithm. Partitioning algorithms such as clustering, restricted splitting, and nonrestricted splitting are described. The adaptive selection algorithm quickly chooses the best partitioning algorithm. The next section describes the performance of these algorithms on a real multilevel parallel computer. The benchmarks chosen for the performance experiments are the NAS (NASA Advanced Supercomputing) parallel benchmarks, written in the message passing interface (MPI) and Unified Parallel C (UPC). The last section discusses the improvement in performance.
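To give a flavor of the partitioning step described above: a thread interaction graph records, for each pair of threads, how much they communicate, and a partitioner groups heavily communicating threads so they land in the same level of the machine hierarchy (for example, the same node). The sketch below is a minimal greedy clustering illustration, not the paper's actual algorithm; the function name, the dictionary-based graph encoding, and the single `group_size` capacity constraint are all assumptions made for illustration.

```python
from itertools import combinations

def cluster_threads(num_threads, comm, group_size):
    """Greedy clustering sketch: repeatedly merge the pair of groups
    with the heaviest inter-group communication, as long as the merged
    group still fits within one hierarchy level (e.g., cores per node).

    comm maps (i, j) with i < j to the communication volume between
    threads i and j; missing pairs communicate nothing.
    """
    groups = [{t} for t in range(num_threads)]

    def weight(a, b):
        # Total communication crossing the boundary between groups a and b.
        return sum(comm.get((min(i, j), max(i, j)), 0) for i in a for j in b)

    while True:
        best = None
        for a, b in combinations(groups, 2):
            if len(a) + len(b) <= group_size:
                w = weight(a, b)
                if w > 0 and (best is None or w > best[0]):
                    best = (w, a, b)
        if best is None:  # no feasible merge reduces cross-group traffic
            break
        _, a, b = best
        groups.remove(a)
        groups.remove(b)
        groups.append(a | b)
    return groups

# Four threads: 0-1 and 2-3 communicate heavily; cross traffic is light.
comm = {(0, 1): 100, (2, 3): 90, (1, 2): 5}
print(sorted(sorted(g) for g in cluster_threads(4, comm, 2)))
# -> [[0, 1], [2, 3]]
```

With a capacity of two threads per group, the heavy pairs end up co-located, so the expensive communication stays within a node and only the light (1, 2) traffic crosses the slower inter-node link.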

The performance gains are shown to vary from 2 to 80 percent, which means this optimization approach can usefully be combined with other optimizations. The paper claims to be unique in extracting multilevel communication improvements from single-level locality-aware programs, which saves the effort of rewriting existing algorithms. The approach also opens a new application area for graph partitioning algorithms and packages. The formalism in the paper is dedicated to explaining the setup of the approach.

Reviewer: Maulik A. Dave. Review #: CR144712 (1611-0822)
Processors (D.3.4)
Concurrent Programming (D.1.3)
Other reviews under "Processors":
The IBM family of APL systems
Falkoff A. IBM Systems Journal 30(4): 416-432, 1991. Type: Article
Dec 1 1993
Attribute grammars: attribute evaluation methods
Engelfriet J., Cambridge University Press, New York, NY, 1984. Type: Book (9780521268431)
Jun 1 1985
Processor control flow monitoring using signatured instruction streams
Schuette M., Shen J. IEEE Transactions on Computers 36(3): 264-277, 1987. Type: Article
Dec 1 1987
