Computing Reviews
Parallel programming model for the Epiphany many-core coprocessor using threaded MPI
Ross J., Richie D., Park S., Shires D. Microprocessors & Microsystems 43(C): 95-103, 2016. Type: Article
Date Reviewed: Aug 29 2016

The paper implements a parallel programming model based on a threaded message passing interface (MPI) on a specialized architecture, the Adapteva Epiphany. The Epiphany IV architecture comprises 64 RISC cores with minimal functionality, arranged in a 2D tiled mesh network-on-chip (NoC). The authors run threaded MPI on four benchmarks: dense matrix-matrix multiplication, N-body particle interaction, five-point 2D stencil update, and 2D fast Fourier transform (FFT). According to the authors, results show “high computational energy efficiency [and scalability] for both integer and floating point calculations.”

Because this research uses threaded MPI, the “device must be accessed as a coprocessor and each core executes threads [with] a highly constrained set of resources,” where threads “will then employ conventional MPI semantics for inter-thread communication.” The matrix-matrix multiplication benchmark reached about 12.02 GFLOPS, or 74 percent of the 16.2 GFLOPS peak performance seen in [1], using the Epiphany IV architecture. With the OpenCL kernel, the implementation obtained a peak performance of 16 MFLOPS for the 128 x 128 scenario. Threaded MPI showed much higher performance in applications with high arithmetic intensity, and “by addressing the inter-core communication with non-blocking communication schemes to overlap computation with communication.”
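The benefit of overlapping computation with communication can be illustrated with a toy cost model. This is only a sketch: the function names and the timings are hypothetical, chosen for illustration, and are not measurements or code from the paper.

```python
def step_time(compute_s, comm_s, overlap):
    """Time for one step of a core that must both compute and exchange data."""
    if overlap:
        # Non-blocking transfers proceed while the core computes,
        # so the slower of the two activities sets the pace.
        return max(compute_s, comm_s)
    # Blocking transfers: communication and computation happen back to back.
    return compute_s + comm_s


def efficiency(compute_s, comm_s, overlap):
    """Fraction of the step time spent on useful arithmetic."""
    return compute_s / step_time(compute_s, comm_s, overlap)


# Hypothetical timings in arbitrary units (assumed, not from the paper).
cases = [("high arithmetic intensity", 8.0, 2.0),
         ("low arithmetic intensity", 2.0, 8.0)]
for name, compute_s, comm_s in cases:
    blocking = efficiency(compute_s, comm_s, overlap=False)
    overlapped = efficiency(compute_s, comm_s, overlap=True)
    print(f"{name}: blocking={blocking:.2f}, overlapped={overlapped:.2f}")
```

Under this simplified model, overlap hides communication entirely when computation dominates, but helps little when communication dominates.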

For the N-body application, the highest measured performance was 8.28 GFLOPS (43 percent of peak performance). Similarly, the five-point 2D stencil application had a highest measured performance of 6.35 GFLOPS (33 percent of peak performance), which was lower because of the effectively low inter-core bandwidth involved in this program.
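For readers unfamiliar with the benchmark, a five-point 2D stencil update replaces each interior grid point with a weighted sum of itself and its four nearest neighbors. The following is a minimal single-core sketch; the function name and the weights are assumptions for illustration, not the authors' implementation.

```python
def five_point_step(grid, center=0.5, neighbor=0.125):
    """Return a new grid after one five-point stencil update.

    Boundary values are copied unchanged; the weights are illustrative.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so the update reads only old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = center * grid[i][j] + neighbor * (
                grid[i - 1][j] + grid[i + 1][j] +
                grid[i][j - 1] + grid[i][j + 1]
            )
    return new


# A single hot point in the middle of a 3 x 3 grid.
grid = [[0.0, 0.0, 0.0],
        [0.0, 8.0, 0.0],
        [0.0, 0.0, 0.0]]
print(five_point_step(grid)[1][1])
```

Each point needs only a handful of arithmetic operations per value read. On a tiled many-core device, each core would typically hold one tile of the grid and exchange its edge rows and columns with neighboring cores at every step, which is why inter-core bandwidth matters for this benchmark.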

Finally, for the 2D FFT program, the highest measured performance was 2.50 GFLOPS (13 percent of peak performance), the lowest among all four benchmarks. Every application has its “computation-to-communication ratio so that performance is mostly limited by communication.”
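The review quotes percentages without stating the peak they refer to. As a quick consistency check, the peak implied by each figure can be back-computed from the quoted numbers alone; the helper name below is hypothetical.

```python
# (measured GFLOPS, quoted percent of peak) as given in this review.
quoted = {
    "N-body": (8.28, 43),
    "five-point 2D stencil": (6.35, 33),
    "2D FFT": (2.50, 13),
}


def implied_peak(gflops, percent):
    """Peak performance implied by a measurement and its percent of peak."""
    return gflops / (percent / 100.0)


for name, (gflops, percent) in quoted.items():
    print(f"{name}: implied peak = {implied_peak(gflops, percent):.2f} GFLOPS")

# The matrix-matrix figure is quoted against a different baseline,
# the 16.2 GFLOPS result from [1], not against the device peak.
matmul_percent = 100.0 * 12.02 / 16.2
print(f"matrix-matrix: {matmul_percent:.0f} percent of the result in [1]")
```

The three implied peaks agree to within rounding, which suggests the percentages share one baseline, while the 74 percent for matrix-matrix multiplication is relative to the result in [1] and is not directly comparable.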

As a result, this study demonstrates that the threaded MPI programming model is suitable for the Epiphany architecture and is “capable of achieving high performance and efficiency for a range of parallel applications.”

The Epiphany architecture is a 2D tiled mesh architecture that may be better suited to a shared memory model. Because threaded MPI is used, threads can access related data through a shared inter-core model; thus, the communication cost should be relatively low and high performance can be achieved. Perhaps the OpenMP model, which is better suited to shared memory, should have been compared using these benchmarks. The authors state that this architecture is comparable to graphics processing units (GPUs) and other recent processors. It would have been useful to compare these benchmarks against existing GPUs and recent architectures with more cache functionality to reduce the access time involved in inter-core data movements.

A similar threaded MPI model is used in distributed computing; the communication cost would make it unsuitable for the MPI platform if inter-core communication were involved. However, combining MPI and OpenMP for shared memory access might prove suitable for an architecture like Epiphany. When little communication is involved in passing data between cores, MPI can achieve more parallelism, as in a shared memory architecture. Not using cache memory would conserve energy, making the design energy efficient. However, energy efficiency is not discussed much in this paper, and the scalability issue should have been presented more clearly.

Reviewer: J. Arul
Review #: CR144715 (1611-0804)

[1] Varghese, A.; Edwards, B.; Mitra, G.; Rendell, A. P. Programming the Adapteva Epiphany 64-core network-on-chip coprocessor. In Proc. of the 2014 IEEE 28th International Parallel & Distributed Processing Symposium Workshops. IEEE, 2014, 984–992.
Interconnections (Subsystems) (B.4.3)
Design (B.5.1)
Multiple Data Stream Architectures (Multiprocessors) (C.1.2)