Over the last decade, superscalar processors have been widely used in high-performance computing, and have even replaced the vector architectures that were popular during the 1980s. However, the development of the Japanese Earth Simulator and the Cray X1 supercomputer has revived interest in vector architectures. In this context, this paper examines how the performance characteristics of superscalar and vector architectures differ, and addresses the question of how programs tuned for superscalar platforms can be transformed into programs that run efficiently on vector architectures. The execution platforms considered are an IBM Power4 and a Cray X1. The performance evaluation is based on a synthetic memory access probe (Apex-Map) and several application programs: one-dimensional fast Fourier transform (1D-FFT), radix sort, Nbody simulation, and matrix multiplication.
Based on a detailed evaluation of the performance characteristics of the applications, the paper shows that the average vector length used and memory bank conflicts have the largest impact on the performance of the Cray X1. The memory size accessed and data reuse have a much smaller effect.
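To illustrate why the access stride matters for bank conflicts, the following is a minimal sketch of a toy interleaved-memory model. It is not taken from the paper: the function names, the bank count of 16, and the rule that word i lives in bank i mod num_banks are all assumptions made for illustration, and they do not describe the actual memory organization of the Cray X1.

```python
from math import gcd


def banks_touched(stride, num_banks=16, accesses=64):
    """Count the distinct banks hit by `accesses` word accesses at a fixed stride.

    Toy model (assumption): word address a is stored in bank a % num_banks.
    """
    return len({(i * stride) % num_banks for i in range(accesses)})


def banks_touched_closed_form(stride, num_banks=16):
    """Same count in closed form, valid once the access pattern has wrapped around."""
    return num_banks // gcd(stride, num_banks)


if __name__ == "__main__":
    for stride in (1, 2, 3, 8, 16):
        print(stride, banks_touched(stride), banks_touched_closed_form(stride))
```

In this model, a unit or odd stride spreads accesses over all banks, while a stride that shares a large factor with the bank count concentrates them on a few banks, which is the kind of pattern that tends to cause conflicts.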
Using these findings as guidelines for performance tuning, the authors were able to increase the performance of 1D-FFT and radix sort significantly. For the Nbody simulation, they obtained a performance gain of only 20 percent, mainly because its irregular memory accesses limit the potential for vectorization.
The paper is written for expert readers who are interested in the characteristics of modern vector architectures and their impact on program performance, as well as in the tuning of programs for vector architectures.