The Basic Linear Algebra Subprograms (BLAS) consist of three libraries (known as Levels 1, 2, and 3) and form an integral part of much of the important numerical software developed over the last two decades. Efficient implementations of these libraries often lead in turn to large gains in the efficiency of higher-level routines, such as the LAPACK library of linear algebra software. Many vendors supply versions of the BLAS tuned to a particular platform, and these are often hand-coded to extract the best performance from the target hardware.
The hierarchical memory organization common to many current systems makes the development of efficient, platform-specific Level 3 BLAS (which perform matrix-matrix operations) especially challenging and expensive. The authors of this pair of papers show how it is possible to produce an efficient BLAS Level 3 library based on highly optimized versions of the single Level 3 routine that performs a general matrix-matrix multiply (GEMM) and a small number of simpler Level 1 and Level 2 routines. They also provide a model implementation of their GEMM-based routines along with comprehensive benchmarking software that allows users to measure the quality of vendor implementations against an efficient, portable, Fortran 77 version of the library.
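The GEMM-based idea can be sketched in miniature. The following is our own illustrative Python, not the authors' Fortran 77 model implementation: a triangular matrix-matrix multiply (in BLAS terms, a TRMM-like operation, B := L·B for lower-triangular L) is partitioned into blocks so that the small diagonal blocks are handled by a simple Level-2-style kernel while the bulk of the arithmetic is routed through a general multiply. All names here (`gemm`, `trmm_block`, `gemm_based_trmm`, block size `nb`) are our own, assumed for illustration.

```python
def gemm(A, B, C):
    """C := A @ B + C for rectangular blocks given as lists of lists.
    Stands in for the single highly optimized GEMM kernel."""
    for i in range(len(A)):
        for j in range(len(B[0])):
            C[i][j] += sum(A[i][p] * B[p][j] for p in range(len(B)))

def trmm_block(L, B):
    """B := L @ B for a small lower-triangular diagonal block
    (a simple Level-2-style kernel)."""
    for j in range(len(B[0])):
        # Bottom-up order: row i reads only rows p <= i, still unmodified.
        for i in range(len(L) - 1, -1, -1):
            B[i][j] = sum(L[i][p] * B[p][j] for p in range(i + 1))

def gemm_based_trmm(L, B, nb=2):
    """B := L @ B for lower-triangular n x n L and n x m B,
    blocked with block size nb so most flops go through gemm()."""
    n = len(L)
    starts = list(range(0, n, nb))
    for bi in reversed(range(len(starts))):   # bottom block row first
        r0, r1 = starts[bi], min(starts[bi] + nb, n)
        Bd = [B[i] for i in range(r0, r1)]    # rows of B, shared in place
        # Diagonal block: small triangular multiply.
        Ld = [row[r0:r1] for row in L[r0:r1]]
        trmm_block(Ld, Bd)
        # Off-diagonal blocks: dense, so the work lands in GEMM.
        for bj in range(bi):
            c0, c1 = starts[bj], min(starts[bj] + nb, n)
            Lb = [row[c0:c1] for row in L[r0:r1]]
            Bj = [B[i] for i in range(c0, c1)]  # rows still unmodified
            gemm(Lb, Bj, Bd)
```

The point of the restructuring is exactly the one the papers make: only one routine (here `gemm`) needs platform-specific tuning, and every other Level 3 operation inherits its performance.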
The papers provide a detailed account of the strategies required to squeeze the last drop of efficiency from today’s processors. Although targeted at the matrix-matrix multiply, the lessons and techniques employed will be valuable to anyone wishing to obtain similar performance from other higher-level numerical operations.