The Cholesky decomposition of a symmetric positive definite matrix is the basis for many efficient and numerically accurate algorithms. The usual procedure is a variant of Gaussian elimination without pivoting, but the computation can be reordered in a variety of ways without affecting the accuracy of the result. As computer technology has developed, researchers have looked for ways to exploit the design of particular computer chips. For chips with high-speed cache memory or floating-point registers, there are advantages to viewing the matrix as an array of square sub-matrices, or blocks, and recasting the algorithm as a computation on those sub-matrices. This paper, a condensed version of [1], describes implementations of level-3 procedures from the Basic Linear Algebra Subprograms (BLAS) package. The authors report tests of these implementations on several computer platforms.
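The blocked recasting described above can be sketched as a right-looking factorization: each step factors a diagonal block, solves a triangular system for the panel below it, and updates the trailing sub-matrix with a rank-nb update. This is a minimal NumPy sketch of that structure, not the authors' code; the function name and block size `nb` are assumptions for illustration.

```python
import numpy as np

def blocked_cholesky(A, nb=2):
    """Right-looking blocked Cholesky: returns lower-triangular L with A = L @ L.T.

    Each iteration performs the three level-3 operations of the blocked
    algorithm: factor the diagonal block, triangular-solve the panel
    below it (a TRSM-like step), and update the trailing sub-matrix
    (a SYRK-like step).
    """
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Cholesky of the current diagonal block (unblocked factorization).
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            L11 = A[k:e, k:e]
            # Panel solve: L21 = A21 * L11^{-T}, computed as (L11^{-1} A21^T)^T.
            A[e:, k:e] = np.linalg.solve(L11, A[e:, k:e].T).T
            # Trailing update: A22 <- A22 - L21 * L21^T.
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)
```

Because the panel solve and trailing update are matrix-matrix operations on whole blocks, they map directly onto level-3 BLAS calls (DTRSM and DSYRK) and reuse each block while it is resident in cache.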
One step of the block version of Cholesky requires, in turn, the Cholesky decomposition of each diagonal block. The authors have also recast this decomposition in block form with tiny blocks, 2-by-2 or 2-by-4. They then use inline code rather than loops, and matrix elements are loaded into scalar variables in order, as the authors say, “to alert most compilers to put and hold these scalars in registers.” They tested four routines, DPOTF3i (i = a, b, c, d), which use different block shapes, against other block implementations that use only existing level-3 routines. The results show that programs using the DPOTF3i routines can attain higher Mflop/s rates for larger block sizes. The paper includes an extensive bibliography on block algorithms and packed matrix formats.
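The flavor of the tiny-block, loop-free style can be conveyed with a fully unrolled Cholesky of a single 2-by-2 symmetric positive definite block. This is an illustrative sketch only, not the authors' DPOTF3i code; in Fortran or C, writing the kernel this way, with entries held in named scalars, is what encourages the compiler to keep them in registers.

```python
import math

def chol2x2(a00, a10, a11):
    """Fully unrolled Cholesky of a 2-by-2 SPD block [[a00, a10], [a10, a11]].

    No loops and no array indexing in the inner computation: the
    entries live in scalar variables, mirroring (in spirit) the
    inline register-blocked kernels described in the review.
    Returns the lower-triangular factor entries (l00, l10, l11).
    """
    l00 = math.sqrt(a00)          # L(0,0) = sqrt(A(0,0))
    l10 = a10 / l00               # L(1,0) = A(1,0) / L(0,0)
    l11 = math.sqrt(a11 - l10 * l10)  # L(1,1) from the Schur complement
    return l00, l10, l11
```

For example, the block [[4, 2], [2, 5]] factors into l00 = 2, l10 = 1, l11 = 2, since [[2, 0], [1, 2]] times its transpose reproduces the block.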