At the core of many algorithms in scientific computing lies the task of summing many floating-point numbers. It is therefore important to have algorithms that handle this problem fast and accurately; a key issue in this context is the propagation of rounding errors. Zhu and Hayes’ algorithm, OnlineExactSum, presents an impressive solution to this problem.
As its name suggests, the algorithm provably computes the correctly rounded value of the exact sum of the input data. Moreover, the input data stream can be arbitrarily long, while the algorithm needs to see only one input at a time. The only restriction imposed is that the intermediate results must not overflow. Under this assumption, the authors prove the correctness of their method and demonstrate that it runs very fast, requiring only five floating-point operations per summand. With instruction-level parallelism, its runtime exceeds that of ordinary recursive summation (which requires just one floating-point operation per summand but can be highly inaccurate) only by a factor of two to three.
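The accuracy gap being closed here is easy to demonstrate. The following sketch uses Python's `math.fsum` merely as a stand-in for a correctly rounding summation routine (it implements Shewchuk's algorithm, not Zhu and Hayes' method); the data are an illustrative example, not taken from the paper:

```python
import math

# Ordinary recursive summation (one flop per summand) can lose
# information entirely: the 1.0 is absorbed by the large first term.
data = [1e100, 1.0, -1e100]

naive = 0.0
for x in data:
    naive += x

print(naive)            # 0.0 -- the contribution of 1.0 is lost
print(math.fsum(data))  # 1.0 -- the correctly rounded exact sum
```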
The method builds on HybridSum, the authors’ previous algorithm [1], which in turn rests on an idea of Demmel and Hida [2]: use extra-precision accumulators to sum floating-point numbers with identical exponents, then apply a fast and accurate method (iFastSum) to compute a correctly rounded sum of these accumulators. The new idea is to double the number of accumulators, thus avoiding the need to split the input data. The algorithm is not only fast; its memory requirements are also very modest and, in particular, independent of the length of the input stream.
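The two-accumulator idea can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses Knuth's unconditional TwoSum (six flops) in place of the cheaper exponent-aware update that achieves five flops per summand, dictionaries in place of fixed accumulator arrays, and `math.fsum` as a stand-in for iFastSum; it also omits the periodic compaction step the real algorithm performs on its accumulators:

```python
import math

def two_sum(a, b):
    # Knuth's error-free transformation: returns (s, err) with
    # s = fl(a + b) and a + b = s + err exactly (absent overflow).
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def online_exact_sum_sketch(xs):
    # Two accumulator maps indexed by each summand's binary exponent:
    # a1 holds the leading part of the per-exponent sums, a2 collects
    # the rounding errors recovered by two_sum.
    a1, a2 = {}, {}
    for x in xs:
        _, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
        s, err = two_sum(a1.get(e, 0.0), x)
        a1[e] = s
        a2[e] = a2.get(e, 0.0) + err
    # math.fsum stands in for iFastSum: a correctly rounded sum of the
    # (boundedly many) accumulator values.
    return math.fsum(list(a1.values()) + list(a2.values()))
```

Note that the memory footprint depends only on the range of exponents, not on the length of the input stream, mirroring the property described above.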
A nice side effect of the method always producing the correct result is that it always produces the same result, no matter in which order the summands are processed. This matters, for example, in parallel computations such as matrix/vector multiplications or scalar products, where each processor handles a part of the vectors or matrices under consideration; different runs of the program with identical input data may then, due to varying processor workloads, deliver intermediate results in different orders. With less sophisticated algorithms, rounding errors then propagate in different ways, and reproducibility of the results is lost.
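A small experiment makes the reproducibility point concrete; again `math.fsum` serves only as a stand-in for a correctly rounding summation method, and the two orderings mimic partial results arriving from processors in different orders:

```python
import math

# The same summands in two different arrival orders.
order_a = [1e100, 1.0, -1e100]
order_b = [1e100, -1e100, 1.0]

def naive(xs):
    # Ordinary recursive summation: result depends on the order.
    s = 0.0
    for x in xs:
        s += x
    return s

print(naive(order_a), naive(order_b))          # 0.0 1.0 -- not reproducible
print(math.fsum(order_a), math.fsum(order_b))  # 1.0 1.0 -- reproducible
```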
Readers will find that this algorithm has many applications in the field of scientific computing.