For computers dedicated to high-performance computing (HPC) applications, computation is virtually free; data access is the dominant cost. This iconoclastic view is even more valid for graphics processing units (GPUs): thousands of cores operating on rather limited high-speed register files must still fall back on external, host-based memories that are at least a thousand-fold slower. One way to address this shortcoming is to replace the storing of intermediate values, when evaluating instruction blocks, with the recomputation of those same values, thus trading cheap repeated expression evaluations for costly memory spills.
The main idea introduced in this paper is that this rematerialization of values can sometimes be performed more favorably from output values instead of input ones, provided the operators involved are reversible. Hence, after a register assignment such as r2 := r1 + 1, one can use reversible computing to recompute, for instance, the expression 2 × r1 as 2 × (r2 - 1), which frees register r1 for reuse.
The paper illustrates how this concept can improve both instruction scheduling and register allocation in scientific applications, taking as its use case a lattice quantum chromodynamics (LQCD) physics simulation running on an Nvidia GPU. Rematerialization via reversible computing reduces register pressure beyond what traditional techniques achieve, and this proves beneficial in practice: double-precision run-time performance improves by up to 10 percent.
Though still preliminary, this work addresses a key compilation issue for HPC applications and offers new ideas that should interest compiler writers and researchers working on program transformations.