The authors are concerned with the problem of loops over large matrices, in which matrix elements are accessed and modified via nontrivial dependency relationships. That is, a given element may be needed more than once, and/or modified more than once, in a way that may not be immediately obvious from the source code.
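As a hedged illustration (not taken from the paper), consider a loop in which each row of a matrix is updated from the previously updated row; the dependency is real but easy to miss when reading the source:

```python
# Hypothetical example of a loop-carried dependency over a matrix.
# Row i is updated using the already-modified row i-1, so row i cannot
# be computed until row i-1 is finished -- a dependency a compiler (or
# a human reader) must detect before the loop can be parallelized.

def relax(a):
    """In-place row-to-row update; each row depends on the previous one."""
    n, m = len(a), len(a[0])
    for i in range(1, n):
        for j in range(m):
            a[i][j] = a[i][j] + a[i - 1][j]  # reads a row written earlier
    return a
```

Here each element is both read and written across iterations, so the order of the outer loop matters, even though nothing in the loop body advertises that fact.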
The paper introduces a primitive flow analysis under which data-dependent computations are more likely to be bound to the same processor, so that interprocessor communication is reduced. Experimental evidence is offered to demonstrate the value of the approach.
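A minimal sketch of the underlying idea (not the paper's actual analysis): if iteration i depends on iteration i-1, a block distribution of iterations over processors keeps most dependent pairs on one processor, whereas a cyclic distribution splits every pair. The function names below are hypothetical.

```python
# Illustrative sketch, not the paper's algorithm: count how many
# dependencies of the form (i-1 -> i) cross a processor boundary
# under two common iteration-to-processor mappings.

def block_owner(i, n_iters, n_procs):
    """Block distribution: contiguous chunks of iterations per processor."""
    chunk = -(-n_iters // n_procs)  # ceiling division
    return i // chunk

def cyclic_owner(i, n_procs):
    """Cyclic distribution: iteration i goes to processor i mod n_procs."""
    return i % n_procs

def cross_processor_deps(owner, n_iters):
    """Dependencies (i-1 -> i) whose endpoints sit on different processors."""
    return sum(owner(i) != owner(i - 1) for i in range(1, n_iters))
```

For 16 iterations on 4 processors, the block mapping incurs 3 cross-processor dependencies (one per chunk boundary) while the cyclic mapping incurs 15, which is the kind of communication gap the paper's analysis is trying to exploit.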
The algorithm seems to be useful for fairly simple and regular matrix operations (which, of course, are the ones that usually arise). The level of generality is not clear. When writing an iterative algorithm from scratch, it is probably best to avoid local dependencies, even though they may appear more efficient from a sequential viewpoint; compiler optimizations are likely to outweigh any gains from hand-tuned sequential code. For “dusty decks,” however, the introduction of intermediate parallelism in this way would seem to be valuable and worth pursuing.
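The point about avoiding local dependencies can be made concrete with a hedged sketch (hypothetical names, not from the paper): the same smoothing step written two ways, one carrying a dependency between iterations and one with fully independent iterations.

```python
# Two formulations of one smoothing step. The in-place version reads
# the already-updated left neighbour, creating a local dependency that
# is cheap sequentially but hard to parallelize. The out-of-place
# version reads only old values, so its iterations are independent and
# a parallelizing compiler can distribute them freely.

def smooth_in_place(a):
    """Each element uses the already-updated left neighbour (dependent)."""
    for j in range(1, len(a)):
        a[j] = (a[j - 1] + a[j]) / 2.0
    return a

def smooth_out_of_place(a):
    """Each element uses only original values (independent iterations)."""
    return [a[0]] + [(a[j - 1] + a[j]) / 2.0 for j in range(1, len(a))]
```

The two versions compute different (both legitimate) results; the reviewer's point is that choosing the dependency-free formulation up front spares the compiler the analysis the paper describes.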