The authors of this paper have done an excellent job of illustrating the aspects of looping constructs that affect instruction-level parallelism (ILP) during program execution.
The authors classify looping constructs according to how readily they can be modified to support a higher level of ILP when subjected to certain architectural changes (ACs). They then weigh one AC against another in terms of hardware cost and overall performance. This comparison brings out the issues that can emerge when considering a particular AC for a particular implementation.
The strength of this paper lies in the clarity with which the subject is presented, aided by numerous illustrations. The text is thus suitable for a wide readership and would serve as an excellent tutorial on how to modify looping constructs to support higher ILP.