The growing gap between microprocessor and memory performance has made cache prefetching a key research issue in recent years. Vanderwiel and Lilja survey data prefetch mechanisms for both single-processor and multiprocessor architectures. The first part of the work provides the background needed to study cache prefetching; for its completeness and clarity, it is strongly recommended to researchers beginning work in this area. The introduction outlines the ideas underlying prefetching methods, and the authors also point out the drawbacks of an ill-chosen prefetching policy, including cache pollution and unnecessary prefetches.
The second part of the paper describes known prefetch mechanisms that have proved efficient and effective. Single-processor architectures are considered first, and the approaches are classified into software and hardware data prefetching. Software data prefetching relies mainly on the manual or compiler-driven insertion of special instructions that direct the cache to fetch a given datum at a given time. The drawbacks of this approach include the difficulty of prefetch scheduling, increased register pressure, and degraded instruction cache performance (due to the extra fetch instructions). The authors subdivide hardware data prefetching into sequential prefetching (commonly known as one-block lookahead) and its variants, and prefetching with arbitrary strides; a clear and understandable overview of the latter approach is given. Published solutions that combine the advantages of both approaches are also reported. Finally, the additional problems posed by multiprocessor architectures are listed and discussed.
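To make the software approach concrete: a prefetch instruction is inserted ahead of the loads that will need the data, hiding memory latency behind useful work. The sketch below uses the GCC/Clang `__builtin_prefetch` builtin as a stand-in for the special fetch instructions the survey discusses; the prefetch distance of 16 elements is an assumed value that would be tuned per machine, not a figure from the paper.

```c
#include <stddef.h>

/* Sum a large array, issuing a software prefetch a fixed distance
 * ahead of the current access.  __builtin_prefetch(addr, rw, locality)
 * is a GCC/Clang builtin: rw = 0 marks a read, locality = 1 asks for
 * low temporal locality.  The distance of 16 elements is illustrative. */
long sum_with_prefetch(const long *a, size_t n)
{
    const size_t dist = 16;              /* elements fetched ahead */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0, 1);
        sum += a[i];                     /* hopefully a cache hit */
    }
    return sum;
}
```

The extra branch and builtin call per iteration illustrate the overheads the review mentions: the fetch instructions themselves occupy issue slots and instruction cache space, and choosing `dist` well is exactly the scheduling problem cited as a drawback.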
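On the hardware side, stride-based schemes typically track, per load instruction, the last address seen and the last stride, and prefetch once the same stride repeats. The minimal sketch below models a single table entry; the struct fields and two-in-a-row confidence rule are illustrative assumptions, not the exact scheme from any paper the survey covers.

```c
#include <stdint.h>

/* One row of a reference-prediction-style table for stride prefetching. */
struct rpt_entry {
    uint64_t last_addr;   /* last address issued by this load */
    int64_t  stride;      /* last observed stride */
    int      confident;   /* same stride seen twice in a row? */
};

/* Record one access for this entry.  Returns the address to prefetch
 * (addr + stride) once the stride has repeated, or 0 for no prefetch. */
uint64_t rpt_access(struct rpt_entry *e, uint64_t addr)
{
    int64_t stride = (int64_t)(addr - e->last_addr);
    e->confident = (stride == e->stride);
    e->stride    = stride;
    e->last_addr = addr;
    return e->confident ? addr + (uint64_t)stride : 0;
}
```

The confidence check is what lets such hardware handle arbitrary strides while suppressing prefetches on irregular access patterns, avoiding the cache pollution and unnecessary prefetches noted in the introduction.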
This paper is useful, well structured, and clear. It covers most of the topics in cache prefetching and gives helpful suggestions about what to prefetch into the cache, and when and where to prefetch it. Nevertheless, some recent research results are not included in this survey, mainly because of their later publication dates. The advantage of choosing among prefetching methods according to the locality of memory references [1, 2], and the improvements obtained for data-intensive applications (such as multimedia applications) with specialized prefetching methods [3, 4], are not mentioned and were probably outside the scope of this work. Finally, the paper does not discuss suitable metrics for evaluating cache performance, either in terms of cache misses removed or in terms of time saved.