The already high number of transistors that can be etched onto a silicon wafer continues to grow, enabling computer architects to lay out more electronic circuitry, and, therefore, more functional units, on one computer chip. This has led to the emergence of a variety of new computer architectures, such as the superscalar architecture, very long instruction word (VLIW) architectures, multiple-context processors, and multiple processors on a single chip. A variation of the multithreaded processor architecture, presented in this paper, shares the heterogeneous functional unit set among several instruction pipelines of different threads. This contrasts with the more centralized scheme in which each instruction pipeline is served by a set of dedicated functional units.
This paper begins with an introduction to the proposed Concurro architecture, a 64-bit, register-oriented, load/store instruction set processor with additional support for multithread control and fine-grained synchronization. After describing an execution-driven simulator and six benchmark programs, Gunther concentrates on the performance of the Concurro processor and on comparisons with other multiprocessor architectures in terms of instruction cache organization (single versus private), dispatch strategies (superscalar versus out-of-order), and hardware utilization. The paper shows the benefits of maintaining a balance between the performance gain of multithreaded multiprocessors and the relatively high implementation and programming costs of such an architecture. This is the main motivation behind distributing functional units among multiple instruction streams.
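The trade-off between dedicated and shared functional units can be illustrated with a toy model. The sketch below is not Gunther's simulator and does not model the Concurro pipeline. The function names, the one-instruction-per-thread-per-cycle issue rule, the round-robin arbitration, and the unit types are all assumptions made for illustration.

```python
from collections import Counter

def run_dedicated(streams, unit_types):
    """Hypothetical model: each thread owns one unit of every type.

    There is no contention, so a thread issues one instruction per cycle,
    but most of its private units sit idle in any given cycle.
    Returns (cycles, fraction of unit-cycles that did useful work).
    """
    cycles = max(len(s) for s in streams)
    total_slots = cycles * len(streams) * len(unit_types)
    busy = sum(len(s) for s in streams)
    return cycles, busy / total_slots

def run_shared(streams, pool):
    """Hypothetical model: all threads compete for one shared pool of units.

    Each cycle, every thread tries to issue its next instruction in order.
    A thread stalls when no unit of the required type is free.
    Returns (cycles, fraction of unit-cycles that did useful work).
    """
    pending = [list(s) for s in streams]
    cycles = busy = 0
    while any(pending):
        free = Counter(pool)  # every unit is available again each cycle
        n = len(pending)
        for i in range(n):
            # Rotate the starting thread each cycle (assumed round-robin).
            thread = pending[(cycles + i) % n]
            if thread and free[thread[0]] > 0:
                free[thread[0]] -= 1
                thread.pop(0)
                busy += 1
        cycles += 1
    return cycles, busy / (cycles * sum(pool.values()))

if __name__ == "__main__":
    streams = [["alu", "fpu", "mem", "alu"], ["alu", "alu", "mem", "fpu"]]
    print(run_dedicated(streams, ["alu", "fpu", "mem"]))
    print(run_shared(streams, {"alu": 2, "fpu": 1, "mem": 1}))
```

With these two made-up instruction streams, the dedicated scheme finishes in 4 cycles using 6 units at one-third utilization, while the shared pool of 4 units takes 5 cycles at 40 percent utilization. The numbers only show the direction of the trade-off: fewer units and higher utilization, at the cost of occasional stalls.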