The WaveScalar architecture aims to exploit the advantages of dataflow to achieve more parallelism with less silicon area, along with better performance and scalability than superscalar architectures. The program counter, which is a bottleneck in the von Neumann architecture, is eliminated, and the natural parallelism of dataflow is combined with fine-grain multithreading to achieve a linear speedup when used with several processing elements. Instructions are ordered according to the path and stored in a WaveCache, which is designed for the WaveScalar architecture in a tiled fashion. In this architecture, deep speculation arises naturally. To maintain cache coherence, a directory-based MESI protocol is used.
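To make the coherence scheme concrete, here is a minimal sketch of directory-based MESI state tracking. It is not taken from the paper. The names `State`, `Directory`, `read`, and `write` are hypothetical, and the model ignores data movement, write-backs, and message latency, keeping only the per-cache state of each line.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Directory:
    """Hypothetical directory: tracks the MESI state of each line in each cache."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.lines = {}  # address -> list with one State per cache

    def _entry(self, addr):
        return self.lines.setdefault(addr, [State.I] * self.num_caches)

    def read(self, cache, addr):
        entry = self._entry(addr)
        if entry[cache] is not State.I:
            return entry[cache]  # read hit: no state change
        holders = [i for i, s in enumerate(entry) if s is not State.I]
        if not holders:
            entry[cache] = State.E  # sole copy in the system
        else:
            for i in holders:
                entry[i] = State.S  # an M or E owner is downgraded to Shared
            entry[cache] = State.S
        return entry[cache]

    def write(self, cache, addr):
        entry = self._entry(addr)
        for i in range(self.num_caches):
            if i != cache:
                entry[i] = State.I  # invalidate every other copy
        entry[cache] = State.M
        return entry[cache]


if __name__ == "__main__":
    d = Directory(num_caches=3)
    print(d.read(0, 0x40))   # State.E
    print(d.read(1, 0x40))   # State.S
    print(d.write(1, 0x40))  # State.M
```

The point of the directory is that invalidations go only to caches recorded as holding the line, instead of being broadcast to every tile.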
One novel idea in this architecture is that the WaveCache loads instructions from memory and assigns them to processing elements, rather than having the processing elements request the instructions they are to process. In addition, the WaveScalar architecture does not use a register file for intervening accesses, but stores the required instructions directly.
In previous dataflow architectures, even the dynamic ones, operand matching was a bottleneck that reduced performance. For this architecture, the authors state that the match-and-dispatch logic is the most complex. What impact will this have on the WaveScalar architecture? Will it be a bottleneck, as it was in the past? These questions need to be addressed clearly.
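To illustrate why operand matching is costly, here is a minimal sketch of tagged-token matching as used in dynamic dataflow machines generally. It does not model WaveScalar's actual match-and-dispatch hardware. The names `MatchingStore` and `deliver` are hypothetical, and the tag is assumed to identify the dynamic instance of an instruction, for example a loop iteration.

```python
from collections import defaultdict


class MatchingStore:
    """Hypothetical store of partially matched operands, keyed by (instruction, tag)."""

    def __init__(self, arity):
        self.arity = arity                 # instruction id -> number of operands
        self.waiting = defaultdict(dict)   # (instr, tag) -> {port: value}

    def deliver(self, instr, tag, port, value):
        """Record one token; return the operand list if the instruction can fire."""
        key = (instr, tag)
        slots = self.waiting[key]
        slots[port] = value
        if len(slots) < self.arity[instr]:
            return None                    # still waiting for a partner operand
        del self.waiting[key]              # matched: free the entry and fire
        return [slots[p] for p in range(self.arity[instr])]


if __name__ == "__main__":
    store = MatchingStore({"add": 2})
    print(store.deliver("add", tag=0, port=0, value=3))   # None
    print(store.deliver("add", tag=1, port=0, value=10))  # None (different tag)
    print(store.deliver("add", tag=0, port=1, value=4))   # [3, 4]
```

Every arriving token requires an associative lookup on the (instruction, tag) pair. That lookup is the operation that limited earlier dataflow machines, which is why the question above matters.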
The WaveCache for this architecture is designed rather well. The compiler for this architecture is yet to be finished; it is rather tedious work. Will there still be a linear speedup with compiler-generated code for this particular architecture? This is yet to be evaluated, whereas a hand-coded version would naturally be optimized. Is it possible to compare the performance of the WaveScalar architecture with TRIPS, which takes a similar approach with slight variations? There is also the stream processor, built from clusters of processing elements with local and global registers, which tries to take advantage of the von Neumann model without incorporating dataflow concepts. Is it possible to compare the performance of the WaveScalar architecture with that of a stream processor? Since WaveScalar does not include any register file for intervening accesses, will this have a positive or negative impact on the overall performance of the architecture?
The authors try to bring dataflow concepts closer to the von Neumann model, with less wire delay and less silicon, in a design that can outperform superscalar architectures in both single-threaded and multithreaded operation. The paper is rather long; the basic concepts of the von Neumann model and dataflow could have been omitted, since anyone reading this paper should already know the basics of both.