This research presents a vector processor architecture, written in VHSIC hardware description language (VHDL), that is designed specifically to exploit data-level parallelism by giving each vector lane its own private memory, which avoids stalls during memory access instructions. Several vector processors have been developed to date, and some target specific applications, such as matrix multiplication.
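As a rough illustration of the lane-private memory idea, the following Python sketch models each lane as owning a separate memory and operating only on the elements stored there. The lane count, the interleaved element-to-lane mapping, and all function names are assumptions made for illustration; they are not taken from the VHDL design.

```python
# Hypothetical software model of lanes with private memories.
# NUM_LANES and the interleaved mapping (element i -> lane i % NUM_LANES)
# are illustrative assumptions, not details of the actual design.
NUM_LANES = 4


def distribute(vector, num_lanes=NUM_LANES):
    """Place element i in the private memory of lane i % num_lanes."""
    memories = [[] for _ in range(num_lanes)]
    for i, value in enumerate(vector):
        memories[i % num_lanes].append(value)
    return memories


def lane_add(mem_a, mem_b):
    """Each lane adds the operands held in its own memory only."""
    return [[a + b for a, b in zip(lane_a, lane_b)]
            for lane_a, lane_b in zip(mem_a, mem_b)]


def gather(memories):
    """Reassemble the full vector from the per-lane memories."""
    num_lanes = len(memories)
    total = sum(len(m) for m in memories)
    return [memories[i % num_lanes][i // num_lanes] for i in range(total)]


a = list(range(8))
b = [10 * x for x in range(8)]
result = gather(lane_add(distribute(a), distribute(b)))
```

Because every lane reads and writes only its own list, no two lanes ever compete for the same memory in this model, which is the property the architecture relies on to avoid stalls.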
This work targets applications such as the fast Fourier transform (FFT), which requires data shuffling. The architecture addresses the slow memory accesses of vector processors, which prior research on vector processors has not addressed. Data shuffle instructions are supported by a shuffle engine in each lane, which is placed after the lane’s local memory, connected to the common bus.
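To make the data shuffling requirement concrete, the sketch below shows the bit-reversal reordering used by radix-2 FFT implementations. The text does not say which shuffle patterns the shuffle engine supports, so bit reversal is chosen here only as a familiar example, and the function names are hypothetical.

```python
def bit_reverse(index, bits):
    """Return index with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversal_shuffle(data):
    """Reorder data so that output[i] = data[bit_reverse(i)].

    This is the input (or output) permutation of a radix-2 FFT;
    the length must be a power of two.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    return [data[bit_reverse(i, bits)] for i in range(n)]


shuffled = bit_reversal_shuffle(list(range(8)))
# shuffled == [0, 4, 2, 6, 1, 5, 3, 7]
```

In a multi-lane design, a permutation like this moves elements between lanes, which is why a dedicated shuffle path is needed in addition to the private memories.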
As a result, memory contention is common in those earlier architectures. This type of architecture is inefficient because it consumes more area and energy than multicore architectures. To evaluate the new architecture, benchmarks such as matrix multiplication, FFT, and finite impulse response (FIR) filtering were chosen. Speedups in execution time are reported for each benchmark. The vector processor achieved a maximum speedup of about 1,500-fold on the RGB2YIQ benchmark.
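For readers unfamiliar with the RGB2YIQ benchmark, the following Python sketch shows the kind of per-pixel computation it involves. The coefficients are the commonly cited NTSC values and are an assumption here, since the text does not give the exact matrix used in the evaluation. Each pixel is converted independently of every other pixel, which is what makes the kernel a natural fit for data-level parallelism.

```python
# Commonly cited NTSC RGB-to-YIQ coefficients (assumed; the exact
# values used in the benchmark are not stated in the text).
RGB_TO_YIQ = (
    (0.299, 0.587, 0.114),    # Y row
    (0.596, -0.274, -0.322),  # I row
    (0.211, -0.523, 0.312),   # Q row
)


def rgb_to_yiq(pixel):
    """Convert one (r, g, b) pixel to (y, i, q) with a 3x3 matrix product."""
    r, g, b = pixel
    return tuple(cr * r + cg * g + cb * b for cr, cg, cb in RGB_TO_YIQ)


def convert_image(pixels):
    """Convert every pixel; no pixel depends on any other."""
    return [rgb_to_yiq(p) for p in pixels]


image = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
yiq = convert_image(image)
```

With these coefficients a white pixel maps to a luma of 1 with both chroma components near zero, a quick sanity check on the matrix.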
Design and scalability issues in vector processors can be addressed by maximizing performance. The main bottleneck is the low instruction issue rate, which affects all vector processors despite the vector instructions in the architecture. Vector processors that have been researched extensively to maximize performance for a particular application may be feasible.
However, will they become viable as general-purpose architectures? Perhaps not, because the total area and power consumption of such processors could be high. Such variable vector processors may become mainstream in the future, but this is questionable given their inherent complexity, area consumption, and similar factors.
Continuing research on vector processors must consider future feasibility and mainstream usage. Even though this architecture achieves a speedup of about 1,500-fold on a particular benchmark, is it suitable as a mainstream architecture?