Parallel computation is everywhere, from small portable devices to laptops, desktops, and data centers. This wealth of available parallelism challenges computer scientists and developers alike to rethink algorithms and rewrite legacy sequential code. However, we still have a long way to go. This paper offers a deep analysis of a different way to address the challenge: the authors investigate the performance limits of dynamic binary parallelization (DBP), a technique for transparently parallelizing single-threaded binary executables.
The success of instruction-level parallelism on single-core architectures with multiple arithmetic logic units (ALUs) motivates automatic parallelization at a higher level, across the cores of a multi-core architecture; DBP is such a technique. The authors first analyze the proposed architecture, explaining how sections of binary code are identified for parallel execution, and then present their experiments and extensive results.
The key idea of DBP is to reduce the number of critical-path instructions by overlapping the execution of segments of the instruction stream. The overlapping is done by speculative cores launched by a master core, which remains responsible for the correct execution of the original binary. However, this strategy may degrade performance if many speculative threads are invalidated, mainly due to data dependences between threads.
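The master/speculative-core interplay can be illustrated with a toy model (a hypothetical sketch, not the paper's implementation): speculative workers precompute later segments from *predicted* live-in values, and the master commits a speculative result only if the prediction matches the actual value; otherwise the work is squashed and re-executed sequentially, modeling a data-dependence violation.

```python
# Toy model of DBP-style speculation (hypothetical sketch, not the
# paper's mechanism): speculative cores run segments ahead of time
# from predicted inputs; the master validates and commits in order.
from concurrent.futures import ThreadPoolExecutor

def segment(x):
    """Stand-in for one segment of the original instruction stream."""
    return x * 2 + 1

def run_speculative(initial, n_segments, predict):
    """Execute n_segments; 'predict' guesses each segment's live-in value."""
    stats = {"committed": 0, "squashed": 0}
    with ThreadPoolExecutor() as pool:
        # Speculative cores start immediately from predicted live-ins.
        futures = [pool.submit(segment, predict(i)) for i in range(n_segments)]
        value = initial
        for i, fut in enumerate(futures):
            if predict(i) == value:      # prediction held: commit result
                value = fut.result()
                stats["committed"] += 1
            else:                        # dependence violated: squash,
                value = segment(value)   # master re-executes sequentially
                stats["squashed"] += 1
    return value, stats
```

Note that the final value is identical whether speculation succeeds or fails; a perfect predictor only commits more work in parallel, while a poor one pays the re-execution cost, which is exactly the performance risk the paper describes.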
The experiments cover various applications from a benchmark suite. The results for an eight-core reduced instruction set computing (RISC) processor, compared with single-core execution, show an interesting 54 percent reduction in the number of critical-path instructions. Yet this is mismatched with an average speedup of only 1.43 times, still far from the ideal performance gain, which suggests that the main bottleneck is bandwidth rather than instruction processing.
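The mismatch can be made concrete with a back-of-the-envelope calculation, under the simplifying assumption (mine, not the paper's) that execution time is proportional to critical-path length:

```python
# If time scaled with critical-path instruction count (an assumption
# for illustration only), a 54% reduction would bound the speedup at:
reduction = 0.54                      # 54% fewer critical-path instructions
ideal_speedup = 1 / (1 - reduction)   # ~2.17x under this simple model
observed_speedup = 1.43               # average speedup reported in the paper
gap = ideal_speedup / observed_speedup
print(f"ideal ~{ideal_speedup:.2f}x vs observed {observed_speedup:.2f}x")
```

The roughly 2.17x model prediction versus the observed 1.43x leaves about a 1.5x gap, which is consistent with the reviewer's point that overheads beyond instruction count, such as bandwidth, dominate.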