Gueron introduces an algorithm that speeds up the computation of CRC32C cyclic redundancy checks over a buffer of arbitrary length. It relies on Intel's CRC32 central processing unit (CPU) instruction, which accepts operands of 8, 16, 32, or 64 bits.
The paper is a perfect illustration of classical pipelining techniques and throughput-oriented speedup, which are still pertinent today. Figure 3 is a classic. These techniques are taught to undergraduate students in computer science, software engineering, and computer engineering during their system hardware, system software, and assembly programming courses. Thus, the paper has considerable educational value: it shows classical techniques applied in actual industrial practice. I strongly believe the content of this paper should become part of a textbook on assembly and computer organization.
The paper discusses how to speed up CRC32C computations by a factor of at least three with the Intel instruction, a significant improvement over both the sequential approach and previous results. A speed increase like this can be applied to remote direct memory access (RDMA) and the Internet small computer system interface (iSCSI) protocol stack, where CRC32C computation can be a bottleneck when performing integrity checks on transferred data.
The speedup comes from how the hardware CRC32 instruction (capable of processing up to eight bytes with a latency of three CPU cycles) is issued. Instead of running a single serial chain of dependent instructions, the author logically splits the input data buffer into three disjoint chunks, processes them with CRC32 in a pipelined manner, and then recombines the results. In a serial chain, each instruction must wait three cycles for the previous result, so the computation is latency bound. Three independent chains keep the CPU’s pipelined hardware fully utilized, which makes the computation throughput oriented instead.
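To make the split-and-recombine idea concrete, here is a minimal software sketch in Python. It is my own illustration, not code from the paper: the function names (`crc32c_raw`, `shift_zeros`, `crc32c_three_way`) are hypothetical, the CRC32 instruction is modeled by a bitwise loop, and the recombination step simply feeds zero bytes through the register, an assumption chosen for clarity. A fast implementation would replace that step with precomputed constants or tables. Python also processes the three chunks one after another, so the sketch shows only why the three partial results can be computed independently, not the speedup itself.

```python
POLY = 0x82F63B78  # CRC32C (Castagnoli) polynomial in bit-reflected form


def crc32c_raw(state: int, data: bytes) -> int:
    """Software model of the raw register update (no initial/final inversion)."""
    for byte in data:
        state ^= byte
        for _ in range(8):
            state = (state >> 1) ^ (POLY if state & 1 else 0)
    return state


def crc32c(data: bytes) -> int:
    """Reference serial CRC32C: one dependent chain over the whole buffer."""
    return crc32c_raw(0xFFFFFFFF, data) ^ 0xFFFFFFFF


def shift_zeros(state: int, nbytes: int) -> int:
    """Advance a partial CRC past nbytes of zeros (slow stand-in for a
    multiplication by a precomputed constant)."""
    return crc32c_raw(state, bytes(nbytes))


def crc32c_three_way(data: bytes) -> int:
    """Split the buffer into three disjoint chunks, compute each partial CRC
    independently, then recombine using the linearity of the CRC update."""
    n = len(data) // 3
    a, b, c = data[:n], data[n:2 * n], data[2 * n:]

    # Three independent chains: on hardware these could be interleaved so
    # that no CRC32 instruction waits on the result of another chain.
    crc_a = crc32c_raw(0xFFFFFFFF, a)  # only the first chunk carries the initial value
    crc_b = crc32c_raw(0, b)
    crc_c = crc32c_raw(0, c)

    # Recombine: move each partial result to its position in the full buffer.
    combined = (shift_zeros(crc_a, len(b) + len(c))
                ^ shift_zeros(crc_b, len(c))
                ^ crc_c)
    return combined ^ 0xFFFFFFFF


# The three-way result agrees with the serial one on the standard check string.
assert crc32c_three_way(b"123456789") == crc32c(b"123456789")
```

The recombination works because the register update is linear over GF(2): processing a chunk from a given starting state equals the XOR of processing zeros from that state and processing the chunk from a zero state.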
The author provides very good, easy-to-read background information, bit arithmetic, and theoretical foundations, as well as a detailed illustration of the instruction and algorithms. The paper concludes with performance results.