When a central processing unit (CPU) is augmented by a graphics processing unit (GPU), compute-intensive functions can be offloaded to the GPU. While a CPU typically contains a few cores, current-generation GPUs contain over 2,000 cores that can operate in parallel. GPUs have been widely applied to accelerate computations in engineering, genetics, and many other disciplines, but have not been applied extensively to database applications. This paper documents a successful open-source application called Mega-KV (http://kay21s.github.io/megakv/) that uses commodity personal computers (PCs) and GPUs to accelerate a very important application: in-memory key-value (IMKV) stores. To evaluate their results, the authors compare Mega-KV with the open-source IMKV store MICA [1], the CPU-based IMKV store with the highest documented throughput. Mega-KV, running on two off-the-shelf CPUs and GPUs, was 1.4 to 2.8 times as fast as CPU-based MICA.
The major challenges of using GPUs in this case are (1) limited GPU memory and slow transfers between the CPU and GPU, and (2) finding a design point that balances transfer size against throughput: larger transfers mean higher latency, while smaller transfers mean lower utilization and lower throughput. Rather than directly porting [that is, re-coding in compute unified device architecture (CUDA)] a known IMKV store such as memcached to a GPU, the authors developed a custom optimized solution that carefully considers the capabilities of GPU architectures. They studied candidate techniques separately in multiple testbeds and chose design points (such as transfer size) before combining the techniques into an overall approach. Their study identified two main issues with previous IMKV approaches: the poor match of index operations to GPU architectures, and the unpredictability of operation scheduling that does not distinguish among different types of operations. The techniques they employed to overcome these issues include the use of cuckoo hashing, selecting the best number of threads in each of the multiple processing units they defined within the GPU, and careful scheduling of batches so that, as desired, GETs execute faster than SETs.
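To see why cuckoo hashing suits GPU index lookups, consider that it gives every key exactly two candidate slots, so a GET needs at most two probes; a fixed, branch-light access pattern that maps well onto threads executing in lockstep. The following is a minimal CPU-side sketch of the idea, not the authors' implementation: the class name, table sizes, and the `max_kicks` bound are all illustrative assumptions.

```python
class CuckooHash:
    """Illustrative two-table cuckoo hash (not Mega-KV's actual code).

    The GPU-relevant property: get() performs at most two probes,
    regardless of load factor, so lookup cost is uniform across keys.
    """

    def __init__(self, size=8, max_kicks=32):
        self.size = size
        self.max_kicks = max_kicks      # bound on displacement chains (assumed)
        self.t1 = [None] * size         # table probed by h1
        self.t2 = [None] * size         # table probed by h2

    def _h1(self, key):
        return hash(key) % self.size

    def _h2(self, key):
        return (hash(key) // self.size) % self.size

    def get(self, key):
        # Exactly two candidate slots -> at most two probes.
        for table, h in ((self.t1, self._h1), (self.t2, self._h2)):
            slot = table[h(key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def set(self, key, value):
        # Overwrite in place if the key is already present.
        for table, h in ((self.t1, self._h1), (self.t2, self._h2)):
            i = h(key)
            if table[i] is not None and table[i][0] == key:
                table[i] = (key, value)
                return True
        # Otherwise insert, evicting ("kicking") occupants between tables.
        entry = (key, value)
        for _ in range(self.max_kicks):
            for table, h in ((self.t1, self._h1), (self.t2, self._h2)):
                i = h(entry[0])
                if table[i] is None:
                    table[i] = entry
                    return True
                table[i], entry = entry, table[i]   # evict and re-place
        return False  # a real store would rehash/grow here
```

The bounded probe count is what matters for the GPU: unlike chained hashing, where one thread in a warp may follow a long chain while its neighbors idle, every lookup here touches at most two known locations.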
The paper documents the authors’ design techniques well and is of value to anyone wanting to create custom nongraphical CUDA software for a GPU.