Memory access remains pivotal for extracting maximum performance from increasingly powerful parallel computers. Parallel memory resources are typically divided into shared and global memory spaces, with tradeoffs between access time and bandwidth similar to those in cache hierarchies. Both spaces are common in today's graphics processing units (GPUs), and algorithm designers must make careful programming choices to optimize performance. This paper presents two memory models, the discrete memory machine (DMM) and the unified memory machine (UMM), to capture the shared and global memory spaces of GPUs, respectively.
The author models contiguous and strided access on the DMM and UMM, whose memory is partitioned into banks. Based on these principles, Nakano describes implementations of matrix transpose algorithms for the proposed models and evaluates their performance, taking into account parameters such as the number of processors and memory access latency.
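The cost difference between contiguous and strided access that such bank-partitioned models capture can be sketched as follows. This is a toy illustration, not Nakano's formal definition: the bank count `w`, the `address % w` bank mapping, and the cost rule (a round of simultaneous accesses takes as long as the most-loaded bank) are all assumptions made here for exposition.

```python
from collections import Counter

def round_cost(addresses, w):
    """Toy cost of one round of simultaneous accesses on a machine with
    w memory banks, where address a resides in bank a % w: accesses to
    the same bank serialize, so the round costs as many time units as
    the number of accesses hitting the most-loaded bank."""
    counts = Counter(a % w for a in addresses)
    return max(counts.values())

w = 4
contiguous = [0, 1, 2, 3]    # one access per bank
strided = [0, 4, 8, 12]      # stride w: every access lands in bank 0

print(round_cost(contiguous, w))  # 1: conflict-free
print(round_cost(strided, w))     # 4: fully serialized
```

Under this toy rule, contiguous access is conflict-free while a stride equal to the bank count serializes completely, which is the kind of penalty a matrix transpose must be arranged to avoid.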
In essence, these two models are akin to abstract shared memory models for nonuniform memory access (NUMA) and uniform memory access (UMA) architectures, and a section on the analogies between them would have benefited readers. Overall, these simple yet effective models could prove useful for designing and comparing algorithms at a more structured and abstract level.