Big data analyses uncover associations in large datasets for probing business trends, preventing disease, connecting legal citations, fighting crime, and determining real-time highway traffic flows. Unfortunately, most statistical and visualization packages have difficulty processing big datasets, which require massively parallel software running on numerous servers. How should large and complex collections of datasets be captured, stored, searched, shared, analyzed, and visualized?
In this paper, the authors concisely present the current and future locality problems of big data, and recommend concepts for extending the theory of locality to applications in a variety of big data domains. A footprint is the number of distinct data elements accessed by a processor within a window of execution. Computing footprints is itself a big data problem, because counting the distinct data elements in every window of a long trace is time consuming.
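To make the cost concrete, here is a minimal brute-force sketch (not the authors' method) of measuring the footprint of each fixed-length window of an access trace; the trace and window length are hypothetical examples:

```python
def footprint(trace, start, length):
    """Number of distinct data elements in trace[start:start+length]."""
    return len(set(trace[start:start + length]))

# Example trace of data addresses. Measuring every length-4 window this
# way costs O(n * w) time, which illustrates why exact footprint
# analysis becomes expensive on long traces.
trace = ['a', 'b', 'a', 'c', 'b', 'a', 'd', 'a']
windows = [footprint(trace, i, 4) for i in range(len(trace) - 3)]
# windows is [3, 3, 3, 4, 3]
```

Each entry reports how many distinct elements the corresponding window touches, which is exactly the quantity whose exhaustive computation the authors identify as costly.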
The authors observe that the approximate and sampling methods proposed in the literature for the footprint problem offer no guaranteed precision. They present three locality metrics: footprint, reuse distance, and miss rate. The reuse distance of a memory access is the number of distinct data elements used since the previous access to the same datum. The miss rate of a cache of a given size is the fraction of accesses whose reuse distance exceeds that size.
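These two metrics can be sketched directly from their definitions; the following is a simple illustration (assuming a fully associative LRU cache and counting the reused datum itself in the distance), not the authors' algorithm:

```python
import math

def reuse_distances(trace):
    """Reuse distance of each access: the number of distinct elements
    used since the previous access to the same datum (infinite on the
    first use)."""
    last_seen = {}
    dists = []
    for i, x in enumerate(trace):
        if x in last_seen:
            dists.append(len(set(trace[last_seen[x] + 1:i + 1])))
        else:
            dists.append(math.inf)
        last_seen[x] = i
    return dists

def miss_rate(trace, cache_size):
    """Fraction of accesses whose reuse distance exceeds the cache size:
    those accesses miss in a fully associative LRU cache."""
    dists = reuse_distances(trace)
    return sum(1 for d in dists if d > cache_size) / len(dists)
```

For example, `reuse_distances(['a', 'b', 'a'])` yields `[inf, inf, 2]`, and any access with distance greater than the cache size counts as a miss.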
The authors graphically illustrate the average footprint function, the miss rate curve, and the reuse distance profile. Hardware designers and programmers could use these graphs to measure and improve the cache behavior of specific programs. The concepts of actively shared data and the footprint sharing ratio discussed in this paper would be valuable for detecting false sharing, designing shared caches, and tuning multithreaded programs.
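The average footprint function underlying the first of these graphs can be sketched by brute force (a hypothetical illustration, not the authors' efficient method): fp(w) is the mean number of distinct elements over all windows of length w, and plotting fp(w) against w yields the average footprint curve.

```python
def avg_footprint(trace, w):
    """Mean number of distinct data elements over all length-w windows
    of the trace; one point of the average footprint curve fp(w)."""
    counts = [len(set(trace[i:i + w])) for i in range(len(trace) - w + 1)]
    return sum(counts) / len(counts)

# Sweeping w from 1 to len(trace) traces out the full curve.
```

For instance, `avg_footprint(['a', 'a', 'b'], 2)` averages the footprints of the windows `aa` (1 distinct element) and `ab` (2), giving 1.5.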