One of the dominant characteristics of the current computing ecology, the abundance of information in digital form, is a two-edged sword. On the one hand, it is increasingly likely that any information of interest exists somewhere online. On the other, the very mass of data being collected poses an increasing obstacle to searching, transmitting, processing, and exploiting it. This volume offers a concise handbook for using summaries of massive data to overcome this obstacle.
Chapter 1, the introduction, outlines the magnitude of the problem, surveying several domains that are generating such data. It then discusses a variety of data models that support summarization, including sets (recording which items are present or absent), multisets (capturing the multiplicity of each item), matrices, ordered data, geometric data, and graphs. It reviews the kinds of operations that a summary must support (initialize, update, merge, query, construct, and compress) and different processing models (streaming, parallel, and distributed). Since any summary loses some information, the chapter considers the kinds of guarantees that can be offered, and envisions in some detail a number of applications for data summarization. It concludes with an overview of the mathematical tools used in the subsequent chapters, with special attention to the Markov inequality and its use in the Chebyshev inequality and various forms of the Chernoff bound.
Each of the next six chapters offers several summarization techniques for a different kind of data (set, multiset, and so on). For each technique, the authors discuss the available operations, their complexity and available guarantees, history of development, and accessible implementations. Chapter 8 treats the special challenges posed by distributed information, chapter 9 considers some particular uses of summaries, and chapter 10 offers lower bounds supported by some summaries. Throughout, the book emphasizes high-level description of the methods with pseudocode, supported by clearly marked more formal discussion.
The schematic structure of the book, clearly explained in the introductory chapter, enables the reader to find the required information very quickly. A centralized bibliography lists 243 sources extending to 2019, and there is a short index.
This clear, easy-to-navigate volume will be of primary interest to software developers who need a quick guide to available techniques. However, the formal discussions, while marked “Further Discussion” so as not to distract the practitioner, provide additional depth for readers who want to go deeper, making the book useful to the student as well. The book is an excellent example of a focused tool addressing a specific but very pressing problem, and will become an essential handbook as our digital world becomes more and more congested.