
What attracted me to this book was its title. With the massive growth of data that spans many different types, including multimedia, and runs into petabytes and beyond in volume, the effective management of data is now a major concern. Data engineering should span all of the following dimensions: acquisition, cleaning, organization, analysis, retrieval, and visualization. A book on data engineering that consolidates knowledge on these dimensions, from various disciplines, would be very welcome indeed. Unfortunately, this book was disappointing from almost the very first chapter.
Despite the book's attractive title, its chapters are inspired by work taking place at the Acxiom Corporation (or in collaboration with it), as the blurbs and preface mention. Although not every chapter is by an author from this company, the Acxiom Corporation seems to have strongly influenced the selection of topics, leading to an unbalanced perspective on data engineering.
First, a quick look at the contents. As the preface states, the book spans four major areas, starting with data integration and information quality. The first chapter provides a very brief overview of this topic. Chapter 2 looks at entity resolution, which is a major concern in data analysis, particularly for textual data. The next two chapters are on computing transitive closure of data records and semantic data matching. Chapter 5 looks at spelling error correction with edit distance, and chapter 6 is on parallel data generation.
Grid computing, the second theme, begins with chapter 7’s introduction of a grid environment for customer data integration. The next three chapters are on parallel file systems, performance modeling of enterprise grids, and delay characteristics of packet-switched networks.
Chapters 11 to 14 are on data mining, the book’s third major topic. These chapters cover concept-association mining, document structure discovery, a framework for table abstraction, and an information quality framework.
The book’s last main theme is data visualization. Chapters 15 to 18 cover the interactive visualization of high-dimensional datasets, image watermarking, cellular structure visualization, and geospatial intelligence. The book ends with chapter 19, which considers the future of these four areas.
Many of the book’s topics do not have a very strong connection to data engineering, particularly given the many important topics not included and the shallow coverage of the topics that did make it into the book. While the four main areas are quite relevant to data engineering, the individual chapters fail to do the topics justice. The opening chapter is missing a thorough overview of data engineering as a field, to place the rest of the book and associated chapters in context. None of the chapters even attempt to link back to the topic of data engineering. Chapter 7, “A Grid Operating Environment for CDI,” has very little to do with customer data integration (CDI). Similarly, chapter 13, “Designing a Flexible Framework for a Table Abstraction,” is mostly about design patterns and software engineering. All of the chapters are largely disconnected from each other.
Most of the chapters are just a description of some work that has been done in the field. In some cases, the work is of an elementary nature (for example, the spell checker and the data generator). In others, the work is relatively more advanced, but there is no reference to the state of the art in the relevant topics. Generally, the chapters lack conviction. While some of them do try to explain some relevant background, neither the coverage nor the final choice of approach is convincing. Mostly, the chapters go into a lot of detail about the algorithms implemented by the authors and report some results. As a rule, no effort is made to compare the results with other approaches or with the literature, and no rationale is provided as to why that particular approach was taken. Adequate details of the studies, so that the reader can experiment with the algorithm or the approach, are not provided, nor is the software in most cases. In the absence of these, the detailed descriptions seem to serve little purpose.
Each chapter includes some exercises. Most of them are simply recall questions of a very shallow nature, closely tied to the approach described. The intended audience includes academics and researchers.
Given its wide range of topics, many of which are not closely tied to data engineering, and the lack of academic rigor, the book will provide a misleading picture of the field. In conclusion, despite its very good topic, this is a book you can do without.