Computing Reviews

Principles of data integration
Doan A., Halevy A., Ives Z., Morgan Kaufmann Publishers Inc., Waltham, MA, 2012. 520 pp. Type: Book
Date Reviewed: 07/03/13

Dealing with different data sources is a common requirement for many software projects nowadays. Hence, a well-organized and thorough treatment of data integration topics is a welcome addition to the practicing software professional’s bookshelf. If that treatment is both academically rigorous and still readable, as is the case with this book, it becomes a valuable resource for researchers and, in particular, for doctoral students.

A short introductory chapter sets the stage by presenting the issues that make data integration a challenge. It also describes the architecture and main components of prototypical data integration systems. Once the stage is set, the core part (the first half of the book) delves into the foundational techniques of data integration.

The first few chapters are the most similar in style to traditional database systems textbooks. They cover how to describe data sources and manage queries in excruciating levels of technical detail, accompanied by algorithm pseudocode, theorems, and proofs. Plodding through these initial chapters, readers will become knowledgeable about query containment and equivalence, answering queries using views, executable logical query plans, and schema mapping languages (the formal languages that help describe the relationship between a pair of schemata, typically the mediated schema and the schema of one of the data sources).

After establishing the formal framework for data integration systems, the authors present four chapters on the data matching problems that practitioners must face when building actual data integration systems. These chapters address dynamic programming algorithms for string matching, the features of different schema matching systems, and alternative approaches for data matching. They are followed by a short chapter on model management operators (algebraic operators that accept and return schemata and mappings), which provide the foundation for useful tools such as object-relational mappers. Academic rigor, however, should not be equated with dry theoretical discussions. For instance, you will also find practical information on scaling up data matching, a key requirement in actual data integration systems.

The core section ends with a handful of informative chapters on query processing, wrapper construction, and data warehousing. The authors follow a brief recap of database management system query optimization and distributed query processing with alternative designs for adaptive query processing, such as reoptimizing running queries in response to their changing execution environment. The discussion of alternative wrapper construction approaches includes technical details of some noteworthy systems such as RoadRunner and Stalker. Finally, a chapter on data warehousing presents caching and materialized views, which is not surprising given that a data warehouse can be interpreted as the materialization of the mediated schema in a data integration system. This chapter also describes the now ubiquitous MapReduce framework for the parallel analysis of “local, external data,” a subject that does not seem to fit with the rest of this part (although it is indeed important in practice for data scientists).

In some sense, the section on MapReduce serves as a hint to what you will find in the second half of the book: more pragmatic overviews on many different topics that are relevant to the practicing data integrator. This second half of the book covers “integration with extended data representations” and “novel integration architectures.”

Extended data representations include Extensible Markup Language (XML) and ontologies. The chapter on XML amounts to a clearly written tutorial on key XML standards: document type definitions (DTDs) and XML schemata for describing the structure of XML documents on the one hand, and XPath and XQuery for specifying queries on the other hand. The chapter on ontologies provides a nice introduction to description logics and a bird’s-eye view of the standards behind the semantic web. This part of the book also includes a couple of chapters on two desirable features of data integration systems: a cursory look at the representation of uncertainty, which occupies a whole subfield within artificial intelligence (AI) research, and a shallow overview of data provenance, also known as data lineage or data pedigree.

The final chapters of the book turn to novel integration architectures. The authors explore the deep web, the part of the web hidden behind web forms, and the use of keyword search in data integration systems. Decentralized systems are also analyzed, both from a physical point of view in peer data management systems (PDMSs), the database management system (DBMS) counterpart of peer-to-peer (P2P) networks, and from an organizational point of view in collaborative data sharing systems (CDSSs), which resemble Wikipedia viewed from a database perspective.

This encompassing monograph on data integration concludes with a variety of examples, detailed algorithm descriptions, and informative bibliographic notes. Researchers looking for concise and clear descriptions of the state of the art in data integration will benefit from this noteworthy effort. Graduate students in particular will acquire an excellent blueprint of the field, supplemented by almost 600 up-to-date bibliographic references they can use to further their work.

Reviewer: Fernando Berzal
Review #: CR141335 (1309-0783)
