Computing Reviews

Data-intensive workflow management: for clouds and data-intensive and scalable computing environments
de Oliveira D., Liu J., Pacitti E., Morgan & Claypool Publishers, San Rafael, CA, 2019. 180 pp. Type: Book
Date Reviewed: 09/16/20

Data-intensive workflows arise in scientific domains where the latest information technologies are applied. The “differentia specifica” between business and scientific workflows is the importance of provenance data rather than the volume of processed data; nevertheless, the sheer volume of big data plays a significant role, and handling it efficiently and effectively is the central theme of the book.

Chapter 1 outlines the structure of the book and offers motivating examples that clarify the authors’ purpose.

Chapter 2 covers the background needed to follow the rest of the book. It begins with a formal description of workflows and the related standards, and discusses the properties of scientific workflow management systems (SWfMSs). The chapter reviews and compares existing solutions, and outlines the subject matter of later chapters, namely the scheduling problems and cost calculations of SWfMSs.

Chapter 3 discusses SWfMS performance issues and approaches in the case of a single-site cloud. The authors scrutinize scheduling algorithms and their cost models in a general sense, including execution time and financial costs related to cloud computing service providers.

Chapter 4 repeats the previous chapter’s investigation in a more complex environment, namely the execution of SWfMSs in a multi-site cloud. The performance problem can be formulated as follows: any scheduling algorithm and cost calculation model should take into account intra-site and inter-site execution, as well as the communication between the sites. In a multi-site cloud computing environment, the essential issue is the handling and distribution of data and metadata, not only the distribution of the processing load among the sites and among the servers within a site. Activation scheduling in SWfMSs is crucial for reducing execution time. The scheduling problem becomes difficult when it has multiple objectives to consider, for example, total execution time and financial costs; the authors therefore look at how to reduce the execution time of the whole workflow. Efficient scheduling of activities within a workflow requires knowledge of the distribution of data; this information can be obtained from the metadata and should be available to the multi-site activity scheduler. The chapter analyzes scheduling solutions for fine-grained and coarse-grained workflow executions. Fine-grained SWfMSs execute activities at different sites to work with the distributed data. In coarse-grained SWfMSs, the input data is not distributed among different sites; for that reason, the problem resembles single-site scheduling. The chapter contains a comparative study of the performance of the various scheduling algorithms.
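To make the multi-objective trade-off concrete, the following is a minimal sketch of a weighted-sum cost that a multi-site scheduler might use to place one activity. It is not taken from the book: the `Site` fields, the `weighted_cost` formula, the normalization constants, and the example numbers are all assumptions chosen for illustration, and the book's own cost models may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Site:
    """Hypothetical description of one cloud site."""
    name: str
    speed: float             # work units processed per second (assumed)
    price_per_second: float  # currency units per second of compute (assumed)


def transfer_seconds(data_site: str, target: Site,
                     size_gb: float, link_gbps: float) -> float:
    """Time to move the input to the target site; zero if it is already there."""
    if data_site == target.name:
        return 0.0
    return size_gb * 8.0 / link_gbps  # gigabytes -> gigabits, then divide by rate


def weighted_cost(target: Site, work_units: float, data_site: str,
                  size_gb: float, time_weight: float,
                  link_gbps: float = 1.0,
                  ref_seconds: float = 100.0,
                  ref_money: float = 1.0) -> float:
    """Weighted sum of normalized execution time and normalized monetary cost.

    time_weight in [0, 1] sets the balance between the two objectives.
    The reference values only bring both terms to a comparable scale.
    """
    compute = work_units / target.speed
    seconds = compute + transfer_seconds(data_site, target, size_gb, link_gbps)
    money = compute * target.price_per_second  # simplification: transfer is free
    return (time_weight * seconds / ref_seconds
            + (1.0 - time_weight) * money / ref_money)


def pick_site(sites: Sequence[Site], work_units: float, data_site: str,
              size_gb: float, time_weight: float,
              link_gbps: float = 1.0) -> Site:
    """Choose the site with the lowest weighted cost for one activity."""
    return min(sites, key=lambda s: weighted_cost(
        s, work_units, data_site, size_gb, time_weight, link_gbps))


# Two made-up sites: B is twice as fast as A but more expensive per second.
EXAMPLE_SITES = (Site("A", 10.0, 0.02), Site("B", 20.0, 0.05))
```

With the input stored at site A, a 50 GB input keeps the activity at A even when time is weighted heavily, because the transfer dominates; with a 0.5 GB input and `time_weight=0.9`, the faster site B becomes preferable. This is why the location of the data, known from the metadata, has to be visible to the scheduler.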

Chapter 5 deals with data-intensive scalable computing (DISC) frameworks. The chapter discusses in detail the Apache Spark solution and the use of provenance data in this environment. One of the contributions of the chapter is the description of a provenance data server called SAMbA (the acronym is spelled this way to set it apart). SpaCE supports the fine-tuning of the hundreds of configuration parameters of the Spark system. The TARDIS system can be attached to Spark to create optimal schedules for the activities of workflows defined within Spark. The benefit of applying DISC is in-memory data management, which speeds up big data processing.

The book could be of interest to researchers and professionals working in various domains. Those interested in performance within cloud computing environments will find the cost and execution models useful. Researchers who investigate the different problem areas of workflows (modeling, managing, and model checking) can use the book as a good empirical starting point that contains good practices and workflow examples.

Reviewer:  Bálint Molnár Review #: CR147062 (2101-0002)

Reproduction in whole or in part without permission is prohibited.   Copyright 2024 ComputingReviews.com™