The Pegasus workflow management system distinguishes itself by focusing on the automation of very large (for example, millions of steps), purely computational processes that execute in distributed and heterogeneous environments (for example, local servers, high-performance clusters, and grids), typically, but not exclusively, for the scientific community.
A second distinguishing feature is that Pegasus uses two types of process descriptions. The first is an abstract description of the process to be automated, in the form of a directed acyclic graph (DAG) whose nodes correspond to user-supplied programs to be executed or to arbitrarily complex sub-processes, and whose edges determine the logical flow of the computation and which input/output (I/O) files the nodes consume or produce.
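To make the idea of such an abstract description concrete, the following minimal sketch (not Pegasus's actual API; the job names and file names are invented for illustration) models a workflow as a DAG in which edges are implied by shared logical file names, and derives a valid execution order from the data flow:

```python
# Illustrative sketch only, NOT Pegasus's real API: an abstract workflow
# as a DAG whose nodes are logical program invocations and whose edges
# follow from which logical files each node consumes and produces.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each node names the logical files it reads (inputs) and writes (outputs).
jobs = {
    "preprocess": {"inputs": ["raw.dat"],   "outputs": ["clean.dat"]},
    "analyze":    {"inputs": ["clean.dat"], "outputs": ["stats.dat"]},
    "plot":       {"inputs": ["stats.dat"], "outputs": ["figure.png"]},
}

# Derive the edges: job B depends on job A if A produces a file B consumes.
producers = {f: name for name, j in jobs.items() for f in j["outputs"]}
deps = {
    name: {producers[f] for f in j["inputs"] if f in producers}
    for name, j in jobs.items()
}

# Any valid execution order must respect the data-flow edges.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['preprocess', 'analyze', 'plot']
```

The point of the abstraction is visible here: the user states only logical programs and logical files, and the ordering of the computation follows from the data dependencies rather than from any explicit schedule.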
The second description is produced by a separate mapper component, which optimizes the abstract process flow and transforms it into an executable process description that can be enacted on various workflow engines, for example, HTCondor. During this stage, several catalogs are used to resolve the purely logical references to nodes (programs) and I/O files into concrete computing resources and data access facilities (from a simple file system to Amazon's S3 cloud storage).
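The catalog-based resolution can be sketched as a pair of lookup tables, one mapping logical file names to storage locations and one mapping logical programs to site-specific executables. This is an illustration of the idea only; the catalog names, paths, and URLs below are hypothetical, not Pegasus's actual catalog formats:

```python
# Illustrative sketch, NOT Pegasus's real catalogs: resolving logical
# references into concrete resources, as the mapper does during planning.

# Hypothetical replica catalog: logical file name -> concrete storage URL.
replica_catalog = {
    "raw.dat":   "s3://example-bucket/data/raw.dat",
    "clean.dat": "file:///scratch/run42/clean.dat",
}

# Hypothetical transformation catalog: logical program -> executable on a site.
transformation_catalog = {
    "preprocess": {"site": "cluster-a", "path": "/opt/bin/preprocess"},
}

def resolve(job_name, input_files):
    """Turn a logical job and its logical inputs into a concrete invocation."""
    exe = transformation_catalog[job_name]
    urls = [replica_catalog[f] for f in input_files]
    return {"site": exe["site"], "executable": exe["path"], "inputs": urls}

concrete = resolve("preprocess", ["raw.dat"])
print(concrete["executable"])  # /opt/bin/preprocess
print(concrete["inputs"])      # ['s3://example-bucket/data/raw.dat']
```

Keeping these mappings outside the workflow description is what lets the same abstract DAG be planned onto different execution environments without modification.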
This approach ensures that domain experts who are not information technology (IT) experts can formulate a potentially very large computational problem in a simple way. Pegasus itself takes care of optimizing the process and allocating it to actual computing nodes, and it manages and monitors the execution (which may consume millions of wall-clock hours of computation).
The authors explain the breadth of Pegasus to a potentially nontechnical audience. I recommend the paper to any domain expert looking for a system to automate very large, genuinely computational processes (that is, processes requiring no intermediate user input) in complex IT environments.