Cloud computing is becoming popular for workloads that require large processing power and handle massive data volumes. The work reported here studies scientific workflows on Amazon Elastic Compute Cloud (EC2). The principal goal is to analyze the runtime performance, cost, and resource requirements of three real workflow applications from diverse domains.
The workflows are “loosely coupled parallel applications” whose computational tasks are linked by data flow and control flow dependencies. Unlike tightly coupled systems, where tasks communicate over the network, workflow tasks typically communicate through files that are created by the source task and sent to (or shared with) the destination task.
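The file-based coupling described above can be sketched as a small dependency graph. This is only an illustration, not the workflow system used in the paper: the task names, file names, and the helper functions `build_dependencies` and `execution_order` are all hypothetical. A task depends on whichever task writes a file it reads, and a topological sort gives a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks: each one lists the files it reads and the files it writes.
tasks = {
    "project": {"reads": ["raw.dat"], "writes": ["projected.dat"]},
    "diff": {"reads": ["projected.dat"], "writes": ["diff.dat"]},
    "correct": {"reads": ["projected.dat", "diff.dat"], "writes": ["corrected.dat"]},
    "combine": {"reads": ["corrected.dat"], "writes": ["result.dat"]},
}


def build_dependencies(tasks):
    """Map each task to the set of tasks that produce the files it reads."""
    producer = {f: name for name, t in tasks.items() for f in t["writes"]}
    return {
        name: {producer[f] for f in t["reads"] if f in producer}
        for name, t in tasks.items()
    }


def execution_order(tasks):
    """Return one order in which every task runs after its producers."""
    return list(TopologicalSorter(build_dependencies(tasks)).static_order())


order = execution_order(tasks)
print(order)
```

Files with no producer in the graph, such as `raw.dat` here, are treated as external inputs that must already exist before the workflow starts.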
The applications are from astronomy (Montage), seismology (Broadband), and bioinformatics (Epigenome), and typically run hundreds to thousands of tasks requiring many gigabytes (GB) of reads and writes. Such workflows can be deployed fully or partly in the cloud; in this case, the submit host, which manages the workflow and the worker nodes, was outside the cloud, while storage was inside the cloud. With the submit host outside, it is easier to set up and deploy workflows, and log data will not be lost. The external host also serves as a permanent base for access and control, and the cost of data transfer is far lower. However, placing it inside the cloud would improve performance. The execution environment is a virtual cluster of c1.xlarge instances. The findings follow.
Montage showed a performance benefit due to the presence of many small files. Broadband showed the best overall performance, and Epigenome performed better on central processing unit (CPU)-bound jobs but poorly on input/output (I/O) compared to the other two. Across “three different cost categories,” that is, resource, storage, and transfer costs, the relation between performance and cost was mostly nonlinear, with diminishing returns in throughput for each additional increment of cost. However, the general pattern was consistent: more resources meant more cost. The average reduction in performance as the number of nodes increased was attributed to the limited scalability of the applications. Epigenome showed “the least benefit” due to fewer jobs in the workflow, whereas Montage had the greatest benefit due to “having the most jobs and the most traffic between the submit host and the [nodes].” The performance study was largely empirical and did not yield a mathematical relation between performance and cost.
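The three cost categories and the nonlinear trade-off can be illustrated with a rough sketch. Everything in it is an assumption made for illustration: the `RunCost` and `estimate_cost` names, the unit rates, and the runtimes are hypothetical and are neither figures from the paper nor actual Amazon prices.

```python
from dataclasses import dataclass


@dataclass
class RunCost:
    resource: float  # charge for compute nodes
    storage: float   # charge for data kept in the cloud
    transfer: float  # charge for data moved in or out of the cloud

    @property
    def total(self) -> float:
        return self.resource + self.storage + self.transfer


def estimate_cost(nodes, runtime_hours, stored_gb, transferred_gb,
                  node_hour_rate=1.0, storage_rate=0.5, transfer_rate=0.25):
    """Estimate the cost of one run using hypothetical unit rates."""
    return RunCost(
        resource=nodes * runtime_hours * node_hour_rate,
        storage=stored_gb * storage_rate,
        transfer=transferred_gb * transfer_rate,
    )


# Hypothetical runtimes: doubling the nodes does not halve the runtime.
one_node = estimate_cost(nodes=1, runtime_hours=4.0, stored_gb=10, transferred_gb=4)
two_nodes = estimate_cost(nodes=2, runtime_hours=2.5, stored_gb=10, transferred_gb=4)
print(one_node.total, two_nodes.total)
```

In this made-up case the two-node run finishes sooner but consumes more node-hours, so the resource cost rises even though the storage and transfer costs stay the same.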
Beyond being a detailed and exhaustive empirical study, the paper can serve as an informative tutorial introduction to deploying applications on the cloud. It is recommended for graduate and undergraduate students.