To address the problem of performance when processing big data, the authors propose a new programming model, named epiC, that deals with data variety, for example, when data arrives at the evaluation engine as a mix of structured, semistructured, and unstructured data.
Existing approaches such as Apache Hadoop and Pregel focus on specific big data issues: high volumes of text data and large graphs, respectively. When dealing with data variety, the usual approach is to set up a shared-nothing cluster in which one portion is dedicated and tailored to Hadoop, another to Pregel, and so on. Of course, this approach poses several problems, such as how to manage intermediate results and how to handle faults in the cluster's worker nodes.
The authors propose epiC, an actor-based framework that, by leveraging existing implementations of, for example, MapReduce, facilitates big data processing with a high degree of data variety. The actor model (first introduced in a 1973 paper [1]) is a mathematical model of computation composed of distributed components, each capable of a limited set of actions, which avoids the need for locks. The authors benchmarked epiC against other frameworks and showed improved performance.
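To illustrate the paradigm the review describes (not epiC's actual API), here is a minimal actor sketch in Python: each actor owns private state and a mailbox, and its state is touched only by its own message-processing thread, so no locks are needed. The `Actor` and `Counter` classes are hypothetical names introduced for this example.

```python
import queue
import threading

class Actor:
    """Minimal actor: private state, a mailbox, and a message handler.
    Hypothetical sketch of the actor model, not epiC's interface."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, msg):
        # Sending a message is the only way to interact with an actor.
        self.mailbox.put(msg)

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:          # poison pill stops the actor
                break
            self.receive(msg)

    def receive(self, msg):
        raise NotImplementedError

    def stop(self):
        self.mailbox.put(None)
        self._thread.join()

class Counter(Actor):
    def __init__(self):
        self.count = 0               # private state, never shared
        super().__init__()

    def receive(self, msg):
        self.count += msg            # only this actor's thread runs here

counter = Counter()
for _ in range(10):
    counter.send(1)
counter.stop()
print(counter.count)  # 10
```

Because all mutation happens inside the actor's single message loop, concurrent senders never race on `count`, which is the lock-free property the model provides.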
Through careful implementation choices, epiC fills the gaps left open by off-the-shelf implementations, while also providing an elegant fault tolerance mechanism.
The actor model is based on messages. Slave nodes exchange only metadata with master nodes; thus, the overhead of the communication infrastructure is low. Computations and intermediate results are stored in a distributed file system (DFS) through an optimized storage mechanism that avoids input/output (I/O) delays.
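The communication pattern described above can be sketched as follows; this is a hypothetical illustration, with the `dfs` dictionary standing in for the distributed file system and the `worker` function and its message fields invented for the example. Workers write bulk results to shared storage and send the master only small metadata messages.

```python
# Hypothetical sketch: workers store bulk intermediate results in a
# shared store (standing in for the DFS) and send masters only small
# metadata records, keeping master-worker traffic tiny.

dfs = {}  # stands in for the distributed file system

def worker(task_id, records):
    path = f"/dfs/intermediate/{task_id}"
    dfs[path] = records                       # bulk data goes to the DFS
    return {"task": task_id,                  # only metadata travels
            "path": path,                     # to the master node
            "count": len(records)}

# The master coordinates using metadata alone, never the data itself.
master_log = [worker(t, list(range(t * 100))) for t in range(3)]
total = sum(m["count"] for m in master_log)
print(total)  # 300
```

The master can schedule follow-up tasks or detect failures from these metadata records, while the heavy I/O stays between workers and the DFS.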
In conclusion, the paper presents a solution for those who need to tackle the issue of data variety, but not only that: the methodology used in its development is a detailed example for anyone who wants to start developing under the actor model paradigm.