Real-time operational tools for the mining of massive datasets should provide fault tolerance, scalability, and support for complex query processing in relational database management systems. Data mining analytical tools that operate in a clustered environment such as Hadoop can preserve data integrity during a breakdown and support the scalability of query processing in nonrelational database systems. But how should the multifaceted semantics of queries in structured query language (SQL) be accurately translated for optimal processing in nonrelational systems? Amirthalingam and Springer investigate this nontrivial question by exploring the effectiveness of a “correlation-aware SQL-to-MapReduce” translator. They design experiments to recognize and optimize the relationships among multiple SQL queries processed in environments that support parallel and distributed transaction processing.
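To make the idea of correlation-aware translation concrete, the following is a minimal sketch and not the authors' translator. It assumes a hypothetical in-memory table named orders and a toy run_mapreduce helper standing in for a Hadoop job. Two SQL aggregate queries that scan the same table and group by the same key are first translated naively into one job, then merged into a single job that answers both in one scan.

```python
from collections import defaultdict


def run_mapreduce(rows, mapper, reducer):
    """Toy stand-in for a MapReduce job: map, shuffle by key, then reduce."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical table orders(region, amount); not taken from the paper.
orders = [("east", 10), ("west", 5), ("east", 7), ("west", 1)]


# Query A: SELECT region, SUM(amount) FROM orders GROUP BY region
def map_sum(row):
    region, amount = row
    yield region, amount


def reduce_sum(key, values):
    return sum(values)


# Query B: SELECT region, COUNT(*) FROM orders GROUP BY region
# A and B read the same table and share the grouping key, so a
# correlation-aware translation can emit one job instead of two.
def map_merged(row):
    region, amount = row
    yield region, (amount, 1)


def reduce_merged(key, values):
    total = sum(amount for amount, _ in values)
    count = sum(one for _, one in values)
    return {"sum": total, "count": count}


sum_by_region = run_mapreduce(orders, map_sum, reduce_sum)
merged = run_mapreduce(orders, map_merged, reduce_merged)
```

In this sketch the merged job scans the input once rather than once per query, which is the kind of saving a translator can look for when it detects related queries.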
The authors used a Hadoop cluster of concurrent hardware mappers and reducers with software translators to experimentally investigate the effects of translating the optimization and processing of multiple queries from relational to nonrelational data processing environments. The experiments investigated (1) the relationships between the translation and execution times of simple SQL queries for various data sizes, with results showing a linear increase in the time MapReduce took to execute queries as the data size increased; (2) the capability of the MapReduce translator to enhance complex query processing for different dataset sizes drawn from databases recommended by the Transaction Processing Performance Council (TPC), with the experimental results showing that the processing time of complex queries increased with the dataset size; and (3) the sensitivity of the SQL-to-MapReduce translator to the relationships among multiple queries, which revealed a positive correlation between the execution times and the number of nontrivial translated queries.
I strongly recommend that all big data analysis professionals read the valuable and practical ideas in this paper. The translation of queries in the first normal form (1NF) relational model for processing in distributed environments such as Hadoop might be easy, but how should queries for datasets organized in higher normal forms be translated to capitalize on the current research results? Without a doubt, the authors have opened up new practical research ideas in the area of data mining algorithms.