If there is a topic that has been overhyped, it is big data [1]. A confluence of trends has made it possible to start working with previously unimaginable data in all its volume, velocity, and variety. Most of us do not actually have big data, but we still benefit from the new approaches. What Louridas and Ebert highlight is the blended role of the data scientist, part programmer and part mathematician or statistician, and the tools that support that role.
Louridas and Ebert do a great job framing the challenge the world faces with the ever-increasing amount of data, the need to make it meaningful, and the emergent skills and tools required to do so. They provide concrete examples using data from the World Bank, and they essentially propose an analytics stack: Data-Driven Documents (D3), Python, and R, which matches my experience in practice. Python offers a real programming environment, and R is a true statistical package that combines programming and visualization into a working environment, with a large number of high-quality routines available. Finally, D3, used for the visualization portion of the stack, generates something an end user can consume. These three aspects, programming, statistics, and visualization, are critical to a complete data analytics platform.
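To make the division of labor in that stack concrete, here is a minimal sketch of the Python portion, using a fabricated World Bank-style extract (the file contents, column names, and figures are illustrative, not Louridas and Ebert's actual data). Python does the programming and cleaning, and produces the per-country summary that R or D3 would then analyze or visualize.

```python
import csv
import io
import statistics

# Hypothetical World Bank-style extract: country, year, GDP per capita.
# In practice this would be a downloaded CSV or an API response.
RAW = """country,year,gdp_per_capita
Greece,2012,22082
Greece,2013,21875
Germany,2012,43856
Germany,2013,46286
France,2012,40838
France,2013,42571
"""

def load_rows(text):
    """Parse and clean: coerce types, drop malformed rows."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        try:
            rows.append((rec["country"], int(rec["year"]),
                         float(rec["gdp_per_capita"])))
        except (KeyError, ValueError):
            continue  # the data-preparation step: skip bad records
    return rows

def summarize(rows):
    """Per-country mean, the kind of figure handed to R or D3."""
    by_country = {}
    for country, _year, gdp in rows:
        by_country.setdefault(country, []).append(gdp)
    return {c: statistics.mean(v) for c, v in by_country.items()}

if __name__ == "__main__":
    print(summarize(load_rows(RAW)))
```

Even this toy version shows the pattern: most of the code is preparation and plumbing, and only one line is statistics.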
One challenge is that many of the tools that deal with really big data, such as distributed file systems and managed execution frameworks like Apache Hadoop, are not written to work effectively with the numerous analytics tools available. Still, Hadoop is likely to be part of the overall solution (the mother ship of offline analytics). Unfortunately, most examples deal with flat files, when in fact most systems of any complexity work with databases, feeds, and files, and at different velocities (from end-of-day feeds to real-time sensors). Really big data requires a foundational infrastructure to tackle the largest scale; since most of us do not actually have big data, this may be moot. What is not controversial are the points Louridas and Ebert make: one spends as much time preparing data as analyzing it, and optimization is still likely to come from the software engineering side. All I would add is to make sure you have data scientists leading the analytics [2].
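A small sketch of that heterogeneity, with an in-memory SQLite table standing in for a reference database and a string standing in for an end-of-day flat-file feed (both fabricated for illustration): the join and cleaning logic is ordinary code, which is where the software-engineering effort, and much of the preparation time, actually goes.

```python
import csv
import io
import sqlite3

# Reference data in a database (fabricated schema for illustration).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sensors (id TEXT PRIMARY KEY, site TEXT)")
db.executemany("INSERT INTO sensors VALUES (?, ?)",
               [("s1", "Athens"), ("s2", "Berlin")])

# An end-of-day flat-file feed (also fabricated).
FEED = """sensor_id,reading
s1,17.5
s2,9.1
s1,18.0
"""

# Preparation: join the feed against the database before any analysis.
site_of = dict(db.execute("SELECT id, site FROM sensors"))
readings = {}
for rec in csv.DictReader(io.StringIO(FEED)):
    site = site_of.get(rec["sensor_id"])
    if site is None:
        continue  # unknown sensor: dropped during cleaning
    readings.setdefault(site, []).append(float(rec["reading"]))

# The analysis itself is one line once the data is prepared.
averages = {site: sum(v) / len(v) for site, v in readings.items()}
print(averages)
```

Scale the feed up to a real-time stream and the same join has to move into infrastructure such as Hadoop, but the shape of the problem, reconciling sources at different velocities, stays the same.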