Practical implementation of a data lake is a puzzling book. Full of buzzwords and acronyms, it rarely explains them. The writing is colloquial in the extreme, closer to transcripts of informal discussions between experts, or to well-kept class notes, than to the style expected of a book on technical management. This approach initially lends a captivating pathos, which is soon overwhelmed by a plethora of useful information that is easy to access only if the reader submits to the vocabulary and habits of thought that match the text’s style (or lack of it). The following chapter titles, followed by my interpretation of their content, may provide perspective.
“Understanding ‘the Ask’” (chapter 1) is an overview of requirements elicitation for large projects. The content is story driven, and more than worthwhile. It may represent the best part of this work, and may be enough to justify its purchase. “Enabling the Security Model” (chapter 2) is a highlight of relevant topics in information technology (IT) security. I don’t believe it’s intended as, and would not consider it, a comprehensive list. “Enabling the Organizational Structure” (chapter 3) is about identifying stakeholders. “The Data Lake Setup” (chapter 4) is a set of high-level, sensible architectural guidelines for the data lake. The chapter’s stated objective--“detailed design of the data lake”--is achieved only if sufficient agreement exists on the otherwise undefined meaning of “detailed.” “Production Playground” (chapter 5) addresses high-level architectural and design principles in delivering accessible, relevant, and actionable data to end users.
“Production Operationalization [sic]” (chapter 6) cursorily addresses mechanisms to deliver according to the principles of chapter 5. I found it grossly inadequate. “Miscellaneous” (chapter 7) is a smorgasbord of generic advice on managing teams, cloud services (focused on Amazon Web Services), frameworks, and other technologies.
The content of this book is often structured as tables, or as overly extensive bullet lists, and includes numerous illustrations. A well-meant repurposing of this informal material may produce skeleton checklists, which, with reasonable effort by the reader, can become useful when adapted to specific circumstances.
The book is explicitly aimed at data scientists and architects, machine learning engineers, and software engineers. Overall, the information provided, once the reader masters its presentation, is valuable for project-leadership-minded individuals who give up all hope of finding “detail” in the engineering sense of the word. This book could read as a primer on managing large projects with extensive storage needs if it included a glossary, which it does not. It is terse (200 pages of large print) and in dire need of editing.