Computing Reviews, the leading online review service for computing literature.

Data simplification: taming information with open source tools
Berman J., Morgan Kaufmann Publishers Inc., San Francisco, CA, 2016. 398 pp. Type: Book (978-0-128037-81-2)

Date Reviewed: Nov 28 2017

The premise of this book is a good one. We are faced with dramatically increasing volumes of ever more complex data, and we need to figure out how to simplify that data in order to make sense of it. Further, since the rise of big data coincides (roughly) with the rapid rise of open-source software, there are plenty of free tools out there that will help us simplify this data. So far, so good. But the devil, as they say, is in the details, and the devil becomes obvious almost immediately.
One would think that a book with the word “simplification” in its title would itself be guided by principles of simplification. Sadly, that is not the case. Here are a few examples. There is a glossary at the end of the preface. The preface itself is four-and-a-half pages long, and the glossary is ten pages, followed by two pages of references. If we are striving for simplicity, we are not getting off on the right foot.
Chapter 1 begins with an argument in favor of simplification, which is self-defeating because the argument itself is far more complicated than it needs to be. It doesn’t help that the chapter ends with a discussion of open-source software (how did we get there?), followed by another glossary, followed by a few more pages of references. It is confusing and overwhelming, but a theme is beginning to emerge. A great deal of simplification can be achieved through organization, more specifically, by organizing one’s thoughts. But that point seems to have eluded the author.
Looking ahead to chapter 2, at least the title, “Structuring Text,” is promising. Structuring text is needed for document translation and text data mining, so maybe things will get better.
The chapter’s first section is titled “The Meaninglessness of Free Text,” and the author begins with the assertion, “English is such a ridiculous language that an objective observer might guess that it was designed for the purpose of impeding communication.” One cannot help but think that this accusation could apply to the author’s writing style as well. Nonetheless, we do get a few interesting tidbits on the problems of translating or interpreting text, such as homonyms and Janus sentences, which are words, phrases, or sentences that mean their opposite. The problem here, by analogy, is that the author is promoting the value of a walk on the beach while emphasizing the hot sand, rocks, broken glass, and jellyfish. The chapter continues with more “Janus” simplification and more open-source examples (the purpose of which is not always clear), followed by another glossary and more references.
Chapter 3, “Indexing Text,” is more of the same: some interesting tidbits in a sea of confusion, with some open-source examples, followed by a glossary and references. The main contribution that I found in this chapter was the line, “Every book, regardless of its topic, is a representation of the mind of the author.” So as not to appear snarky or mean-spirited, I will not explore that any further.
Chapter 4, “Understanding Your Data,” begins with some very basic questions, such as: Are the datasets complete? Is the data annotated with metadata? Do the data objects have unique identifiers? This is a promising start, but next comes a curious question: Is the dataset annotated with basic Dublin Core information? I recalled encountering the phrase “Dublin Core” in an earlier chapter, but did not know what it was. So, I looked in the glossary. Sadly, it wasn’t there. I looked back at previous glossaries and found a definition of “Dublin Core metadata” in the glossary at the end of chapter 1. Are they the same? Well, there was already a question about metadata, which suggests they are not. On the other hand, maybe it is one of those pesky homonyms mentioned in chapter 2.
It would be cruel to continue, as the flaws in this book are becoming abundantly clear. It completely fails at its primary objective, which is simplification. It is poorly organized in the extreme. There is some good information, and clearly a lot of work went into it. But the good information is rarely on topic and is usually buried in an abundance of other information, the purpose of which is often unclear. In addition, the author focuses more on problems with data simplification than on solutions or best practices. And the open-source tools are not explained well enough for the reader to be able to choose the right tool for a given problem.
Data simplification is a major problem today, and data scientists need some guidance on how to address it. A book of this type is badly needed. Unfortunately, this book falls very short of the need.

Reviewer: J. M. Artz	Review #: CR145678 (1802-0046)

General (H.2.0)
