The quantity of data stored online quadruples every 18 months. To transform this data into information and knowledge, it is essential to develop effective data analysis and extraction capabilities. This clearly written paper presents a typical environment in which data is collected, stored, and validated at “ground stations” that send this data to “data product generation centers which generate higher level data products.” The textual representation of this environment on page 94 is substantially better and semantically richer than the pictorial one in Figure 1. The generation cycle of these centers consists of a data product generation phase and a quiet period, during which new data accumulates. The paper does not address the semantics of data product generation, or the semantics of relationships between data, despite the clear need for data collections and data products to be structured, since end users request data products of interest.
A very simple analytical model (based on processor and input/output (I/O) subsystem utilization) is presented to answer such questions as: What is the maximum data product retrieval rate? What is the average data product retrieval time for a given data product retrieval rate? What is the impact of quiet period duration on the data product retrieval time? Some numerical results obtained with this model appear to agree with qualitative considerations. The model and the results are straightforward, but they appear to handle only some aspects of scalability, and not the most interesting (or important) ones. First, data selection of any kind ought to be based on semantic considerations, which the paper does not address. Second, concepts such as parallelism or indexing are not mentioned at all.
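To give a sense of the kind of reasoning a utilization-based model supports, here is a minimal sketch. It is not the paper's model, whose equations are not reproduced in this review. It assumes a textbook open queueing approximation with two resources, and the names `ServiceDemands`, `max_retrieval_rate`, and `avg_retrieval_time`, as well as the numeric service demands, are hypothetical. The quiet period is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class ServiceDemands:
    """Hypothetical per-retrieval service demands, in seconds."""
    cpu: float  # processor time consumed by one data product retrieval
    io: float   # I/O subsystem time consumed by one data product retrieval


def utilizations(rate: float, d: ServiceDemands) -> dict:
    """Utilization law: U_k = X * D_k for each resource k."""
    return {"cpu": rate * d.cpu, "io": rate * d.io}


def max_retrieval_rate(d: ServiceDemands) -> float:
    """The bottleneck resource saturates first: X_max = 1 / max(D_k)."""
    return 1.0 / max(d.cpu, d.io)


def avg_retrieval_time(rate: float, d: ServiceDemands) -> float:
    """Open-network approximation: R = sum over k of D_k / (1 - U_k)."""
    u = utilizations(rate, d)
    if max(u.values()) >= 1.0:
        raise ValueError("retrieval rate is at or above the saturation rate")
    return d.cpu / (1.0 - u["cpu"]) + d.io / (1.0 - u["io"])


if __name__ == "__main__":
    demands = ServiceDemands(cpu=0.02, io=0.05)  # assumed values
    print(max_retrieval_rate(demands))
    print(avg_retrieval_time(10.0, demands))
```

Even in this toy form, the first two questions reduce to simple arithmetic on utilizations, which is consistent with the observation that such a model says nothing about how data products are selected or related.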
Finally, the author emphasizes the need to provide semantically rich ontologies associated with metadata, in order to solve the problem of semantic information integration across different scientific domains, a problem that arises, in particular, from differing terminologies. This important problem, and ways to solve it, have been well known for decades, yet no references are provided in the paper.
In summary, the paper may be a good illustration of Hayek’s observation that, in the use of statistics, we “either deliberately ignore or are ignorant of the relations between the individual elements with different attributes,” while, in dealing with complexity, “it is precisely the[se] relations that matter” [1].