ComputingReviews.com

Efficient discovery of longest-lasting correlation in sequence databases
Li Y., U L., Yiu M., Gong Z. The VLDB Journal: The International Journal on Very Large Data Bases 25(6): 767-790, 2016. Type: Article

Date Reviewed: 03/22/17

Large quantities of data are stored in databases every day, and sooner or later these data must be extracted according to different criteria; the extracted data form sequences. An important problem is not only extracting a sequence of a given length, but also finding how many such sequences the database contains. Until now, the standard tool for such extractions has been the Pearson correlation. This paper presents the k-longest-lasting correlated subsequence (kLCS) query, a technique that extends the Pearson correlation, overcomes its limitations, and offers computational advantages in terms of both execution time and resource allocation. kLCS queries are built on the Pearson correlation but perform much more efficiently, mainly because they do not need any prior knowledge of the query length.
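To make the idea concrete, here is a minimal brute-force sketch of what such a query might compute. It assumes a kLCS query takes a query sequence, a set of candidate sequences, a correlation threshold, and k, and returns the k longest aligned windows whose Pearson correlation with the query reaches the threshold. That reading, and every function and parameter name below, is an assumption made for illustration; it is not the authors' algorithm and has none of its efficiency.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def longest_correlated_window(query, candidate, threshold, min_len=2):
    """Return (length, start) of the longest aligned window whose
    correlation is >= threshold, or None if no window qualifies."""
    n = min(len(query), len(candidate))
    # Try the longest windows first, so the first hit is the answer.
    for length in range(n, min_len - 1, -1):
        for start in range(n - length + 1):
            end = start + length
            if pearson(query[start:end], candidate[start:end]) >= threshold:
                return (length, start)
    return None


def klcs_brute_force(query, database, k, threshold, min_len=2):
    """Hypothetical helper: database maps a name to a sequence.
    Returns up to k tuples (length, name, start), longest first."""
    results = []
    for name, seq in database.items():
        hit = longest_correlated_window(query, seq, threshold, min_len)
        if hit is not None:
            results.append((hit[0], name, hit[1]))
    results.sort(key=lambda r: -r[0])
    return results[:k]
```

Note that the caller never supplies a window length: the search itself discovers how long the correlation lasts, which is the property the review highlights. The price of this naive version is that it recomputes the correlation from scratch for every window.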

The paper starts by explaining that kLCS is not only a standalone technique, but can also be applied to a number of existing querying techniques, either indexed or nonindexed, such as skipping cumulative arrays (nonindexed) and piecewise aggregate approximation (indexed); indexed techniques, as expected, perform better. It then presents some new techniques, developed by the authors, that embed kLCS in their very structure: the zigzag execution strategy (nonindexed) and the diamond cover index (indexed). Again, the paper shows not only that indexed techniques perform better than nonindexed ones, but also that techniques that use kLCS natively perform better than those to which kLCS is retrofitted. The paper describes all of these techniques both formally and as high-level algorithms. It also gives great importance to data location: performance varies depending on whether data reside in main memory or on external disks.
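As background for the cumulative-array family mentioned above, the following sketch shows the general prefix-sum idea: after one linear pass over two sequences, the Pearson correlation of any aligned window can be computed in constant time. This is a plain cumulative array, not the skipping variant the paper evaluates, and the function names are hypothetical.

```python
from itertools import accumulate
from math import sqrt


def build_prefix_sums(x, y):
    """Prefix sums of x, y, x*x, y*y and x*y, each with a leading 0."""
    def prefix(values):
        return [0.0] + list(accumulate(values))

    return {
        "x": prefix(x),
        "y": prefix(y),
        "xx": prefix(a * a for a in x),
        "yy": prefix(b * b for b in y),
        "xy": prefix(a * b for a, b in zip(x, y)),
    }


def window_correlation(p, start, length):
    """Pearson correlation over [start, start + length) in O(1),
    using the sums-of-products form of the formula."""
    end = start + length
    sx = p["x"][end] - p["x"][start]
    sy = p["y"][end] - p["y"][start]
    sxx = p["xx"][end] - p["xx"][start]
    syy = p["yy"][end] - p["yy"][start]
    sxy = p["xy"][end] - p["xy"][start]
    cov = length * sxy - sx * sy
    var_x = length * sxx - sx * sx
    var_y = length * syy - sy * sy
    if var_x <= 0 or var_y <= 0:
        return 0.0  # a constant window has no defined correlation
    return cov / sqrt(var_x * var_y)
```

The trade-off is extra memory for the five arrays, and the sums-of-products form can lose floating-point precision on long sequences with large values.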

The last part of the paper presents results from a large number of tests, on data residing both in memory and on external disks, using datasets built specifically for testing as well as real-world datasets taken from financial, atmospheric, and pattern recognition environments. The results show that, for all datasets used, large differences in performance arise both from data location and from the presence of indexes. The real worth of the paper lies in this last part: although meant for an academic audience, it highlights the relative strengths and weaknesses of the different data sequence extraction techniques available today.

Reviewer: Andrea Paramithiotti

Review #: CR145133 (1706-0388)
