Computing Reviews
Efficient discovery of longest-lasting correlation in sequence databases
Li Y., U L., Yiu M., Gong Z. The VLDB Journal: The International Journal on Very Large Data Bases 25(6): 767-790, 2016. Type: Article
Date Reviewed: Mar 22 2017

Large quantities of data are stored in databases every day, and sooner or later these data must be extracted according to different criteria; the results of such extractions are data sequences. An important problem is not only to extract a sequence of a given length, but also to find how many such sequences the database contains. Until now, the standard tool for such extractions has been the Pearson correlation. This paper presents the k-longest-lasting correlated subsequence (kLCS) query, a technique that extends the Pearson correlation, overcomes its limitations, and offers computational advantages in both execution time and resource allocation. kLCS queries are built on the Pearson correlation but perform much more efficiently, mainly because they do not require prior knowledge of the query length.
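
To make the idea of a longest-lasting correlated subsequence concrete, the following is a minimal brute-force sketch, not the method of the paper under review. It assumes that a query and a stored series are aligned position by position, and that the goal is the longest aligned window whose Pearson correlation reaches a threshold, with no window length fixed in advance. The function names `pearson` and `longest_correlated_window` and the `min_len` parameter are hypothetical, introduced only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant (the correlation is
    undefined in that case; 0.0 is an assumption made for this sketch).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def longest_correlated_window(query, series, threshold, min_len=3):
    """Return (start, length) of the longest aligned window whose
    correlation is at least `threshold`, or None if no window qualifies.

    Brute force: every window length is tried, longest first, so the
    caller never has to supply a length.
    """
    n = min(len(query), len(series))
    for length in range(n, min_len - 1, -1):
        for start in range(n - length + 1):
            end = start + length
            if pearson(query[start:end], series[start:end]) >= threshold:
                return start, length
    return None


if __name__ == "__main__":
    q = [1, 2, 3, 4, 5, 1]
    s = [2, 4, 6, 8, 10, 50]
    print(longest_correlated_window(q, s, threshold=0.99))
```

This exhaustive search recomputes the correlation for every candidate window, which is the kind of cost that dedicated execution strategies and indexes are meant to avoid.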

The paper starts by explaining that kLCS is not only a standalone technique, but can also be applied to a number of existing querying techniques, both nonindexed, such as skipping cumulative arrays, and indexed, such as piecewise aggregate approximation; the indexed techniques, as expected, perform better. It then presents new techniques, developed by the authors, that embed kLCS in their very structure: the zigzag execution strategy (nonindexed) and the diamond cover index (indexed). Again, the paper shows not only that indexed techniques perform better than nonindexed ones, but also that techniques using kLCS natively perform better than those to which kLCS is retrofitted. The paper describes all of these techniques both formally and as high-level algorithms. Great importance is also given to data location: performance varies depending on whether the data reside in main memory or on external disks.
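
For readers unfamiliar with piecewise aggregate approximation, the sketch below shows the general, textbook form of the technique: a sequence is cut into equal-sized segments and each segment is replaced by its mean. It does not reproduce how the paper adapts the approximation to kLCS queries or to indexing. The function name `paa` is hypothetical, and the sketch assumes the sequence length is an exact multiple of the number of segments.

```python
def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-sized segment.

    Assumes len(series) is an exact multiple of `segments`; other lengths
    are rejected to keep the sketch simple.
    """
    n = len(series)
    if segments <= 0 or n == 0 or n % segments != 0:
        raise ValueError("series length must be a positive multiple of segments")
    size = n // segments
    return [sum(series[i:i + size]) / size for i in range(0, n, size)]


if __name__ == "__main__":
    # Six values reduced to three segment means.
    print(paa([1, 3, 5, 7, 2, 4], 3))
```

The reduced representation is shorter than the original sequence, which is what makes it attractive as a basis for an index.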

The last part of the paper presents results from a large number of tests on data residing both in memory and on external disks, using datasets made to order for testing purposes as well as real-world datasets taken from financial, atmospheric, and pattern recognition environments. The results show that, for all datasets used, large differences in performance arise from both data location and the presence of indexes. The real worth of the paper lies in this last part: although meant for an academic audience, it highlights the relative strengths and weaknesses of the data sequence extraction techniques available today.

Reviewer:  Andrea Paramithiotti Review #: CR145133 (1706-0388)
Time Series Analysis (G.3 ... )
 
 
Database Management (H.2 )
 

Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®