Computing Reviews

A parallel algorithm for record clustering
Omiecinski E., Scheuermann P. ACM Transactions on Database Systems15(3):599-624,1990.Type:Article
Date Reviewed: 11/01/92

We start with a large file and a collection of queries to be processed against that file. We assume that the frequency of occurrence of each query and the records that satisfy it are known. The goal is to cluster the records in such a way as to minimize the total number of pages accessed in order to satisfy the collection of queries. In this paper, the authors present a heuristic algorithm whose output is a list of clusters containing the records satisfying different subsets of queries. The general problem is NP-hard, so this algorithm is heuristic and is only near-optimal.

Building on earlier work, the authors use what they call a P-tree, a binary tree whose nodes at level i represent the clusters formed after the first i queries using a precedence order. Because the formation of this tree requires little and regular communication among nodes, it is suitable for execution on a SIMD computer. Although the authors’ model of a SIMD machine is somewhat unusual--the processors are only assumed to have communication paths that link them into a linear array--the model is subsumed by most extant designs and thus is not unrealistic.

The authors have done a number of synthetic experiments testing their algorithm against an adaptive clustering algorithm. They present the conclusion that their method is competitive.

Reviewer:  D. A. Buell Review #: CR115118

Reproduction in whole or in part without permission is prohibited.   Copyright 2024 ComputingReviews.com™
Terms of Use
| Privacy Policy