Computing Reviews, the leading online review service for computing literature.

Search

Deterministic sampling
Vishkin U. SIAM Journal on Computing20 (1):22-40,1991.Type:Article

Date Reviewed: Jun 1 1992

Pattern matching is a basic operation in computing: in editors and operating systems, in databases, in certain programming languages, and in several areas of artificial intelligence. Pattern matching algorithms often have two phases: a preprocessing of the pattern (pattern analysis) and the actual search of the text (text analysis). The task addressed in this paper is the identification of all starting locations of a fixed linear pattern string in a longer text string. A string is a sequence of characters that are to be matched exactly. Many applications, however, require more powerful types of matching, including regular expressions, approximate matching, higher dimensional matching, and unification. This algorithm is presented for the weakest form of the problem, although the author gives pointers to extension to some of the above variants. Vishkin’s motivation for his results in this area was the problem of parallel pattern matching, that is, pattern matching using multiple processors. The serialization of his parallel algorithm also breaks new ground, however. His introductory review section considers serial algorithms, randomized algorithms, and finally parallel algorithms. For a pattern string of length m and a text string of length n, the text analysis phase is deterministic and requires O(logn) operations (linear serial time) and O(logn) parallel time; the pattern analysis phase in its randomized form takes (with high probability) only O(log m) parallel time, while the deterministic form requires O(log²m/log log m). (logn is the number of stages of successive application of log to n before the result is reduced to 1 or less.) These phases and forms may be executed with an optimal number of processors; that is, they exhibit optimal parallel speedup. A constant-time text analysis (requiring a non-optimal O(n log m) processors) is also possible. The algorithm proceeds from the idea that, for each initial position of the pattern string in the text string, it is useful to test a relatively small number of sample characters from the pattern string for a match with the corresponding position of the text string. It also makes use of the fact that periodic pattern strings may be treated as a sequence of overlapping maximal nonperiodic strings. Using samples from this maximal nonperiodic string will disqualify many non-matches. The observation in the author’s earlier work was that many additional locations may be eliminated--all but one of the nondisqualified locations within a range equal to half the length of the (nonperiodic) pattern. The potentially overlaid copies must be distinguishable in at least one character position or column. For this purpose, Vishkin introduced an array WITNESS, which identifies columns that distinguish copies overlapped to the different possible extents. The new ideas, which make the improved results of this paper possible, focus on a deterministic sample, so-called because it can be produced by a deterministic algorithm with the help of the WITNESS. In fact, the definitions of the WITNESS and the deterministic sample do allow some latitude under certain conditions. The deterministic sample is a set of columns logarithmic in the size of the pattern. It has the property that a set of overlapping copies includes one distinguished copy that intersects all columns at the labeling character, while every other copy intersects at least one column at a character distinct from the label. The pattern analysis algorithm uses the WITNESS to include in the sample a column that distinguishes a (consistently but arbitrarily) specified pair of copies. The analysis then continues inductively with the smaller of the two subsets of the undistinguished copies that agree with one or the other of the specified copies in the identified column. The use of the deterministic sample in the text analysis phase is relatively straightforward, but the pattern analysis algorithm is rather subtle. Vishkin is, however, at pains to annotate and introduce his formal presentation of the algorithms with pointers to the reader, which indeed facilitate a quicker understanding of them. The paper is presented as a theoretical work, and the algorithms are not presented in anything approaching pseudocode: they will require thorough understanding by the practitioner who wants to implement them. No benchmarks or evidence of implementation are presented. It is not immediately clear that Vishkin’s techniques can usefully be applied to the full range of search problems mentioned in this review. Probably techniques similar to that used to handle periodicity could be developed, but the neighborhood in which the disqualification property will operate may be limited to individual keywords or string-like segments. The paper, however, gives some pointers to the handling of higher dimensional patterns and approximate matches, and applications clearly exist that could make use of the algorithm as it stands. I have no better way of expressing the significance of this work than to repeat Vishkin’s own surprising perspective: “assuming some preprocessing of the pattern, the text analysis problem is actually easier than finding the maximum among n elements.”

Reviewer: David Powers	Review #: CR115381

Pattern Matching (F.2.2 ... )

Indexing Methods (H.3.1 ... )

Search Process (H.3.3 ... )

Deduction And Theorem Proving (I.2.3 )

Would you recommend this review?

yes

Other reviews under "Pattern Matching":	Date

Two-way string-matching Crochemore M., Perrin D. (ed) Journal of the ACM 38(3): 650-674, 1991. Type: Article	Sep 1 1992

On the exact complexity of string matching: lower bounds Galil Z., Giancarlo R. SIAM Journal on Computing 20(6): 1008-1020, 1991. Type: Article	Jan 1 1993

Alphabet independent two dimensional matching Amir A., Benson G., Farach M. Theory of computing (, Victoria, British Columbia, Canada, May 4-6, 1992)681992. Type: Proceedings	Jul 1 1993

more...

Reproduction in whole or in part without permission is prohibited. Copyright 1999-2024 ThinkLoud^®
Terms of Use | Privacy Policy