A new online O ( n ) algorithm for identifying all the occurrences of a given pattern p = p [ 1 , m ] in a given text string t = t [ 1 , n ] is presented. The new algorithm modifies an algorithm due to Colussi [1] and requires at most ( 4 n - m ) &slash; 3 comparisons between characters of p and characters of t; this total does not include the comparisons between characters of p that are required during the preprocessing of p (computation of the periodic decomposition). A companion paper[2] has already established that, in the case m = 3, the identification of all occurrences of p by an online algorithm requires at least ( 4 n - m ) &slash; 3 comparisons; thus there are grounds for conjecturing that in this sense the new algorithm may be optimal.
The new algorithm is derived from a detailed and complex analysis of Colussi’s algorithm. The main idea of Colussi’s algorithm is to process the “noholes” (positions of interrupted periodicity) of the pattern from left to right, and then, only if all noholes match corresponding positions in the text, to compare the remaining positions (“holes”) from right to left. Colussi’s algorithm thus combines elements of the Knuth-Morris-Pratt and Boyer-Moore algorithms. The result is an algorithm in which certain positions of the text may not be tested at all. Based on their analysis, the authors show that Colussi’s algorithm can be improved by handling certain shifts of the pattern in a more efficient way. Both Colussi’s algorithm and the new algorithm operate on patterns that are assumed not to be strongly periodic; the authors show, however, that any such algorithm necessarily gives rise to another O ( n ) online algorithm that matches strongly periodic patterns using no more character comparisons between pattern and text. The authors claim that Colussi’s algorithm (and thus presumably also the new algorithm) performs well in practice; it would be interesting to test these algorithms against other recently reported fast pattern-matching algorithms, such as Wu and Manber’s [3].
This paper is important and interesting, as much for its methodology as its result. It is not for the fainthearted, however: the main result (Theorem 3) is stated after 30 pages of complex exposition including 20 lemmas and 12 “facts.” Not surprisingly, therefore, it is difficult to read--I found that half an hour per page was the best I could do. Although it is generally well written, the paper contains a few minor inaccuracies to distract the reader; it would have been improved, and its reading time reduced, by the use of illustrative examples throughout (it contains none) and a more generous use of drawings and figures.