This paper analyzes two text encoding techniques, defined by the longest match (LM) and greedy (G) heuristics. Both techniques operate over appropriate dictionaries; a dictionary is viewed as a set of pairs (string, code of string), with the strings drawn from a suitable alphabet.
Given a string over an alphabet, the problem of optimal compression can be formulated as a storage minimization problem for that string. Formally, the technique used in the paper is based on a network representation of the dictionary: the space-optimal encoding problem then becomes one of finding a shortest path between a pair of vertices in a network (see Schuegraf and Heaps [1] and Storer [2]).
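The network formulation can be illustrated with a minimal Python sketch (not taken from the paper; the dictionary representation and bit counts are illustrative assumptions). Positions 0..n in the text are the vertices; an edge runs from i to j whenever text[i:j] is a dictionary string, weighted by the bit length of its code. Since the edges only move forward, the shortest path can be computed by a simple left-to-right dynamic program.

```python
def optimal_encoding_cost(text, dictionary):
    """Minimum number of code bits needed to encode `text`.

    `dictionary` maps each dictionary string to the bit length of its
    code (an illustrative representation, not the paper's own).
    cost[i] is the shortest-path distance from vertex 0 to vertex i,
    i.e., the cheapest encoding of the prefix text[:i].
    """
    n = len(text)
    INF = float("inf")
    cost = [INF] * (n + 1)
    cost[0] = 0
    for i in range(n):
        if cost[i] == INF:
            continue  # prefix text[:i] is not encodable
        for s, bits in dictionary.items():
            if text.startswith(s, i):  # edge i -> i + len(s)
                j = i + len(s)
                cost[j] = min(cost[j], cost[i] + bits)
    return cost[n]
```

For example, with a code-uniform dictionary {"a": 8, "b": 8, "ab": 8, "aba": 8}, the text "abab" costs 16 bits (two dictionary strings) rather than 32 (four single symbols).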
The first heuristic, the LM algorithm, simply chooses the longest dictionary string that matches the remaining text, processing the string from left to right. In particular, the authors prove that for a code-uniform suffix dictionary (that is, the code length is constant and the dictionary is closed under taking suffixes of its strings), the LM algorithm is optimal for any string. This result is a corollary of the authors' estimate of the ratio C/O, where C is the length of the compressed string and O is the optimal storage length.
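The left-to-right longest-match rule can be sketched in a few lines of Python (a minimal illustration, assuming the dictionary is given as a set of strings and contains a match at every position, e.g., all single symbols):

```python
def longest_match_parse(text, dictionary):
    """Parse `text` by repeatedly taking the longest dictionary string
    that matches at the current position (the LM heuristic)."""
    parse, i = [], 0
    while i < len(text):
        best = max((s for s in dictionary if text.startswith(s, i)),
                   key=len, default=None)
        if best is None:
            raise ValueError(f"no dictionary string matches at position {i}")
        parse.append(best)
        i += len(best)
    return parse
```

With the dictionary {"a", "b", "ab", "aba"} and text "abab", LM takes "aba" first and then "b"; under a code-uniform dictionary both parses of length two are equally good, matching the optimality result above.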
The G algorithm [3] is based on a best local compression criterion, defined as the difference between the number of bits in the matched portion of the input string and the number of bits in the corresponding code. In general, the LM and G algorithms are incomparable with respect to compression ratio; for any code-uniform dictionary, however, the two methods are equally powerful. For general dictionaries, the authors show that the algorithms behave similarly. All of the precise results give upper bounds on the compression ratio for both algorithms. The authors point out that one problem remains open: what is the precise upper bound for the LM method on suffix dictionaries?
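The local criterion can be sketched as follows (again a hypothetical illustration, not the paper's code; `symbol_bits` is an assumed fixed cost per raw input symbol):

```python
def greedy_parse(text, dictionary, symbol_bits=8):
    """Parse `text` by the G heuristic: at each position, choose the
    dictionary string with the largest local saving, i.e., the
    difference (raw input bits covered) - (bits of its code).

    `dictionary` maps each string to the bit length of its code.
    """
    parse, i = [], 0
    while i < len(text):
        best, best_saving = None, None
        for s, code_bits in dictionary.items():
            if text.startswith(s, i):
                saving = len(s) * symbol_bits - code_bits
                if best is None or saving > best_saving:
                    best, best_saving = s, saving
        if best is None:
            raise ValueError(f"no dictionary string matches at position {i}")
        parse.append(best)
        i += len(best)
    return parse
```

For instance, with the non-uniform dictionary {"a": 8, "b": 8, "ab": 4}, G takes "ab" at each step of "abab" because its local saving (16 - 4 = 12 bits) beats that of "a" (0 bits); with a code-uniform dictionary the criterion reduces to longest match, which is why the two methods coincide there.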
It would be interesting to apply this technique to adaptive compression methods (with nonuniform dynamic dictionaries) and, more generally, to prefix dictionaries. A dictionary tuned to a given set of database statistics would be another interesting application, since in practice database compression relies chiefly on dictionary-based methods.
The paper is well written and easy to read. It provides the necessary details and is well organized. It will be useful to those interested in text compression, including adapting these heuristics to databases.