This paper analyzes two text encoding techniques, defined by the longest match (LM) and greedy (G) heuristics. Both techniques operate over appropriate dictionaries; a dictionary is viewed as a set of pairs (string, code of string), with the strings drawn from a suitable alphabet.
Given a string over an alphabet, the problem of optimal compression can be formulated as a storage minimization problem for that string. Formally, the technique used in the paper is based on a network representation of the dictionary: the space-optimal encoding problem then becomes one of finding a shortest path between a pair of vertices in a network (see Schuegraf and Heaps [1] and Storer [2]).
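The network formulation can be illustrated with a minimal Python sketch (not taken from the paper; the dictionary representation and bit counts are illustrative assumptions). Positions 0..n in the text are the vertices; an edge runs from i to j whenever text[i:j] is a dictionary string, weighted by the bit length of its code. Since the edges only move forward, the shortest path can be computed by a simple left-to-right dynamic program.

```python
def optimal_encoding_cost(text, dictionary):
    """Minimum number of code bits needed to encode `text`.

    `dictionary` maps each dictionary string to the bit length of its
    code (an illustrative representation, not the paper's own).
    cost[i] is the shortest-path distance from vertex 0 to vertex i,
    i.e., the cheapest encoding of the prefix text[:i].
    """
    n = len(text)
    INF = float("inf")
    cost = [INF] * (n + 1)
    cost[0] = 0
    for i in range(n):
        if cost[i] == INF:
            continue  # prefix text[:i] is not encodable
        for s, bits in dictionary.items():
            if text.startswith(s, i):  # edge i -> i + len(s)
                j = i + len(s)
                cost[j] = min(cost[j], cost[i] + bits)
    return cost[n]
```

For example, with a code-uniform dictionary {"a": 8, "b": 8, "ab": 8, "aba": 8}, the text "abab" costs 16 bits (two dictionary strings) rather than 32 (four single symbols).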
The first heuristic, the LM algorithm, simply chooses the longest dictionary string that matches the remaining text, processing the string from left to right. In particular, the authors prove that for a code-uniform suffix dictionary (that is, the code length is constant and the dictionary is closed under taking suffixes of its strings), the LM algorithm is optimal for any string. This result is a corollary of the authors' estimate of the ratio C/O, where C is the length of the compressed string and O is the optimal storage length.
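The left-to-right longest-match rule can be sketched in a few lines of Python (a minimal illustration, assuming the dictionary is given as a set of strings and contains a match at every position, e.g., all single symbols):

```python
def longest_match_parse(text, dictionary):
    """Parse `text` by repeatedly taking the longest dictionary string
    that matches at the current position (the LM heuristic)."""
    parse, i = [], 0
    while i < len(text):
        best = max((s for s in dictionary if text.startswith(s, i)),
                   key=len, default=None)
        if best is None:
            raise ValueError(f"no dictionary string matches at position {i}")
        parse.append(best)
        i += len(best)
    return parse
```

With the dictionary {"a", "b", "ab", "aba"} and text "abab", LM takes "aba" first and then "b"; under a code-uniform dictionary both parses of length two are equally good, matching the optimality result above.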
The G algorithm [3] is based on a best local compression criterion, defined as the difference between the number of bits in the matched portion of the input string and the number of bits in the corresponding code. In general, the LM and G algorithms are incomparable with respect to compression ratio; for any code-uniform dictionary, however, the two methods are equally powerful. For general dictionaries, the authors show that the algorithms behave similarly. All of the precise results give upper bounds on the compression ratio for both algorithms. The authors point out that one problem remains open: what is the precise upper bound for the LM method on suffix dictionaries?
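The local criterion can be sketched as follows (again a hypothetical illustration, not the paper's code; `symbol_bits` is an assumed fixed cost per raw input symbol):

```python
def greedy_parse(text, dictionary, symbol_bits=8):
    """Parse `text` by the G heuristic: at each position, choose the
    dictionary string with the largest local saving, i.e., the
    difference (raw input bits covered) - (bits of its code).

    `dictionary` maps each string to the bit length of its code.
    """
    parse, i = [], 0
    while i < len(text):
        best, best_saving = None, None
        for s, code_bits in dictionary.items():
            if text.startswith(s, i):
                saving = len(s) * symbol_bits - code_bits
                if best is None or saving > best_saving:
                    best, best_saving = s, saving
        if best is None:
            raise ValueError(f"no dictionary string matches at position {i}")
        parse.append(best)
        i += len(best)
    return parse
```

For instance, with the non-uniform dictionary {"a": 8, "b": 8, "ab": 4}, G takes "ab" at each step of "abab" because its local saving (16 - 4 = 12 bits) beats that of "a" (0 bits); with a code-uniform dictionary the criterion reduces to longest match, which is why the two methods coincide there.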
It would be interesting to apply this technique to adaptive compression methods (with nonuniform dynamic dictionaries) and, more generally, to prefix dictionaries. A dictionary tuned to a given set of database statistics would be another interesting application, since in practice database compression relies chiefly on dictionary-based methods.
The paper is well written and easy to read. It provides the necessary details and is well organized. It will be useful to those interested in text compression, including adapting these heuristics to databases.