This paper presents a compression technique designed for the Hebrew language. It relies on the well-known Burrows-Wheeler algorithm, preceded by a preprocessing step that exploits the fact that Hebrew words are derived from roots of two, three, or four letters, with morphology characterized by infixing additional letters.
These patterns, apart from a few exceptions such as the special forms of some final letters, are applied in a first step, and roots are extracted where possible. The Burrows-Wheeler algorithm is then used to compress both resulting files. Hebrew is normally written without vowels; when they do appear, they take the form of diacritical marks. The main computational obstacle to this method is choosing the set of patterns to use. The paper employs a greedy method, but does not provide details.
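To make the two-step idea concrete, here is a minimal sketch of a root/pattern split followed by Burrows-Wheeler-based compression of the two resulting streams. Everything specific in it is an assumption for illustration and is not taken from the paper: the `*` slot notation, the tiny hand-written `PATTERNS` list, the first-match rule, the Latin transliteration of Hebrew consonants (root k-t-b), and the function names `split_word`, `merge`, and `compress_streams`. Python's standard `bz2` module (bzip2, which is built on the Burrows-Wheeler transform) stands in for whatever implementation the authors used.

```python
import bz2

SLOT = "*"  # hypothetical notation: marks a root-letter position in a pattern

# Hypothetical, hand-picked patterns in Latin transliteration.
# Choosing this set well is the hard part the review mentions.
PATTERNS = ["m***", "*w**", "**y*h", "***"]


def split_word(word, patterns=PATTERNS):
    """Return (root, pattern) for the first matching pattern.

    Words that match no pattern are returned as ("", word), i.e. the
    whole word is kept literally in the pattern stream.
    """
    for pattern in patterns:
        if len(pattern) != len(word):
            continue
        if all(p == SLOT or p == c for p, c in zip(pattern, word)):
            root = "".join(c for p, c in zip(pattern, word) if p == SLOT)
            return root, pattern
    return "", word


def merge(root, pattern):
    """Inverse of split_word: put the root letters back into the slots."""
    letters = iter(root)
    return "".join(next(letters) if p == SLOT else p for p in pattern)


def compress_streams(words):
    """Split every word, then compress roots and patterns separately."""
    pairs = [split_word(w) for w in words]
    roots = "\n".join(root for root, _ in pairs)
    rest = "\n".join(pattern for _, pattern in pairs)
    return bz2.compress(roots.encode("utf-8")), bz2.compress(rest.encode("utf-8"))


words = ["mktb", "kwtb", "ktybh", "xyzw"]
roots_blob, rest_blob = compress_streams(words)
```

In this toy input, three of the four words share the root `ktb`, so the root stream becomes highly repetitive, which is the kind of regularity a Burrows-Wheeler compressor can exploit. The sketch says nothing about how much is gained on real text; that depends on the pattern set, which the paper selects by a greedy method whose details are not given.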
The paper’s results suggest that the morphological features of a language can be exploited to yield material improvements in compression. The paper also contains interesting information on the Hebrew language, and is part of a continuing research project on Hebrew text compression.