Many documents would be valuable in electronic form, but no electronic representation of them is available. While optical character recognition systems have been available for many years, they are generally restricted to documents that have a highly regular text form. This paper describes an approach to segmenting a printed document (scanned into image format) so that it is partitioned into areas that are recognized as text, image, table, or separating lines, thus allowing each type of region to be analyzed separately.
The algorithm presented is a “parameter-free geometric document analysis method,” which uses a periodicity measure over a pyramidal quad-tree representation of the image. Advantages claimed for the method are that it can resolve touching or overlapping regions, paragraphs that begin with a large image-format character, and single text lines used as headings. Some comparisons with other methods and commercial software are given, against which the proposed method compares quite favorably.
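For readers unfamiliar with the two ingredients named above, the following is a rough illustrative sketch only, not the authors' algorithm. It assumes a grayscale image stored as a list of rows of ink densities, builds a pyramid by averaging 2x2 blocks, and scores periodicity as the peak normalized autocorrelation of the row-sum profile. The function names `build_pyramid`, `row_profile`, and `periodicity` are hypothetical, and the autocorrelation score is a stand-in for whatever measure the paper actually defines.

```python
def build_pyramid(image):
    """Return pyramid levels, finest first.

    Each coarser level averages non-overlapping 2x2 blocks of the level
    below it (an assumed construction; a trailing odd row or column is
    dropped by the floor division).
    """
    levels = [image]
    current = image
    while len(current) > 1 and len(current[0]) > 1:
        h, w = len(current) // 2, len(current[0]) // 2
        current = [
            [
                (current[2 * r][2 * c] + current[2 * r][2 * c + 1]
                 + current[2 * r + 1][2 * c] + current[2 * r + 1][2 * c + 1]) / 4.0
                for c in range(w)
            ]
            for r in range(h)
        ]
        levels.append(current)
    return levels


def row_profile(block):
    """Sum the ink in each row: text lines show up as regular peaks."""
    return [sum(row) for row in block]


def periodicity(profile):
    """Peak normalized autocorrelation over nonzero lags.

    Returns 0.0 for a flat profile. This is an assumed, simplified score,
    not the measure defined in the paper under review.
    """
    n = len(profile)
    mean = sum(profile) / n
    centered = [v - mean for v in profile]
    energy = sum(v * v for v in centered)
    if energy == 0:
        return 0.0
    best = 0.0
    for lag in range(1, n // 2 + 1):
        acc = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        best = max(best, acc / energy)
    return best


# Synthetic 16x16 blocks: "text" is two inked rows followed by two blank rows,
# repeated; "solid" is uniformly inked, like a filled image region.
text_block = [[1.0] * 16 if (r % 4) < 2 else [0.0] * 16 for r in range(16)]
solid_block = [[1.0] * 16 for _ in range(16)]

text_score = periodicity(row_profile(text_block))
solid_score = periodicity(row_profile(solid_block))
pyramid = build_pyramid(text_block)
```

In this toy setting the striped block scores well above the solid one, which is the intuition behind using periodicity to separate text from image regions. Evaluating the score at each pyramid level would let a coarse level flag a candidate region and a finer level refine its boundary.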
The paper is reasonably well written, although at times the authors’ non-English background is apparent. This is most evident in the very first paragraph, in which the opening three sentences contain the stem “increase/decrease” five times. The style improves from there on, however, and the rest of the paper is easy to follow.