This paper presents a detailed and practical description of methods for recognizing characters and words in badly printed materials. The authors compare their work with commercial optical character recognition (OCR) software and find that their method recognizes substantially more words than the commercial products when processing smudged or otherwise poorly printed texts.
The most difficult problem they solve is segmenting text that was badly printed or had deteriorated before it could be scanned, so that individual letters are not cleanly separated by white space. The authors use feature detection to recognize when a single shape must be made up of overlapping shapes. They also use a shape equivalent of dynamic time warping to recognize distorted letter images. They characterize letters with six features, which include the upper and lower envelopes, the projections on both axes, and the number of transitions across vertical and horizontal slices through the image. The authors compare two approaches to combining adjacent letters: edit distance and a linear matching algorithm that does not consider rearrangements. Linear matching is perhaps eight times faster on words of eight or more letters, with only a slight loss of accuracy.
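To make the feature set concrete, the following is a minimal sketch of how such profiles could be computed and compared. It is not the authors' code. It assumes a letter image given as a nested list of 0/1 pixels (1 = ink), the function names `letter_features` and `dtw_distance` are hypothetical, and the comparison shown is the classic one-dimensional dynamic time warping, not the shape variant the paper describes.

```python
def transitions(pixels):
    """Count ink/background changes along one slice of the image."""
    return sum(1 for a, b in zip(pixels, pixels[1:]) if a != b)


def letter_features(img):
    """Compute six profile features for a binary image (assumed: 1 = ink)."""
    height, width = len(img), len(img[0])
    cols = [[row[x] for row in img] for x in range(width)]
    return {
        # Row of the topmost ink pixel per column (height if column is empty).
        "upper_envelope": [c.index(1) if 1 in c else height for c in cols],
        # Row of the lowest ink pixel per column (-1 if column is empty).
        "lower_envelope": [
            height - 1 - c[::-1].index(1) if 1 in c else -1 for c in cols
        ],
        # Ink counts projected onto each axis.
        "column_projection": [sum(c) for c in cols],
        "row_projection": [sum(r) for r in img],
        # Transitions across vertical and horizontal slices.
        "column_transitions": [transitions(c) for c in cols],
        "row_transitions": [transitions(r) for r in img],
    }


def dtw_distance(a, b):
    """Classic 1-D dynamic time warping distance between two profiles."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[len(a)][len(b)]


# A 4x4 toy glyph shaped like the letter "L".
GLYPH_L = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]

features = letter_features(GLYPH_L)
# A horizontally stretched copy of the upper envelope warps back at no cost.
stretched = [0, 0, 3, 3, 3, 3]
warp_cost = dtw_distance(features["upper_envelope"], stretched)
```

Warping lets a profile from a stretched or squeezed letter align with its template, which is why a warping-based comparison suits distorted print better than a fixed column-by-column difference.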
A careful evaluation was undertaken using twelve 19th-century books and three 16th-century books. The performance improvement over commercial software was much greater with the older books, although the accuracy level with those books was, of course, lower. Although the computational cost of their segmentation algorithm is greater than linear, the overall cost remains affordable given the limited length of words.
I found the paper to be well written, with particularly useful illustrations of the proposed techniques.