Computing Reviews, the leading online review service for computing literature.

Word spotting in historical printed documents using shape and sequence comparisons
Khurshid K., Faure C., Vincent N. Pattern Recognition 45(7): 2598-2609, 2012. Type: Article

Date Reviewed: Mar 11 2013

This paper presents a detailed and practical description of methods for character and word recognition in badly printed materials. The authors compare their work with commercial optical character recognition (OCR) software and find that their method recognizes substantially more words than the commercial products when processing smudged or otherwise poorly printed texts.

The most difficult problem they solve is the segmentation of text that was either badly printed or had deteriorated before it could be scanned, so that individual letters are not cleanly separated by white space. The authors use feature detection to recognize when a single shape must be made up of overlapping shapes. They also use a shape equivalent of dynamic time warping to recognize distorted letter images. Six features characterize each letter: the upper and lower envelopes, the projections on both axes, and the number of transitions across vertical and horizontal slices through the image.

Two approaches to combining adjacent letters are compared: edit distance and a linear matching algorithm that does not consider rearrangements. Linear matching is perhaps eight times faster computationally on words of eight or more letters, with only a slight loss of accuracy.

A careful evaluation was undertaken using twelve 19th-century books and three 16th-century books. The performance improvement over commercial software was much greater with the older books, although the accuracy level with those books was, of course, lower. Although the computational cost of the segmentation algorithm is greater than linear, the overall performance is affordable given the limited length of words.

I found the paper to be well written, with particularly useful illustrations of the proposed techniques.
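
For readers unfamiliar with the techniques named in the review, the following is a minimal sketch of profile features and dynamic time warping on a binary glyph image. It is not the authors' implementation. The function names (`profiles`, `transitions`, `dtw`), the 0/1 image encoding, and the absolute-difference cost are assumptions made for illustration, and the paper's exact feature definitions and matching costs may differ.

```python
from typing import Dict, List, Sequence

Image = List[List[int]]  # rows of pixels; 1 = ink, 0 = background (assumed encoding)


def transitions(line: Sequence[int]) -> int:
    """Count background/ink changes along one slice of pixels."""
    return sum(1 for a, b in zip(line, line[1:]) if a != b)


def profiles(img: Image) -> Dict[str, List[int]]:
    """Compute six illustrative profile sequences for a binary glyph image."""
    height, width = len(img), len(img[0])
    cols = [[img[r][c] for r in range(height)] for c in range(width)]
    # Envelopes: distance of the first ink pixel from the top / bottom of
    # each column; `height` marks a column that contains no ink.
    upper = [col.index(1) if 1 in col else height for col in cols]
    lower = [col[::-1].index(1) if 1 in col else height for col in cols]
    return {
        "upper_envelope": upper,
        "lower_envelope": lower,
        "vertical_projection": [sum(col) for col in cols],
        "horizontal_projection": [sum(row) for row in img],
        "column_transitions": [transitions(col) for col in cols],
        "row_transitions": [transitions(row) for row in img],
    }


def dtw(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


glyph = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
features = profiles(glyph)
# A horizontally stretched copy of a profile still matches at zero cost.
stretched_distance = dtw(features["upper_envelope"], [1, 0, 0, 1])
```

Because the warping path may repeat elements of either sequence, a letter image that has been stretched or compressed horizontally can still align closely with its prototype, which is the property that makes this family of methods attractive for distorted print.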

Reviewer: Michael Lesk	Review #: CR141008 (1306-0553)

Document Analysis (I.7.5 ... )

Pattern Matching (F.2.2 ... )

Digital Libraries (H.3.7)

Information Search And Retrieval (H.3.3)

Pattern Recognition (I.5)

Other reviews under "Document Analysis":	Date

Generating indicative-informative summaries with SumUM Saggion H., Lapalme G. Computational Linguistics 28(4): 497-526, 2002. Type: Article	Jun 20 2003

Parameter-Free Geometric Document Layout Analysis Lee S., Ryu D. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(11): 1240-1256, 2001. Type: Article	Jul 26 2002

A hierarchical neural network document classifier with linguistic feature selection Chen C., Lee H., Hwang C. Applied Intelligence 23(3): 277-294, 2005. Type: Article	Aug 2 2006

Reproduction in whole or in part without permission is prohibited. Copyright 1999-2024 ThinkLoud®