Computing Reviews
Word spotting in historical printed documents using shape and sequence comparisons
Khurshid K., Faure C., Vincent N. Pattern Recognition 45(7): 2598-2609, 2012. Type: Article
Date Reviewed: Mar 11 2013

A detailed and practical description of methods for character and word recognition in badly printed materials is presented in this paper. The authors compare their work with commercial optical character recognition software and find that their method is able to recognize substantially more words than the commercial products when processing smudged or otherwise poorly printed texts.

The most difficult problem they solve is segmenting text that was badly printed, or that deteriorated before it could be scanned, so that individual letters are not cleanly separated by white space. The authors use feature detection to recognize when a single shape must be made up of overlapping shapes, and a shape equivalent of dynamic time warping to recognize distorted letter images. Six features characterize each letter; these include the upper and lower envelopes, projections on both axes, and the number of transitions across vertical and horizontal slices through the image. Two approaches to combining adjacent letters are compared: edit distance and a linear matching algorithm that does not consider rearrangements. On words of eight or more letters, linear matching is perhaps eight times faster, with only a slight loss of accuracy.
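For readers unfamiliar with dynamic time warping, the idea of comparing distorted letter images can be illustrated with a minimal sketch. This is not the authors' implementation: the function names column_features and dtw_distance, the choice of only two per-column features (ink count and number of ink/background transitions), and the squared Euclidean local cost are all assumptions made for illustration.

```python
from math import inf


def column_features(image):
    """Return one (ink_count, transitions) tuple per pixel column.

    `image` is a list of rows, each a list of 0/1 values (1 = ink).
    Both features are illustrative stand-ins, not the paper's feature set.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    features = []
    for x in range(width):
        column = [image[y][x] for y in range(height)]
        ink = sum(column)
        # Count changes between ink and background going down the column.
        transitions = sum(1 for a, b in zip(column, column[1:]) if a != b)
        features.append((ink, transitions))
    return features


def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping between two feature sequences.

    Allowed steps are (i-1, j), (i, j-1) and (i-1, j-1); the local cost
    is the squared Euclidean distance between feature tuples.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        return 0.0 if n == m else inf
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # aligning two empty prefixes costs nothing
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = sum((p - q) ** 2 for p, q in zip(seq_a[i - 1], seq_b[j - 1]))
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


# A small cross-shaped glyph, a horizontally stretched copy, and a solid block.
glyph = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 0]]
stretched = [[0, 0, 1, 1, 0, 0],
             [1, 1, 1, 1, 1, 1],
             [0, 0, 1, 1, 0, 0]]
block = [[1, 1, 1],
         [1, 1, 1],
         [1, 1, 1]]

same = dtw_distance(column_features(glyph), column_features(stretched))
different = dtw_distance(column_features(glyph), column_features(block))
```

In this toy case the stretched glyph aligns with the original at zero cost, because warping absorbs the repeated columns, while the solid block does not. Real degraded print would need richer features and a tolerance threshold rather than an exact match.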

A careful evaluation was undertaken using twelve 19th-century books and three 16th-century books. The improvement over commercial software was much greater for the older books, although accuracy on those books was, of course, lower. Although the computational cost of the segmentation algorithm is greater than linear, the overall performance is affordable because words are of limited length.

I found the paper to be well written, with particularly useful illustrations of the proposed techniques.

Reviewer:  Michael Lesk Review #: CR141008 (1306-0553)
 
Document Analysis (I.7.5 ... )
Pattern Matching (F.2.2 ... )
Digital Libraries (H.3.7)
Information Search And Retrieval (H.3.3)
Pattern Recognition (I.5)
