The authors of this paper provide a satisfying read about named entity recognition (NER) in noisy text produced by optical character recognition (OCR). They deliver on their promise of providing answers to many questions that researchers in this area might have.
Packer et al. draw many interesting conclusions about performing the difficult task of extracting names from noisy scanned documents: “Word order errors can play a bigger role in poor extraction performance than character recognition errors”; “The knowledge-based approaches performed better than the machine learning (ML) approaches”; and “Combining basic extraction methods can produce higher quality NER.”
Regarding the conclusion about machine learning approaches, ML lovers need not despair. The authors point out two ways to overcome the deficiencies of the ML approaches: either apply a more realistic noise model of OCR errors to the Conference on Computational Natural Language Learning (CoNLL) training data, or use semi-supervised ML techniques to take advantage of the large number of unlabeled documents.
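To make the first of these suggestions concrete, here is a minimal sketch of what injecting OCR-style noise into labeled training tokens might look like. It is not the authors' noise model: the confusion table, the function name `add_ocr_noise`, and the `rate` and `seed` parameters are all assumptions made for illustration, and a realistic model would presumably be estimated from actual OCR output and would also cover the word order errors the paper highlights.

```python
import random

# Hypothetical character confusions, chosen for illustration only;
# they are not taken from the paper under review.
CONFUSIONS = {"l": "1", "O": "0", "m": "rn", "e": "c", "S": "5"}


def add_ocr_noise(tokens, labels, rate=0.1, seed=0):
    """Return a noisy copy of the tokens, with the labels left unchanged.

    Each confusable character is replaced with probability `rate`.
    The labels stay aligned because tokens are never split or merged.
    """
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    noisy = []
    for token in tokens:
        chars = [
            CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < rate else ch
            for ch in token
        ]
        noisy.append("".join(chars))
    return noisy, list(labels)


if __name__ == "__main__":
    tokens = ["Samuel", "Olsen", "arrived"]
    labels = ["B-PER", "I-PER", "O"]
    print(add_ocr_noise(tokens, labels, rate=0.3))
```

Training a tagger on both the clean and the corrupted copies of the data is one plausible way to use such a function; whether that closes the gap with the knowledge-based approaches is a question the paper leaves open.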