Other Approaches

Other approaches to the measurement of print quality for OCR systems include:

Physical approaches. Throssell and Fryer [16] and Bohner et al. [2] proposed mechanical systems to measure print quality as defined by ISO Recommendation 1831 (1968). Both works date back to the mid-1970s, when OCR systems were rarely used outside commercial and financial institutions. As a result, both papers concentrated on ways to define print quality for the OCR-A and OCR-B character sets. Their approach is to construct a high-resolution scanning device that calculates Print Contrast Signal values, which are then used to rate each individual character according to the ISO recommendation. These approaches are not practical for current OCR needs, not only because of the cost of building such special scanning devices, but also because current OCR environments are omnifont: both approaches support only a very limited selection of fonts, and the fonts used must be known beforehand.
Using OCR output. A popular way to estimate page difficulty for OCR is simply to process the image first and then use the reject and/or suspect markers in the OCR output to estimate page quality. The drawback of this approach is that it depends entirely on the OCR device being used, and in particular on that device's ability to produce reject/suspect markers. If the OCR device does not produce such markers, or produces them unreliably, this method is useless.
Using spell checkers. Another approach to estimating page quality would be to run the OCR output through a spell checker and count how many words are not found in the dictionary. The problem with this approach is that, for many types of data, correctly recognized words will not be found in the lexicon. Proper names, acronyms, and numerical data are all examples of data that cannot be verified by simple lexicon lookup. As a result, a metric based on, for instance, the number of non-found words would underestimate the accuracy when presented with this "non-standard" type of data.
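
The two output-based estimates above can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the function names, the "~" reject character, and the tiny lexicon are all assumptions made for illustration, since real OCR devices differ in how (and whether) they mark rejected characters.

```python
import re

# Assumption: the OCR device marks each rejected character with "~".
# Real devices use different markers, or none at all.
REJECT_MARKER = "~"


def reject_rate(ocr_text, marker=REJECT_MARKER):
    """Fraction of non-whitespace characters that are reject markers."""
    chars = [c for c in ocr_text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c == marker for c in chars) / len(chars)


def non_found_word_rate(ocr_text, lexicon):
    """Fraction of word tokens that do not appear in the lexicon."""
    words = re.findall(r"[A-Za-z0-9]+", ocr_text)
    if not words:
        return 0.0
    missing = [w for w in words if w.lower() not in lexicon]
    return len(missing) / len(words)


# Hypothetical lexicon, kept tiny for the example.
lexicon = {"shipped", "units", "to"}

# Two of ten non-space characters were rejected by the (assumed) device.
marker_estimate = reject_rate("he~lo w~rld")

# This line is recognized perfectly, yet half of its tokens (an acronym,
# a number, and a proper name) are missing from the lexicon.
lexicon_estimate = non_found_word_rate(
    "IBM shipped 4381 units to Throssell", lexicon
)
```

In this sketch the second estimate reports half of the words as suspicious even though the text contains no recognition errors, which is the underestimation problem described for spell-checker-based metrics.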
