Other Approaches
Other approaches to the measurement of print quality for OCR systems include:
- Physical approaches. Throssell and Fryer [16] and
Bohner et al. [2] proposed mechanical systems to measure print
quality as defined by ISO Recommendation 1831 (1968). Both works
date back to the mid-1970s, when OCR systems were rarely used outside
commercial and financial institutions. As a result, both papers
concentrated on ways to define print quality for the OCR-A and OCR-B
character sets. Their approach is to construct a high-resolution
scanning device that calculates Print Contrast Signal values, which are
then used to rate each individual character according to the ISO
recommendation. These approaches are not practical for current OCR
needs, not only because of the cost of building these
special scanning devices, but also because current OCR environments
are omnifont: both approaches support only a very limited selection
of fonts, and the fonts used must be known beforehand.
- Using OCR output. A popular way to estimate page
difficulty is simply to process the image first and
then use the reject and/or suspect markers in the OCR output to
estimate page quality. The drawback of this approach is that it is
completely dependent on the OCR device being used and, in
particular, on that device's ability to produce reject/suspect
markers. If the OCR device does not produce such markers, or
produces them unreliably, this method is useless.
- Using spell checkers. Another way to estimate page
quality is to run a spell checker over the OCR output and
count how many words are not found in the dictionary. The problem with
this approach is that, for many types of data, correctly recognized
words will not be found in the lexicon. Proper names, acronyms, and
numerical data are all examples of data that cannot be checked by
lexicon lookup alone. As a result, a metric that measures, for
instance, the number of non-found words would underestimate the
accuracy when presented with this "non-standard" type of data.
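
The two output-based estimates described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular OCR device or spell checker: the marker characters `~` and `^`, the function names `marker_rate` and `lexicon_miss_rate`, and the tiny lexicon are all hypothetical.

```python
# Hypothetical marker characters; real OCR devices use their own conventions.
REJECT_MARK = "~"   # assumed: replaces a character the device could not recognize
SUSPECT_MARK = "^"  # assumed: flags a low-confidence character


def marker_rate(ocr_text, marks=(REJECT_MARK, SUSPECT_MARK)):
    """Fraction of non-whitespace characters that are reject/suspect markers."""
    chars = [c for c in ocr_text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in marks for c in chars) / len(chars)


def lexicon_miss_rate(ocr_text, lexicon):
    """Fraction of whitespace-separated tokens that are not in the lexicon."""
    # Strip surrounding punctuation and fold case before the lookup.
    tokens = [t.strip(".,;:!?()\"'").lower() for t in ocr_text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t not in lexicon for t in tokens) / len(tokens)


# A toy lexicon, assumed for illustration only.
lexicon = {"the", "total", "for", "is", "due", "in"}

clean_prose = "The total is due."
clean_data = "IBM 1974 Throssell"  # correctly recognized, yet absent from the lexicon

prose_miss = lexicon_miss_rate(clean_prose, lexicon)
data_miss = lexicon_miss_rate(clean_data, lexicon)
```

Here `clean_data` contains no recognition errors, yet every token is counted as a miss, which is the underestimation problem noted for spell checkers. Likewise, `marker_rate` returns zero for any device that emits no markers, regardless of how poor the page actually is.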