Assumptions



To limit the scope of this project, the research addresses only a subset of the documents that are normally processed by OCR devices. The selected subset nevertheless accounts for a major portion of typical OCR pages, and the results of this work can be extended to handle a much more varied set of pages.

The pages for which the classifier is designed have the following characteristics:

White background and black letters (no color)
Previously segmented pages. The pages have been manually segmented into ``text'', ``table'', ``caption'', ``header/footer'', and other types of zones depending on the contents. The classifier presented in this thesis extracts its features from ``text'' zones only.
No artistic fonts

This work will consider a page to be ``Good'' if its median OCR accuracy (calculated from a set of accuracies obtained from different OCR devices) is equal to or higher than 90%. Conversely, a page will be labeled ``Bad'' if its median accuracy falls below this 90% threshold.
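
The labeling rule above can be illustrated with a minimal sketch. The function name `label_page`, the constant `GOOD_THRESHOLD`, and the sample accuracy values are hypothetical and are not taken from the thesis. The sketch assumes accuracies are expressed as percentages and that, for an even number of devices, the median is the mean of the two middle values.

```python
from statistics import median

# Threshold stated in the text: a median accuracy of 90% or more is "Good".
GOOD_THRESHOLD = 90.0


def label_page(accuracies):
    """Label a page "Good" or "Bad" from per-device OCR accuracies (percent).

    Assumption: one accuracy value per OCR device, on a 0-100 scale.
    """
    if not accuracies:
        raise ValueError("at least one OCR accuracy is required")
    # statistics.median averages the two middle values when the count is even.
    return "Good" if median(accuracies) >= GOOD_THRESHOLD else "Bad"


if __name__ == "__main__":
    # Illustrative values only, not measurements from the thesis.
    print(label_page([95.2, 88.0, 91.5]))        # median is 91.5
    print(label_page([89.9, 92.0, 85.0, 90.0]))  # median is 89.95
```

Using the median rather than the mean keeps a single unusually poor or unusually good device from deciding the label on its own.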