To limit the scope of this project, the research addresses only a subset of the documents that are normally processed by OCR devices. This subset nevertheless accounts for a major portion of typical OCR pages, and the results of this work can be extended to handle a much more varied set of pages.
The pages for which the classifier will be designed have the following characteristics:
This work will consider a page to be ``Good'' if its median OCR accuracy (calculated over the accuracies obtained from different OCR devices) is equal to or higher than 90%. Conversely, a page will be labeled ``Bad'' if its median accuracy falls below this 90% threshold.
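
The labeling rule above can be illustrated with a minimal sketch. The function name `label_page`, the constant `GOOD_THRESHOLD`, and the representation of accuracies as percentages in a list are assumptions made for this illustration; they are not taken from the original work.

```python
from statistics import median

# Threshold from the text: median accuracy of 90% or higher means "Good".
GOOD_THRESHOLD = 90.0  # percent


def label_page(accuracies):
    """Label a page "Good" or "Bad" from its per-device OCR accuracies.

    `accuracies` is assumed to hold one accuracy value (in percent)
    per OCR device for the same page.
    """
    if not accuracies:
        raise ValueError("at least one OCR accuracy is required")
    # The median is used, as in the text, so a single device that performs
    # unusually well or badly does not decide the label on its own.
    return "Good" if median(accuracies) >= GOOD_THRESHOLD else "Bad"


if __name__ == "__main__":
    # Hypothetical accuracies for one page from three OCR devices.
    print(label_page([95.0, 88.0, 91.5]))
```

A page whose median sits exactly at 90% is labeled ``Good'', which matches the "equal to or higher than" wording of the definition.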