Summary

With a reject region in place to filter out tables and small zones, the classifier correctly detected every page in the Test Dataset with OCR accuracy below 90%. Some Good pages were misclassified as Bad, but the overall error rate remained consistently below 15%.

The reject region was needed for two reasons. First, the current version of the classifier cannot detect defects in images containing very few connected components (characters). Second, OCR errors generated by tables do not appear to depend directly on image quality.
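The reject logic described above can be sketched as a simple pre-filter. This is only an illustration under stated assumptions: the names `Page`, `in_reject_region`, and `split_pages` are hypothetical, the table flag is assumed to come from a separate zoning step, and the cutoff `MIN_COMPONENTS` is a placeholder value, since the minimum component count is not given here.

```python
from dataclasses import dataclass

# Placeholder cutoff (assumption): the actual minimum number of
# connected components is not stated in this summary.
MIN_COMPONENTS = 200


@dataclass
class Page:
    page_id: str
    num_components: int  # connected components (roughly, characters) on the page image
    has_table: bool      # assumed to be supplied by a separate zoning step


def in_reject_region(page: Page, min_components: int = MIN_COMPONENTS) -> bool:
    """Return True when the page should be withheld from the classifier."""
    return page.has_table or page.num_components < min_components


def split_pages(pages, min_components: int = MIN_COMPONENTS):
    """Partition pages into (accepted, rejected) lists, preserving input order."""
    accepted, rejected = [], []
    for page in pages:
        if in_reject_region(page, min_components):
            rejected.append(page)
        else:
            accepted.append(page)
    return accepted, rejected
```

Only the accepted pages would then be passed on to the quality classifier; rejected pages receive no Good or Bad label.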

The system degraded gracefully as the cutoff threshold separating ``good'' from ``bad'' labels was raised. This is to be expected, mainly because the classifier uses only simple features. A more complex approach is needed to differentiate accuracies in the region of 95% and above.
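One way to examine this behaviour is to recompute the error rates while the cutoff is moved. The sketch below is a minimal illustration, not the evaluation procedure used in this work: the function name `error_rates` is hypothetical, and it assumes each page is available as a pair of its measured OCR accuracy (in percent) and the classifier's Bad/Good decision.

```python
def error_rates(pages, threshold):
    """Compute error rates for one cutoff threshold.

    pages: iterable of (ocr_accuracy_percent, predicted_bad) pairs.
    A page is truly Bad when its OCR accuracy is below `threshold`.

    Returns (missed_bad, false_alarm, overall), each expressed as a
    fraction of the total number of pages.
    """
    pages = list(pages)
    if not pages:
        raise ValueError("at least one page is required")

    # Truly Bad pages that the classifier labeled Good.
    missed = sum(1 for acc, pred_bad in pages if acc < threshold and not pred_bad)
    # Truly Good pages that the classifier labeled Bad.
    false_alarm = sum(1 for acc, pred_bad in pages if acc >= threshold and pred_bad)

    n = len(pages)
    return missed / n, false_alarm / n, (missed + false_alarm) / n
```

Calling `error_rates` for a series of thresholds (for example 90, 95, and 98) on the same set of pages gives one error-rate triple per cutoff, which makes the degradation at higher thresholds directly comparable.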

The classifier was also run on a completely different dataset, the Magazine dataset, where it performed flawlessly in filtering out bad pages at the 90% threshold.

The simple features selected have proven useful for detecting image quality up to a certain level of detail. The results indicate that the classifier logic is applicable not only to pages of the type it was created for, but also to other types of pages and possibly to all pages. Further testing is required to validate this last hypothesis, and improved features would be required to increase the level of detail the classifier is able to detect.