A first attempt at a page quality estimator has been presented. The classifier is based upon measuring simple features from the connected components' data of a page image. Only parts of the image that contain textual information were used to design and train the classifier.
The features used are the white speckle factor, the broken zone factor, and size information. The white speckle factor measures the density of small white connected components and is designed to capture minimally open lakes in pages with very bold/fat characters. The broken zone factor measures the coverage of an area in the width-height map of the connected components' data, presumably populated by the broken pieces of the characters in a broken-characters page. The size information metrics measure the maximum and average black and white connected component sizes, as well as the ratio of black to white connected components, in order to rule out pages with unusually large fonts and/or inverse video.
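The following Python sketch illustrates how features of this kind could be computed from a list of connected components. The component representation, size cutoffs, and zone bounds are illustrative assumptions, not the parameters used in this work.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ConnectedComponent:
    width: int      # bounding-box width in pixels
    height: int     # bounding-box height in pixels
    is_black: bool  # True for black (ink) components, False for white ones

def white_speckle_factor(ccs: List[ConnectedComponent], max_dim: int = 3) -> float:
    """Density of very small white components (hypothetical size cutoff)."""
    white = [c for c in ccs if not c.is_black]
    if not white:
        return 0.0
    small = [c for c in white if c.width <= max_dim and c.height <= max_dim]
    return len(small) / len(white)

def broken_zone_factor(ccs: List[ConnectedComponent],
                       zone_w: int = 6, zone_h: int = 6) -> float:
    """Fraction of cells occupied in a small zone of the width-height map,
    where fragments of broken characters tend to accumulate."""
    occupied = {(c.width, c.height) for c in ccs
                if c.is_black and c.width <= zone_w and c.height <= zone_h}
    return len(occupied) / float(zone_w * zone_h)

def size_information(ccs: List[ConnectedComponent]) -> dict:
    """Maximum and average sizes per colour plus the black/white count ratio."""
    black = [c.width * c.height for c in ccs if c.is_black]
    white = [c.width * c.height for c in ccs if not c.is_black]
    return {
        "max_black": max(black, default=0),
        "avg_black": sum(black) / len(black) if black else 0.0,
        "max_white": max(white, default=0),
        "avg_white": sum(white) / len(white) if white else 0.0,
        "black_to_white_ratio": len(black) / len(white) if white else float("inf"),
    }
```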
After testing the classifier on a 439-page test dataset, it was observed that tables were not correctly processed. Further study of the table pages showed that tables present special difficulty for OCR algorithms. Specifically, a great difference in performance on tables among OCR systems was discovered. The need for a table-specific OCR accuracy evaluation model was acknowledged, and the importance of numeric data in table-related OCR errors was emphasized. Because of these table problems, a reject region was established to filter out all pages with tables. After that, the classifier was able to correctly filter out all bad pages, with the exception of four. After evaluating these four pages, a new reject region based on the number of connected components was defined to filter out all pages with fewer than 200 connected components. The classifier is based on computing densities and ratios and, therefore, needs a minimum amount of data to work reliably.
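A minimal sketch of the two reject rules described above, assuming a page record that already carries a table flag and a component count (both hypothetical; table detection itself is not shown):

```python
from dataclasses import dataclass

MIN_COMPONENTS = 200  # densities and ratios are unreliable below this count

@dataclass
class PageStats:
    has_table: bool      # whether a table was detected on the page
    num_components: int  # total number of connected components found

def in_reject_region(page: PageStats) -> bool:
    """Send a page to the reject region instead of the good/bad decision."""
    if page.has_table:                         # OCR performance on tables is too uneven
        return True
    if page.num_components < MIN_COMPONENTS:   # too little data for density/ratio features
        return True
    return False
```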
From the analysis of the results it was observed that the white speckle factor is not a very robust feature, since it breaks down in the presence of small fonts and other special circumstances. The need for a better metric was therefore recognized. The broken character zone factor, on the other hand, proved to be very robust and worked fairly well in all the tests performed. Evaluation of the size information rules showed that one of them was never triggered to detect bad pages and was the cause of many misclassifications. Removing this rule could enhance the classifier.
A new dataset was put together to further test the classifier. Two hundred magazine pages, with radically different characteristics from pages in the previous dataset, were assembled and processed by the classifier. The pages were processed with and without one of the size information rules. The results showed that, for this particular dataset, the rule correctly classifies bad pages as such and did not produce any misclassifications. On the other hand, after removing the rule and re-testing, errors were introduced in the classification. This evidence suggests that the classifier, as designed, is data dependent and must be tuned for the kind of data it will be processing.
Testing was performed assuming that a good page has a median OCR accuracy of 90% and above; the results described here are based upon this assumption. However, testing was also performed at the 95% and 98% levels. In these cases, the performance of the classifier degraded gracefully. The conjecture is that more complex features are required to identify subtler image defects and, even then, there are errors that are not related to image quality that will not be captured by the classifier.
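A small sketch of the labelling criterion, assuming the accuracy values for a page (e.g. from the different OCR engines) are already available; the function and argument names are hypothetical:

```python
import statistics

def label_page(ocr_accuracies, threshold=0.90):
    """Label a page 'good' when its median OCR accuracy meets the threshold.

    `ocr_accuracies` holds the per-engine accuracy values for the page;
    0.95 and 0.98 are the alternative thresholds that were also tested.
    """
    return "good" if statistics.median(ocr_accuracies) >= threshold else "bad"

# Example: a page read at 0.93, 0.91, and 0.88 has a median of 0.91 -> good.
print(label_page([0.93, 0.91, 0.88]))
```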
A number of features that were observed but not used are presented. Among them, the black density, micro-gap detection and skew are very promising page-quality related indicators. Similarly, testing on fax and foreign language documents is identified as a requirement for producing a full-blown image quality classifier. For this purpose, a heuristic binary decision system such as the one presented is not sophisticated enough. A more standard statistical approach should be taken.
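As one illustration of such a statistical approach, the sketch below fits a logistic regression over a page feature vector and outputs a probability of the page being bad rather than a hard heuristic decision. Logistic regression is only one possible choice, and the feature matrix here is synthetic placeholder data, not measurements from the datasets used in this work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder data: 300 pages x 7 features (white speckle factor, broken zone
# factor, max/avg black size, max/avg white size, black/white ratio).
# Real training would use features measured on labelled pages.
X = rng.random((300, 7))
y = (X[:, 1] + 0.3 * X[:, 0] > 0.8).astype(int)  # synthetic "bad page" labels

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Probability that a new page is bad, instead of a hard accept/reject rule.
new_page = rng.random((1, 7))
print(model.predict_proba(new_page)[0, 1])
```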
This work has presented an image-quality estimator based on very simple features. The classifier was able to attain a stable error rate of approximately 14% on two different datasets. The contribution of this research lies in being a first attempt at such a classifier and in establishing a base for future research to build on.