Title Page




Evaluation of Page Quality
Using Simple Features

by

Luis R. Blando


A thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science
in
Computer Science
Department of Computer Science
University of Nevada, Las Vegas
November 1994

© 1994 Luis R. Blando
All Rights Reserved



ABSTRACT

A classifier to determine page quality from an Optical Character Recognition (OCR) perspective is developed. It classifies a given page image as either "good" (i.e., high OCR accuracy is expected) or "bad" (i.e., low OCR accuracy is expected). The classifier is based upon measuring the amount of white speckle, the amount of broken pieces, and the overall size information in the page. Two different sets of test data were used to evaluate the classifier: the Sample 2 dataset containing 439 pages and the Magazines dataset containing 200 pages. The classifier recognized 85% of the pages in the Sample 2 dataset correctly. However, approximately 40% of the low-quality pages were misclassified as "good." To solve this problem, the classifier was modified to reject pages containing tables or fewer than 200 connected components. The modified classifier rejected 41% of the pages, correctly recognized 86% of the remaining pages, and did not misclassify any low-quality page as "good." Similarly, without rejecting any pages, it recognized 86.5% of the pages in the Magazines dataset correctly and did not misclassify any low-quality page as "good."
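
The reject rule and the three figures reported for it (reject rate, accuracy on the remaining pages, and the number of low-quality pages labeled "good") can be illustrated with a minimal sketch. Only the 200-connected-component threshold and the table condition come from the abstract; the `Page` record, the function names, and the `base_classifier` argument are hypothetical stand-ins, and the speckle, broken-piece, and size measurements themselves are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Threshold stated in the abstract: pages with fewer components are rejected.
MIN_COMPONENTS = 200


@dataclass
class Page:
    """Hypothetical per-page record; field names are illustrative only."""
    n_components: int  # number of connected components found on the page
    has_table: bool    # True if the page contains a table
    truth: str         # ground-truth quality label: "good" or "bad"


def classify_with_reject(page: Page,
                         base_classifier: Callable[[Page], str]) -> Optional[str]:
    """Return "good" or "bad", or None when the page is rejected."""
    if page.has_table or page.n_components < MIN_COMPONENTS:
        return None
    return base_classifier(page)


def evaluate(pages: Iterable[Page],
             base_classifier: Callable[[Page], str]) -> dict:
    """Compute reject rate, accuracy on accepted pages, and bad-as-good count."""
    pages = list(pages)
    rejected = accepted = correct = bad_as_good = 0
    for page in pages:
        label = classify_with_reject(page, base_classifier)
        if label is None:
            rejected += 1
            continue
        accepted += 1
        if label == page.truth:
            correct += 1
        elif page.truth == "bad" and label == "good":
            bad_as_good += 1
    return {
        "reject_rate": rejected / len(pages) if pages else 0.0,
        "accuracy_on_accepted": correct / accepted if accepted else 0.0,
        "bad_as_good": bad_as_good,
    }
```

Any function that maps a `Page` to "good" or "bad" can be passed as `base_classifier`; accuracy is computed only over pages that were not rejected, which matches how the abstract reports "the remaining pages."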



