Problem Description

The objective of this research is to develop a classifier that predicts OCR accuracy from measured image defects. In other words, the classifier measures image quality from an OCR perspective.

Ideally, such a quality metric would output the actual accuracy that any given OCR device would attain on the page. Predicting that value is a very complex problem, both because OCR technology is progressing rapidly and because of the number and complexity of the features such a task requires. This work therefore concentrates on the design and development of a binary classifier. The output of the system is a label of ``Good'' or ``Bad'', depending on the accuracy the page would attain if processed by an OCR device. ``Good'' means the page image is clean and has an expected OCR accuracy of at least 90%, whereas ``Bad'' means the page image may contain varying degrees of noise and has an expected OCR accuracy below 90%.
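The labeling rule can be illustrated with a minimal sketch. It only shows how a known OCR accuracy maps to the two target labels, for example when preparing ground truth; it is not the classifier itself, which must predict the label from image defects. The names `GOOD_THRESHOLD` and `label_page` are hypothetical, and expressing accuracy as a percentage between 0 and 100 is an assumption.

```python
# Hypothetical constant: the 90% boundary stated in the problem description.
GOOD_THRESHOLD = 90.0


def label_page(ocr_accuracy_percent: float) -> str:
    """Map an OCR accuracy (assumed to be a percentage, 0-100) to a label.

    Returns "Good" for an accuracy of at least 90%, and "Bad" otherwise.
    """
    if not 0.0 <= ocr_accuracy_percent <= 100.0:
        raise ValueError("accuracy must be a percentage between 0 and 100")
    return "Good" if ocr_accuracy_percent >= GOOD_THRESHOLD else "Bad"


if __name__ == "__main__":
    # Illustrative values only; these are not measured results.
    for accuracy in (98.5, 90.0, 89.9, 42.0):
        print(f"{accuracy:5.1f}% -> {label_page(accuracy)}")
```

Note that a page at exactly 90% falls in the ``Good'' class, which follows from the ``at least 90%'' wording.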

Assumptions