New Datasets

This work concentrated on only one type of page. To produce a full-blown system, a more heterogeneous dataset must be devised. Among the kinds of data that would certainly be needed are:

Faxed documents. The poor resolution and the amount of line noise introduced in faxed documents make this type of data ideal for page quality classification. Research is underway at ISRI to address this issue.
Foreign language documents. In any classifier based on ``normal'' vs. ``abnormal'' ratios, any change in the ``correct'' character set can be a major problem, since the ratios can and often do change. It is therefore of great importance to tune the classifier for the language of choice. Note that since lexicographical features were not used, nor was the OCR output relied upon, the changes needed for alphabets whose characters are highly similar to those of English (such as Spanish) should be minimal. For other languages (e.g., Japanese, Chinese, Arabic), however, different features will have to be studied, since the characters in these languages are radically different from the ones used in English.