New Datasets
This work concentrated on only one type of page. To produce a
full-blown system, a more heterogeneous dataset must be devised.
Among the kinds of data that would certainly be needed are:
- Faxed documents. The poor resolution and the amount of
line noise introduced in faxed documents make this type of data ideal
for page quality classification. Research is underway at ISRI to address
this issue.
- Foreign-language documents. In any classifier based on
``normal'' vs. ``abnormal'' ratios, any change in the ``correct''
character set can be a major problem, since the ratios can and often
do change. It is therefore of great importance to tune the classifier
for the language of choice. Note that since lexicographical features
were not used, nor was the OCR output relied upon, changes for
alphabets with characters highly similar to English (like Spanish)
should be minimal. For other languages (e.g., Japanese, Chinese,
Arabic), however, different features will have to be studied, since the
characters in these languages are radically different from the ones used
in English.