Basic Processing Model


Figure 4.1 shows the modules of the classifier. The ccomp program generates the black and white connected components (CC) from a TIFF image file, writing one CC file for black and one for white. The clas program then reads the two CC files, calculates the features, applies the classification rules, and writes the results file, from which the reports are extracted.

The accuracy value from the OCR processing of the image is used only to generate the output tables; the classifier's logic does not use it.

Figure 4.11: Classifier Logic Architecture

To automate the testing of a large number of images, the procedure is as follows:

Create a list of all the image-names that the test dataset will contain.
Iterate over this list, generating the connected components data files (two per image: one black, one white).
Iterate over this list classifying each page and outputting the results to one file.
Derive tables from the output file and the (independently tested) OCR accuracy values.
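
The two iteration passes above can be sketched as a small driver. This is only an illustration, written in Python rather than the Perl of the actual scripts. The argument order of ccomp and clas, the `.black.cc` / `.white.cc` file names, and the assumption that clas prints its result on standard output are all hypothetical; the text names the programs but not their command-line interfaces.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def ccomp_command(image: str) -> List[str]:
    """Command that would generate the two CC files for one image.

    The argument order and output file names are assumptions.
    """
    stem = Path(image).stem
    return ["ccomp", image, f"{stem}.black.cc", f"{stem}.white.cc"]


def clas_command(image: str) -> List[str]:
    """Command that would classify one page from its two CC files (assumed form)."""
    stem = Path(image).stem
    return ["clas", f"{stem}.black.cc", f"{stem}.white.cc"]


def run_batch(
    images: Iterable[str],
    results_path: str,
    execute: Optional[Callable[[List[str]], str]] = None,
) -> None:
    """Run both passes over the image list and collect all results in one file.

    `execute` takes a command list and returns the command's standard output.
    By default it runs the command for real; pass a stub to dry-run the plan.
    """
    if execute is None:
        def execute(cmd: List[str]) -> str:
            done = subprocess.run(cmd, check=True, capture_output=True, text=True)
            return done.stdout

    images = list(images)

    # First pass: connected components, two files per image.
    for image in images:
        execute(ccomp_command(image))

    # Second pass: classify each page, appending every result to one file.
    with open(results_path, "w") as results:
        for image in images:
            results.write(execute(clas_command(image)))
```

Passing a stub for `execute` that only records the commands makes it possible to inspect the planned run without the ccomp and clas binaries being present.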

The reports and confusion matrices are generated automatically from the results file by scripts written in the Perl programming language [17]. The connected components finder and the feature extractor that works from the CC data are both written in C. The whole process is driven by a Perl script, which produces the results file.
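
As an illustration of how a confusion matrix can be derived from a results file, the sketch below counts (actual, predicted) pairs. It is written in Python, not the Perl of the original scripts. The layout of the results file is not given in the text, so the format used here is an assumption: one whitespace-separated line per page holding the image name, the actual class, and the predicted class. The class labels in the usage example are likewise hypothetical.

```python
from collections import Counter
from typing import Iterable


def confusion_matrix(lines: Iterable[str]) -> Counter:
    """Count (actual, predicted) pairs from results lines.

    Assumed line format: "<image> <actual_class> <predicted_class>".
    Lines that do not have exactly three fields are skipped.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue
        _image, actual, predicted = fields
        counts[(actual, predicted)] += 1
    return counts


def format_matrix(counts: Counter) -> str:
    """Render the counts as a tab-separated table, one row per actual class."""
    labels = sorted({label for pair in counts for label in pair})
    rows = ["\t".join(["actual/predicted"] + labels)]
    for actual in labels:
        cells = [str(counts[(actual, predicted)]) for predicted in labels]
        rows.append("\t".join([actual] + cells))
    return "\n".join(rows)
```

For example, with hypothetical labels `good` and `bad`, `format_matrix(confusion_matrix(open("results.txt")))` would return a table with one row and one column per label, where the diagonal cells hold the correctly classified pages.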