Error Analysis

The 15 ``B->G'' misclassifications (Table 4.2) were carefully examined. The page images did not show substantial degradation, which is consistent with the classifier's ``Good'' labels. Figure 4.2 shows excerpts from some of these images, where the quality can be appreciated.

Figure 4.2: Excerpts from B->G misclassified images (magnified)

All but 4 of these 15 ``problematic'' pages contained tables. The OCR output for many of these tables was illegible and generally useless. Figures 4.3 and 4.4 show a clean table image and its associated OCR output. Tables pose special problems for OCR devices.

Figure 4.3: Clean table image

The classifier labeled all 11 of these table-containing pages ``Good'' because, from an image-defect point of view, it could not find enough evidence to assign a ``Bad'' label. The contents of these pages, however, suggest that table-generated OCR errors are of a special nature, unrelated to image-generated OCR errors. After evaluating all 11 ``B->G'' misclassified ``table pages'', the following observations are in order:

Table Observation 1.
There is a marked difference in OCR performance among OCR devices when handling pages with tables. Table 4.4 lists the 11 ``B->G'' misclassified pages. It can be seen that, in general, the devices in the left part of the table do much better than those in the right part. Because of this variability, we can no longer assume that a page with low median accuracy is a bad page, since the OCR results can differ radically depending on the OCR device used.
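This effect can be sketched with a toy summary statistic. The accuracy figures below are hypothetical, not taken from Table 4.4; the point is that a page's median accuracy can hide a very large spread across devices, so the median alone does not reliably indicate a degraded image.

```python
from statistics import median

def summarize_page(accuracies):
    """Summarize per-device OCR character accuracies for one page."""
    return {
        "median": median(accuracies),
        "spread": max(accuracies) - min(accuracies),
    }

# Hypothetical per-device accuracies for a page containing a table:
# some devices handle the table well, others fail badly.
table_page = [0.96, 0.94, 0.91, 0.62, 0.41]
summary = summarize_page(table_page)
print(summary["median"], summary["spread"])
```

With these invented numbers the median is a respectable 0.91 even though the spread across devices is over 50 percentage points.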

Table Observation 2.
Character recognition measures are not enough for table-OCR evaluation. When tables are present, an OCR user is interested not only in the contents (i.e., the characters) of the table but also in its overall layout. This information is critical for tables with empty cells: if the OCR algorithm does not reproduce the proper layout, neighboring cell values can be wrongly assigned to these empty cells. A character recognition measure such as the one used at ISRI is not sophisticated enough to check table formatting and is thus not well suited for this kind of accuracy evaluation. Developing new ways to measure table-OCR accuracy is beyond the scope of this work, but should be addressed if table-generated OCR output is to be evaluated.
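The limitation can be made concrete with a small sketch. The accuracy formula below follows the usual ISRI-style definition, (n - edit distance)/n over ground-truth characters; the table fragment is invented. An output that shifts a value into the wrong row because of a missed empty cell still earns a moderate character score, and nothing in that single number reveals that the table structure is destroyed rather than a few characters misread.

```python
def char_accuracy(truth, ocr):
    """ISRI-style character accuracy: (n - edit_distance) / n,
    where n is the number of ground-truth characters."""
    m, n = len(truth), len(ocr)
    d = list(range(n + 1))          # rolling row of the Levenshtein DP
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = min(d[j] + 1, d[j - 1] + 1,
                      prev + (truth[i - 1] != ocr[j - 1]))
            prev, d[j] = d[j], cur
    return (m - d[n]) / m

# Two-column table; row "b" has an empty second cell.
truth = "a  1\nb\nc  2"
# OCR misses the empty cell and shifts "2" up into row "b":
# the table is now semantically wrong, yet the character score is
# indistinguishable from a few ordinary misrecognitions.
shifted = "a  1\nb  2\nc"
print(char_accuracy(truth, shifted))   # ~0.636
```

A layout-aware measure would have to score cell assignment separately, which is exactly what a pure character metric cannot do.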

Table Observation 3.
Numeric tables seem to present more difficulty to OCR algorithms than textual tables and normal text. The majority of the errors found in the OCR output for the evaluated tables stemmed from numerical data. Substitutions of ``0'' by ``O'' and of ``1'' by ``l'' were among the most common. Some OCR devices are strongly biased toward textual information and therefore make these kinds of errors in purely numeric data. Furthermore, OCR devices cannot use lexicon-based correction for numerical data, since numbers do not appear in any dictionary.
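A minimal post-OCR cleanup for exactly these confusions might look as follows. This is a hypothetical sketch, not part of the classifier: the threshold and token heuristic are assumptions, chosen only to illustrate why numeric context, rather than a lexicon, is what disambiguates ``O'' from ``0'' and ``l'' from ``1''.

```python
import re

# Classic OCR digit/letter confusions (assumed mapping).
CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})

def fix_numeric_token(token):
    """Apply the confusion map only to tokens that look numeric."""
    alnum = [c for c in token if c.isalnum()]
    digits = sum(c.isdigit() for c in alnum)
    # Heuristic threshold (an assumption, not from the experiment):
    if alnum and digits >= len(alnum) / 2:
        return token.translate(CONFUSIONS)
    return token

def fix_line(line):
    return re.sub(r"\S+", lambda m: fix_numeric_token(m.group()), line)

print(fix_line("Total l,O24 units"))   # -> "Total 1,024 units"
```

Note that ``Total'' is left alone: only tokens dominated by digits are rewritten, which is the kind of context a dictionary-based corrector never sees.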

It is important to make clear that none of these three observations depends on page quality. After visual inspection, tables with poor image quality were in general correctly flagged as ``Bad'' by the classifier, and their OCR output exhibited the normal image-quality-related errors (in addition to the special problems posed by tables mentioned above). Similarly, pages labeled ``Good'' by the classifier were visually inspected and found to indeed have high image quality. Some of these pages were considered ``Bad'' in the experiment because their low OCR accuracy stems from the special characteristics of tables mentioned above, not from image defects. Moreover, many of the ``B->G'' pages would not have been misclassified if only a selected subset of the devices had been used in the experiment. This variability in OCR output is unusual for ``normal'' textual pages.

Therefore, a reject region was established to filter out all pages containing tables. These pages need to be processed by another type of classifier, since the difficulty they present to OCR algorithms results not from image quality but from the complexity of their contents and layout. Table 4.5 shows the confusion matrix with the added Table Reject column. Pages containing tables were not processed by the classifier and were assigned to the Table Reject column.
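The resulting pipeline can be sketched as follows. The field names and the `quality_label` stand-in are invented for illustration; the point is only that table pages bypass the image-quality classifier entirely and land in the Table Reject column.

```python
def quality_label(page):
    # Stand-in for the ratio-based image-quality classifier
    # ("looks_degraded" is a hypothetical field).
    return "Bad" if page["looks_degraded"] else "Good"

def route(page):
    """Send table pages to the Table Reject column; classify the rest."""
    if page["has_table"]:
        return "Table Reject"
    return quality_label(page)

print(route({"has_table": True, "looks_degraded": False}))   # Table Reject
print(route({"has_table": False, "looks_degraded": True}))   # Bad
```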

The Error and Reject Rates are now:
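For reference, one common convention (an assumption here, not confirmed by the text) defines these rates in terms of $R$ rejected pages, $E$ misclassified accepted pages, and $N$ pages in total:

\begin{displaymath}
\mbox{Reject Rate} = \frac{R}{N}, \qquad
\mbox{Error Rate} = \frac{E}{N - R}
\end{displaymath}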

Examining the results broken down by the number of connected components on the page (see Table 4.6), it becomes clear that a reject zone must be implemented for any page with 200 connected components or fewer.

The rationale behind this decision is that the classifier is based on measured ratios. Pages with a low number of connected components are not ``stable'' enough to produce credible ratios, since a small variation can result in a very high (or low) ratio. A cutoff number is therefore required to make the classifier robust.
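A small numeric sketch shows why low-count pages are rejected. The feature name and the counts are hypothetical; only the 200-component cutoff comes from the experiment. One additional component moves a ratio on a sparse page a hundred times more than on a dense page.

```python
def component_ratio(part, total):
    """A generic classifier feature: the fraction of connected
    components on a page with some property (e.g. 'broken')."""
    return part / total

# One extra qualifying component on a 20-component page moves the
# ratio by 5 percentage points; on a 2000-component page, by 0.05.
sparse_swing = component_ratio(2, 20) - component_ratio(1, 20)
dense_swing = component_ratio(101, 2000) - component_ratio(100, 2000)

CUTOFF = 200  # components; pages at or below this are rejected

def stable_enough(total_components):
    """Only pages above the cutoff yield trustworthy ratios."""
    return total_components > CUTOFF

print(sparse_swing, dense_swing, stable_enough(150))
```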

The results obtained after applying this new filter are shown in Table 4.7, where no ``B->G'' misclassifications exist.

These last results produce the following Error / Reject Rates:

In this case, however, there are no ``B->G'' misclassifications, which was a desired goal with regard to quality control.

There is, however, a considerable number of ``G->B'' misclassifications. After evaluating the misclassified pages as well as the triggered rules that produced the misclassifications, the following considerations are in order:


