Features Observed but not Used

New and more complex features would be a key component of a production-type classifier. In particular, if the user is interested in higher ``good thresholds'', the filter will need more features to work with.

During the research for this thesis, several features were observed that could be useful in a full-blown system. These features, along with a brief explanation of their significance, follow:

Overlapping. The amount of overlap between the bounding boxes of two neighboring connected components could serve as an indicator of both the font complexity and the presence of deformed and/or broken characters (see Observation 5 in Chapter 3).
Skew angle. The degree of skew of an image could be a strong indicator of its quality. If the skew angle exceeds a certain threshold [13], then the page should be considered ``Bad''.
Width distribution. To determine the amount of touchiness in a page, the width distribution would probably have to be calculated and summary information derived from it.
Micro-Gaps. In this work, very small white blobs were used to estimate the degree of thickness (and therefore touchiness) of the characters. In the same vein, a metric to detect micro-gaps would account for a large number of the broken characters and also for the degree of character complexity (see Observation 4 in Chapter 3).
Filled CC boxes. A way to detect completely filled lakes in letters like ``e'', ``a'', etc., would be to measure the black density inside a CC box of certain (minimum) dimensions. Figure 5.1 shows an example of this metric in action for two real-world strings of characters. The black densities of each connected component, and of the collection of connected components, are shown for a ``fat'' and a ``normal'' character string.

Figure 5.1: Black Density for Connected Components
Deformed contours. The degree of complexity of the contours of a character could also be a very good predictor of the font complexity and the paper/scanner quality. Figure 5.2 shows a ``well-formed'' and a ``deformed'' character.

Figure 5.2: Deformed and Well-Formed Characters
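
As an illustration of the ``Overlapping'' and ``Filled CC boxes'' features described above, the following is a minimal sketch, not the method used in this thesis. It assumes a binary image stored as a list of rows (1 = black, 0 = white) and 8-connectivity. The function names, the min_side parameter and the 0.9 density threshold are hypothetical choices made only for this example.

```python
from collections import deque


def connected_components(img):
    """Return one (top, left, bottom, right, black_count) tuple per
    8-connected black region. Box coordinates are inclusive."""
    h = len(img)
    w = len(img[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited black pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right, count = r, c, r, c, 0
            while queue:
                y, x = queue.popleft()
                count += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right, count))
    return boxes


def black_density(box):
    """Fraction of black pixels inside the bounding box of a component."""
    top, left, bottom, right, count = box
    return count / ((bottom - top + 1) * (right - left + 1))


def filled_boxes(boxes, min_side=3, threshold=0.9):
    """Boxes of at least min_side x min_side whose density reaches the
    threshold (both values are assumptions, not taken from the thesis)."""
    return [b for b in boxes
            if b[2] - b[0] + 1 >= min_side
            and b[3] - b[1] + 1 >= min_side
            and black_density(b) >= threshold]


def horizontal_overlap(a, b):
    """Number of pixel columns shared by two bounding boxes."""
    return max(0, min(a[3], b[3]) - max(a[1], b[1]) + 1)


# A ring (open lake) next to a solid blob (filled lake).
image = [
    [1, 1, 1, 0, 1, 1, 1],
    [1, 0, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1],
]
boxes = connected_components(image)
densities = [black_density(b) for b in boxes]
```

In this toy image the ring keeps its lake and therefore has a lower black density than the solid blob, which is the only box reported by filled_boxes. In a full-blown system the density threshold and the minimum box dimensions would have to be tuned on real pages.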
