The other important problem for OCR algorithms seems to be broken characters. The Broken Character Factor is designed to measure the number of broken characters in a given image (Observation 2). In general, the sizes and shapes of character fragments vary widely, so their CC boxes have many different widths and heights. In the WH-Map of a page with broken characters, these ``broken'' CC boxes appear near the (width=0, height=0) vertex of the graph. Furthermore, because their shapes vary, both ``wide'' and ``tall'' boxes are expected. Therefore, pages with broken characters present a ``broken char zone'' in the WH-Map, as shown in Figure 3.10.
Figure 3.10: Broken Char Zone and Other Char Zones
Note that the broken char zone is designed to collect all small connected components. These are mostly the product of broken characters but can also be dots and, in small typesizes, other small legal characters. A second look at Figure 3.10 shows that the ``Dots zone'' lies completely inside the broken char zone. This means that all the dots in the page, such as periods and the dots of 'i', generate connected components that fall inside the broken char zone.
A density measurement is sensitive to the distribution of characters in the page. It is therefore not a reliable estimator of the number of broken-character pieces present in the image: a page containing a large number of dots or other small legal characters would have a high density in the broken char zone even if none of its characters were broken. For this reason, the coverage of the broken char zone is of interest instead of its density.
To measure the degree of coverage of the broken char zone, the following method is used: the zone is divided into square cells, one per square pixel; then, the CC boxes are allocated to these cells according to their width and height. After all the CC boxes are allocated, the Broken Char Factor is computed as:
A measure such as this one avoids the error of flagging a page as a broken-character suspect merely because the zone contains a large number of dots or other small characters, since many boxes of the same size are allocated to the same cell.
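The difference between density and coverage can be illustrated with a minimal sketch. It assumes that the Broken Char Factor is the fraction of zone cells that receive at least one CC box; this normalization is an assumption made for illustration, not a definition taken from the text. The function name `broken_char_factor` and the 4x4 `square_zone` are likewise hypothetical stand-ins (the actual zone is not square).

```python
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int]  # (width, height) of a CC box, in pixels


def broken_char_factor(boxes: Iterable[Cell], zone_cells: Set[Cell]) -> float:
    """Fraction of zone cells occupied by at least one CC box.

    ASSUMPTION: coverage is normalized by the total number of cells in the
    zone. The zone has one cell per (width, height) pair, as in the text.
    """
    if not zone_cells:
        return 0.0
    # A set keeps each (width, height) cell once, however many boxes hit it.
    occupied = {box for box in boxes if box in zone_cells}
    return len(occupied) / len(zone_cells)


# Hypothetical 4x4 zone, used only to keep the numbers easy to follow.
square_zone = {(w, h) for w in range(1, 5) for h in range(1, 5)}

# A clean page with 500 identical dots: high density, but a single cell.
dots_page = [(2, 2)] * 500

# A page with a few fragments of varied shape: low density, many cells.
broken_page = [(1, 1), (1, 3), (2, 1), (3, 2), (4, 1), (2, 4), (3, 3), (1, 2)]

dots_factor = broken_char_factor(dots_page, square_zone)      # 1 / 16
broken_page_factor = broken_char_factor(broken_page, square_zone)  # 8 / 16
```

Under these assumptions the page with 500 dots scores lower than the page with eight fragments, although it places far more boxes in the zone.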
The broken char zone must be defined independently of any font-specific characteristics so that it remains reliable on pages with different fonts and typesizes. To define the broken char zone, a way of normalizing its dimensions and registering it inside the WH-Map must be determined. A standard way of registering planar information is to determine a single point in the plane from the available data and then define all subsequent plane mappings with regard to this ``anchor point''. Two approaches for determining the anchor point were examined:
Results are described in the section ``Determining Threshold Values''.
After defining the anchor point, the shape and boundaries of the broken char zone must be defined. Experimental observations suggest that the general shape of the zone should be a rectangle aligned with the width-height diagonal and thick enough to allow for ``wide'' and ``tall'' broken pieces.
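A membership test for such a zone might be sketched as follows. The sketch assumes that ``aligned with the width-height diagonal'' means the long axis of the rectangle runs along the line width = height, starting at the anchor point. The parameters `anchor`, `length`, and `thickness`, their default values, and the function name are hypothetical; the text does not specify them.

```python
import math
from typing import Tuple


def in_broken_char_zone(
    width: float,
    height: float,
    anchor: Tuple[float, float] = (0.0, 0.0),  # hypothetical anchor point
    length: float = 20.0,      # hypothetical extent along the diagonal
    thickness: float = 8.0,    # hypothetical extent across the diagonal
) -> bool:
    """Test whether a CC box (width, height) lies in a diagonal rectangle.

    ASSUMPTION: the rectangle starts at `anchor`, its long axis follows the
    direction (1, 1), and it is centered on that axis.
    """
    dw = width - anchor[0]
    dh = height - anchor[1]
    # Coordinate along the diagonal direction (1, 1) / sqrt(2).
    along = (dw + dh) / math.sqrt(2.0)
    # Signed distance from the diagonal: positive for tall boxes,
    # negative for wide boxes.
    across = (dh - dw) / math.sqrt(2.0)
    return 0.0 <= along <= length and abs(across) <= thickness / 2.0
```

With the default values, a ``wide'' fragment such as (6, 2) and a ``tall'' fragment such as (2, 6) both fall inside the zone, while (8, 2) is too far from the diagonal and (30, 30) is too large.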