Concept Exploration Observations

After visually examining this small dataset very closely, the following observations were made:

Observation 1.

Pages whose characters have thick strokes tend to have many touching characters, and touching characters cause OCR errors [10][3]. Another by-product of ``fat'' characters is that the holes (``lakes'') in letters like ``a'', ``e'', etc., are often filled completely or reduced to a minimal white portion in the center. This is also a known cause of OCR errors; for instance, an ``e'' with a filled lake is often classified as ``c''. A metric that captures the existence of these ``minimally open'' holes would be a good way to measure the image quality of fat characters. Figure 3.1 shows an example of this type of character.

Figure 3.1: White Speckle in fat characters
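As a rough illustration of what a ``minimally open'' hole metric could look like, the sketch below counts small white connected regions in a binary image. It is not the metric defined in this work: the function names, the 0 = white / 1 = black encoding, the use of 4-connectivity for white pixels, and the `max_area` threshold are all assumptions made for the example.

```python
from collections import deque

def white_component_areas(image):
    """Areas of 4-connected groups of white (0) pixels in a 2-D list of 0/1 rows."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 0 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited white pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == 0 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            areas.append(area)
    return areas

def white_speckle_count(image, max_area=3):
    """Count white regions no larger than max_area pixels (assumed threshold)."""
    return sum(1 for area in white_component_areas(image) if area <= max_area)

# Toy page: a fat character with a 1-pixel lake next to one with an open lake.
rows = [
    "...........",
    ".###.#####.",
    ".#.#.#...#.",
    ".###.#...#.",
    ".....#...#.",
    ".....#####.",
    "...........",
]
image = [[1 if ch == "#" else 0 for ch in row] for row in rows]

print(white_component_areas(image))  # [43, 1, 9]: background, tiny lake, open lake
print(white_speckle_count(image))    # 1
```

Only the 1-pixel lake falls under the assumed threshold, so a page dominated by fat characters would produce a high count relative to its number of characters.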

Observation 2.

Pages with light characters or low contrast usually have their characters broken into pieces [3]. These pieces tend to be small and can have almost any shape. A metric that weighs the presence of these ``broken pieces'' would be a good estimator of image quality for broken-character pages. Figure 3.2 shows a portion of a broken-character image.

Figure 3.2: Broken characters
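One hedged way to picture a ``broken pieces'' measure is the fraction of black connected components whose area falls below a small threshold. The sketch below is an illustration only, not the estimator used in this work: the function names, the 8-connectivity choice for black pixels, and the `max_area` value are assumptions.

```python
def black_component_areas(image):
    """Areas of 8-connected groups of black (1) pixels in a 2-D list of 0/1 rows."""
    rows, cols = len(image), len(image[0])
    seen = set()
    areas = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or (r, c) in seen:
                continue
            # Depth-first flood fill over the 3x3 neighbourhood of each pixel.
            seen.add((r, c))
            stack = [(r, c)]
            area = 0
            while stack:
                y, x = stack.pop()
                area += 1
                for ny in range(y - 1, y + 2):
                    for nx in range(x - 1, x + 2):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
            areas.append(area)
    return areas

def broken_piece_fraction(image, max_area=4):
    """Share of black components with at most max_area pixels (assumed threshold)."""
    areas = black_component_areas(image)
    if not areas:
        return 0.0
    return sum(1 for area in areas if area <= max_area) / len(areas)

# Toy page: one intact character followed by three small fragments.
rows = [
    "#####..#..#.",
    "#...#.......",
    "#####...##..",
    "............",
]
page = [[1 if ch == "#" else 0 for ch in row] for row in rows]

print(black_component_areas(page))  # [12, 1, 1, 2]
print(broken_piece_fraction(page))  # 0.75
```

In practice the threshold would have to scale with resolution and font size, since legitimate small components such as periods and the dots on ``i'' also fall below a fixed area.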

Observation 3.

Pages with ``inverse video'' (white letters on a black background, see Figure 3.3) or with unusual typesetting (Figure 3.4) tend to produce more OCR errors. Some form of threshold on the font size information would be a good way of predicting the quality of such pages from an OCR point of view.

g - g jj

* *

I *

$ $ f

Figure 3.3: Inverse-video Image and corresponding OCR output

Mystery Mayor

He's Got 40,000 Books,

F ' d

rl e n s All Over Town and a

.epu ion Soft Touch.

' k-Tak

He's aR1 se r and Problem-

Solve~. Yet He C~n Be Ab S e n t-

M' d d

in e , Inarticulate,

Gontradictory and

Downright Sloppy. Can an Entrepreneur-

Pol' ' '

Turned-ltlclan

Lead L.A.? . By Faye FI ore

Figure 3.4: Unusually Typeset Image and Corresponding OCR Output

Observation 4.

Pages with characters that have gaps in their strokes are usually problematic for OCR algorithms. These gaps are usually very small compared to the stroke width. Figure 3.5 shows the image of a real word with many broken characters, with arrows pointing to these ``micro-gaps.''

Figure 3.5: Micro-Gaps in Broken Characters

Observation 5.

Pages with characters that do not touch each other but occupy the same horizontal space, and pages with fragmented or broken characters, tend to produce more OCR errors. These types of characters produce Connected Component boxes (see the Connected Components section below) that overlap each other (see Figure 3.6). This characteristic is commonplace in pages with italic or slanted typefaces and in pages with seriously fragmented characters.

Figure 3.6: Overlapped CC Boxes in Slanted and Broken Chars
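To make the notion of overlapping Connected Component boxes concrete, the sketch below tests pairs of bounding boxes for intersection and reports the fraction of boxes that overlap at least one other box. This observation was not carried forward into the classifier, so the code is purely illustrative; the `(left, top, right, bottom)` box format with inclusive pixel coordinates and both function names are assumptions.

```python
from itertools import combinations

def boxes_overlap(a, b):
    """True if two (left, top, right, bottom) boxes share at least one pixel."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def overlapping_box_fraction(boxes):
    """Fraction of boxes that overlap at least one other box."""
    if not boxes:
        return 0.0
    overlapped = set()
    for (i, a), (j, b) in combinations(enumerate(boxes), 2):
        if boxes_overlap(a, b):
            overlapped.update((i, j))
    return len(overlapped) / len(boxes)

# Two slanted characters whose boxes intrude on each other, plus two isolated ones.
boxes = [(0, 0, 10, 20), (8, 0, 18, 20), (30, 0, 40, 20), (45, 5, 50, 10)]

print(boxes_overlap(boxes[0], boxes[1]))  # True
print(overlapping_box_fraction(boxes))    # 0.5
```

The pairwise comparison is quadratic in the number of boxes, which is acceptable for a sketch but would likely need a sweep over sorted box edges on a full page.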

Observation 6.

The degree of skew of a page is also a good predictor of OCR performance. As shown in [12], more than one degree of skew can cause problems for OCR algorithms.

In this research, the first three observations were selected for further study and ultimately used in the classifier. The remaining observations could probably lead to very good page quality features, but they were eliminated from consideration because of the complexity involved in measuring them.
