Concept Exploration Observations
A close visual examination of this small dataset led to the
following observations:
- Observation 1.
- Pages with characters whose strokes are thick
tend to have many of their characters touching. This touching causes
OCR errors [10][3]. Another by-product of ``fat''
characters is that the holes (``lakes'') in letters like
``a'' and ``e'' are often filled completely or retain only a minimal
white portion in the center. Filled lakes are also a known cause of
OCR errors; for instance, letters like ``e'' are frequently
classified as ``c''. A metric that
captures the existence of these ``minimally open'' holes would be a
good way to measure the image quality of pages with fat
characters. Figure 3.1 shows an example of this type of
character.
Figure 3.1: White Speckle in fat characters
- Observation 2.
- Pages with light characters or low contrast
usually have their characters broken in pieces [3]. These
pieces tend to be small and can have almost any shape. A metric that
could weigh the existence of these ``broken pieces'' would be a good
estimator of image quality for broken-character
pages. Figure 3.2 shows a portion of a broken-character
image.
Figure 3.2: Broken characters
- Observation 3.
- Pages with ``inverse video'' (white letters on
black background, see Figure 3.3) or with unusual
typesetting (Figure 3.4) tend to produce more OCR
errors. Some type of threshold on the font size information would be a
good way of predicting the quality of the page from an OCR point of
view.
Figure 3.3: Inverse-video Image and corresponding OCR output
Mystery Mayor |
He's Got 40,000 Books, |
F ' d |
rl e n s All Over Town and a |
.epu ion Soft Touch. |
' k-Tak |
He's aR1 se r and Problem- |
Solve~. Yet He C~n Be Ab S e n t- |
M' d d |
in e , Inarticulate, |
Gontradictory and |
Downright Sloppy. Can an Entrepreneur- |
Pol' ' ' |
Turned-ltlclan |
Lead L.A.? . By Faye FI ore |
Figure 3.4: Unusually Typeset Image and Corresponding OCR Output
- Observation 4.
- Pages with characters that have gaps in their
strokes are usually problematic for OCR algorithms. These gaps are
usually very small in comparison to the
stroke-width. Figure 3.5 shows the image of a real word
with many broken characters with arrows pointing to these
``micro-gaps.''
Figure 3.5: Micro-Gaps in Broken Characters
- Observation 5.
- Pages with characters that are not touching
each other but occupy the same horizontal space, or pages with
fragmented/broken characters, tend to produce more OCR errors. These
types of characters produce Connected Component boxes (see the Connected Components section below) that overlap each other (see
Figure 3.6). This characteristic is commonplace
in pages with italic or slanted typefaces and in pages
with seriously fragmented characters.
Figure 3.6: Overlapped CC Boxes in Slanted and Broken Chars
- Observation 6.
- The degree of skew of a page is also a good
predictor of OCR performance. As shown in [12], more than one
degree of skew can cause problems for OCR algorithms.
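
As an illustration of Observation 5, the overlap between Connected Component boxes could be counted roughly as follows. This is only a minimal sketch and not the procedure used in this research. The box format `(left, top, right, bottom)` in inclusive pixel coordinates, the function names `boxes_overlap` and `count_overlapping_pairs`, and the sample boxes are all assumptions made for the example.

```python
from itertools import combinations


def boxes_overlap(a, b):
    """Return True if two boxes share at least one pixel.

    Each box is a hypothetical (left, top, right, bottom) tuple
    in inclusive pixel coordinates.
    """
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def count_overlapping_pairs(boxes):
    """Count the unordered pairs of boxes that overlap each other."""
    return sum(1 for a, b in combinations(boxes, 2) if boxes_overlap(a, b))


# Made-up boxes: the first two mimic slanted neighbors whose boxes
# intrude into each other; the third stands well apart.
sample_boxes = [(0, 0, 10, 20), (8, 0, 18, 20), (30, 0, 40, 20)]
overlapping_pairs = count_overlapping_pairs(sample_boxes)
```

A page-level figure could then be obtained by normalizing the pair count by the number of boxes on the page, although any such normalization is likewise an assumption.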
In this research, the first three observations were selected for
further study and ultimately used in the classifier. The rest of the
observations can probably lead to very good page quality features but
were eliminated from consideration because of the complexity involved
in measuring them.
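
To make Observations 1 and 2 more concrete, the sketch below counts small white components (candidate ``minimally open'' lakes) and small black components (candidate ``broken pieces'') in a binary image. It is a hedged illustration, not the metric developed in this research: the image representation (a list of rows with 1 for black and 0 for white), the use of 4-connectivity, the size threshold `max_size`, and the names `component_sizes` and `small_component_fraction` are all assumptions.

```python
from collections import deque


def component_sizes(image, value):
    """Return pixel counts of the 4-connected components equal to `value`.

    `image` is a list of equal-length rows; 1 = black, 0 = white
    (an assumed convention for this sketch).
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != value or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes


def small_component_fraction(image, value, max_size):
    """Fraction of components of `value` with at most `max_size` pixels."""
    sizes = component_sizes(image, value)
    if not sizes:
        return 0.0
    return sum(1 for s in sizes if s <= max_size) / len(sizes)


# Toy image: a ring with a one-pixel lake, plus two isolated black specks.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
white_speckle = small_component_fraction(page, 0, max_size=2)
broken_pieces = small_component_fraction(page, 1, max_size=2)
```

In the toy image, one of the two white components (the lake) and two of the three black components (the specks) fall under the threshold. On real scans the threshold would presumably need to scale with resolution and font size.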