Size Information




In addition to the two previous measures, the classifier incorporates two more preventive measures based on the size of the connected components.

The rationale behind these heuristics is that pages containing too many large connected components (black or white) are more prone to OCR errors than pages that do not.

Large black connected components throughout the whole page can be the result of touching characters, a very large font, or complex vertical touching patterns. All of these characteristics pose difficulty to OCR algorithms.

Similarly, large white connected components can be the product of large fonts, inverse video, or complex touching patterns.

The classifier measures this information by taking the maximum of the average width and the average height of the connected components (CCs) on a page, computed separately for the black and for the white connected components.
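
The measure described above can be illustrated with a minimal Python sketch. It is not the classifier's actual implementation. It assumes the page is a binary image stored as a list of rows with 1 for black and 0 for white, that the width and height of a component are those of its bounding box, and that 8-connectivity is used for both colors; the text states none of these details. The function names component_boxes, size_metric and size_information are hypothetical, and so is the value 0.0 returned for a page without components.

```python
from collections import deque


def component_boxes(image, value, connectivity=8):
    """Return the bounding-box (width, height) of every connected component
    whose pixels equal `value`. The connectivity is an assumption."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    if connectivity == 8:
        steps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                 if (dy, dx) != (0, 0)]
    else:
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != value or seen[r][c]:
                continue
            # Breadth-first flood fill of one component, tracking its extent.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, bottom, left, right = r, r, c, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in steps:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((right - left + 1, bottom - top + 1))
    return boxes


def size_metric(boxes):
    """Maximum of the average width and the average height of the boxes."""
    if not boxes:
        return 0.0  # assumed value for a page with no components
    mean_width = sum(w for w, _ in boxes) / len(boxes)
    mean_height = sum(h for _, h in boxes) / len(boxes)
    return max(mean_width, mean_height)


def size_information(image):
    """Return the (black, white) size measures for a binary page image."""
    black = size_metric(component_boxes(image, 1))
    white = size_metric(component_boxes(image, 0))
    return black, white


# Tiny synthetic page: a 2x2 black blob and a 1x3 black vertical bar.
page = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
black_size, white_size = size_information(page)
```

On this toy page the black components have bounding boxes of 2x2 and 1x3 pixels, so the average width is 1.5 and the average height is 2.5, giving a black measure of 2.5. The white pixels form a single component spanning the whole 5x4 page, giving a white measure of 5.0. On a real page a library labeling routine would replace the flood fill; the pure-Python version is used here only to keep the sketch self-contained.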