Tags: ocr, hocr

Detecting bold (and italic) text in an image


I want to detect stretches of bold (and perhaps italic) text in images of pages--think TIFFs or image PDFs. I need pointers to any open source software that does that.

Here's a picture of a dictionary entry (from a Tzeltal--Spanish dictionary) illustrating such text:

[Image: scanned dictionary entry showing bold, italic, and normal text]

The first line has bold, then italics, then "normal" text; the second has a couple of words in bold, then a couple in a normal font. The formatting represents implicit structure: bold is for headwords, italics is for part of speech, and normal is for most other things. Without knowing what's bold/italic/normal, it's impossible to parse these entries into structured text (like XML).

When our dictionary parsing project was active several years ago, we were using Tesseract version 3 to OCR the images, with the hOCR output giving us positional information on the page (crucial for, e.g., separating out the different entries in the dictionary). The hOCR output also included 'strong' tags for bold and 'em' tags for italics. While the 'em' tagging was reasonably accurate, the 'strong' tagging was almost random. And now version 4 of Tesseract doesn't even try (see also). You can still tell Tesseract to use the old engine, but as I say, that seems to be completely inaccurate, at least on the text we fed it.
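For reference, pulling hOCR out of Tesseract looks roughly like this -- a sketch using pytesseract, where the filenames are placeholders and --oem 0 selects the old engine (which needs the legacy traineddata files):

    import pytesseract

    # hOCR comes back as bytes; '--oem 0' forces the legacy engine, the only
    # one that even attempts the 'strong'/'em' style tagging.
    hocr = pytesseract.image_to_pdf_or_hocr('page.tif', extension='hocr',
                                            config='--oem 0')
    with open('page.hocr', 'wb') as f:
        f.write(hocr)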

It doesn't seem like distinguishing bold vs. normal text should be hard; I can stand far away from my monitor and pick out the bold and non-bold stretches even though I can't read the words at that distance. (I suppose telling whether an entire text was bold or non-bold would be harder, but distinguishing them when both appear seems easy--for humans.)

I am told that ABBYY FineReader outputs information on font style, but for various reasons that won't work for our application.

If there were a non-OCR way of distinguishing bold vs. non-bold text that would put bounding boxes around the bold text, we could probably match those stretches up with the bounding boxes for the characters/words that Tesseract outputs (allowing for a few pixels of difference). I know there was research on this decades ago (also here), but is there any open source software that actually does it?
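The matching step itself would be easy; a sketch like the following would do, assuming (x0, y0, x1, y1) boxes and a tolerance of a few pixels (both assumptions, not anything Tesseract prescribes):

    def boxes_match(box_a, box_b, tol=3):
        # True when two (x0, y0, x1, y1) boxes agree within tol pixels per edge.
        return all(abs(a - b) <= tol for a, b in zip(box_a, box_b))

    def mark_bold_words(word_boxes, bold_boxes, tol=3):
        # Flag each OCR word box that lines up with a detected bold region.
        return [(box, any(boxes_match(box, b, tol) for b in bold_boxes))
                for box in word_boxes]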


Solution

  • I came up with this script:

    import cv2
    import numpy as np

    # Rectangular kernel used for the dilation step (restores stroke mass
    # after erosion).
    KERNEL = np.asarray([
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
    ], np.uint8)

    # Slanted kernel that follows the stroke angle of italic glyphs, so
    # eroding with it preserves italic strokes and eats away upright ones.
    KERNEL_ITALIC = np.asarray([
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 0, 0],
    ], np.uint8)

    def pre_process_italic(img):
        # Mirror the page so the same kernel also probes the opposite slant.
        img_f = cv2.flip(img, 1)

        # Erode with the slanted kernel, then dilate to restore thickness.
        img = cv2.erode(img, KERNEL_ITALIC, iterations=1)
        img = cv2.dilate(img, KERNEL, iterations=1)

        img_f = cv2.erode(img_f, KERNEL_ITALIC, iterations=1)
        img_f = cv2.dilate(img_f, KERNEL, iterations=1)
        img_f = cv2.flip(img_f, 1)  # flip back into page coordinates
        return img, img_f

    def apply_func_italic(bbox, original, preprocessed):
        # bbox is (x0, y0, x1, y1); numpy indexes rows (y) first.
        x0, y0, x1, y1 = bbox

        a = np.mean(original[y0:y1, x0:x1])
        b = np.mean(preprocessed[y0:y1, x0:x1])

        return get_ratio(a, b)

    def get_ratio(a, b):
        # Normalized difference: how much ink the morphology removed.
        return ((a - b) / (a + b + 1e-8)) * 2
    

    This Python code takes an image containing text and applies OpenCV morphological operations (an erosion with the slanted kernel followed by a dilation). pre_process_italic returns two processed images: one probed with the italic-slant kernel directly and one probed against the mirrored slant. After that, all you need is the words' bounding boxes: loop through them and compute the ratio of 'on' pixels in the original image to those in the processed one. The get_ratio function is just one possible metric and can be replaced with another; I have not found a better one yet.
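    As a usage sketch (assumptions on my part: the page is binarized so text pixels are 'on', word_boxes comes from the hOCR word geometry, and the 0.1 decision margin needs tuning on real data):

    import cv2

    page = cv2.imread('page.png', cv2.IMREAD_GRAYSCALE)
    # Otsu threshold, inverted so text pixels become 255 ('on').
    _, binary = cv2.threshold(page, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    processed, processed_flipped = pre_process_italic(binary)

    word_boxes = []  # (x0, y0, x1, y1) tuples from the hOCR output
    for box in word_boxes:
        # Italic strokes survive the slanted erosion, so they lose less ink
        # and score a lower ratio than upright strokes do.
        r = apply_func_italic(box, binary, processed)
        # The mirror-slant response is a baseline: upright text scores about
        # the same both ways, while italics score asymmetrically.
        r_flipped = apply_func_italic(box, binary, processed_flipped)
        is_italic = (r_flipped - r) > 0.1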