python ocr opencv text-recognition

How to detect non-contiguous symbols using CV2?


I have a grayscale image of printed text. I want to extract every individual character from the image so that I can save them as discrete images. I don't want to recognise what the character is, I just want each glyph as a separate file.

I'm using cv2, for example:

    # Find contours to isolate individual letters
    contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

That works perfectly for contiguous characters - that is, where the shape of the glyph has no breaks.

But it doesn't work on characters like i, j, :, and ; - the dots on top are not included.

Is there a way to use CV2 to detect these characters? I know the document uses only Latin letters, numbers, and punctuation.

The document uses a fairly archaic typeface and doesn't work well with Tesseract or other traditional OCR engines - which is why I want to detect the individual letters, rather than try to recognise them.


Solution

  • I used OpenCV's erode function (cv2.erode) with a vertical kernel to stretch the glyphs vertically.

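    import cv2
    import numpy as np

    # Only the centre column is set, so each pass looks one pixel up and one pixel down.
    kernel = np.array([[0, 0, 0, 0, 0],
                       [0, 0, 1, 0, 0],
                       [0, 0, 1, 0, 0],
                       [0, 0, 1, 0, 0],
                       [0, 0, 0, 0, 0]], dtype=np.uint8)

    # Erosion takes the local minimum, so dark text on a light background grows taller.
    erode = cv2.erode(image, kernel, iterations=6)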
    

    That transformed this:

    [Image: old printed text]

    Into this:

    [Image: text which has been vertically deformed]

    That joined the dots to the bodies of characters such as i and ?, while leaving enough horizontal space between characters for contour detection to separate them.

    I ran the contour detection on the eroded image, then used the resulting bounding boxes to crop the original image.