I have a grayscale image of printed text. I want to extract every individual character from the image so that I can save them as discrete images. I don't want to recognise what the character is, I just want each glyph as a separate file.
I'm using `cv2`, for example:
# Find contours to isolate individual letters
contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
That works perfectly for contiguous characters - that is, where the shape of the glyph has no breaks.
But it doesn't work on characters like `i`, `j`, `:`, and `;` - the dots are detached from the main stroke, so they are not included in the character's contour.
Is there a way to use `cv2` to detect these characters as whole glyphs? I know the document uses only Latin letters, numbers, and punctuation.
The document uses a fairly archaic typeface and doesn't work well with Tesseract or other traditional OCR engines - which is why I want to detect the individual letters, rather than try to recognise them.
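I used OpenCV's `cv2.erode` function with a vertical kernel to erode the image vertically. Erosion shrinks bright regions, so on dark text against a light background it grows the dark strokes; with a vertical kernel each glyph is stretched up and down only.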
kernel = np.array([[0, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 0]], dtype=np.uint8)
erode = cv2.erode(image, kernel, iterations=6)
That transformed this:
[image: the original text]
Into this:
[image: the text after vertical erosion]
That joined the dots on the `i` and `?` characters to the rest of the glyph while leaving enough horizontal space to make detection possible.
I did the detection on the eroded image, but applied the cropping to the original image.