I am trying to use PyTesseract to recognize text in some scanned documents, which contain simple text as well as complicated diagrams. PyTesseract misinterprets the diagrams (or parts of them) as characters, which I do not want to happen. One solution would be to limit the maximum size (width or height) of the characters to be recognized and ignore anything larger (i.e. the diagrams).
Is there any way I can limit the maximum size of characters to be recognized?
PyTesseract version: LooseVersion('5.0.0-alpha.20200328'); Python 3.8.5.
I found a trivial solution to this problem, so I am posting it here. I used pytesseract.image_to_boxes:
import pytesseract

# img is the scanned page, loaded beforehand (e.g. a PIL image)
data = pytesseract.image_to_boxes(img)

letters = []
coords = []
# Each output row has the form "<char> <left> <bottom> <right> <top> <page>"
for row in data.splitlines():
    parts = row.split()
    if len(parts) != 6:
        continue  # skip blank or malformed rows
    letters.append(parts[0])
    coords.append([int(v) for v in parts[1:5]])

for i in range(0, len(coords)):
    print(letters[i], coords[i])
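The above code separates the letters and their coordinates into two lists. Each coordinate entry holds the left, bottom, right and top edges of a character's box, so after this I used the condition
if (abs(coords[i][2] - coords[i][0]) < size) and (abs(coords[i][3] - coords[i][1]) < size+5):
to keep only the characters whose width is below size and whose height is below size+5 (size being a pixel threshold chosen for the document); anything larger, such as diagram fragments, is dropped.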
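For completeness, here is a minimal sketch of the whole filter in one place. It assumes the box text has one character per row in the form "char left bottom right top page", which is what image_to_boxes produced for me. The helper names parse_boxes and filter_by_size and the height_slack parameter are made up for this illustration, and the sample string is invented data rather than real OCR output.

```python
def parse_boxes(box_text):
    """Parse box text into (char, (left, bottom, right, top)) pairs."""
    parsed = []
    for row in box_text.splitlines():
        parts = row.split()
        if len(parts) != 6:
            continue  # skip blank or malformed rows
        char = parts[0]
        left, bottom, right, top = (int(v) for v in parts[1:5])
        parsed.append((char, (left, bottom, right, top)))
    return parsed


def filter_by_size(parsed, size, height_slack=5):
    """Keep boxes narrower than size and shorter than size + height_slack."""
    kept = []
    for char, (left, bottom, right, top) in parsed:
        if abs(right - left) < size and abs(top - bottom) < size + height_slack:
            kept.append((char, (left, bottom, right, top)))
    return kept


# Invented sample in the assumed box format; real input would come from
# pytesseract.image_to_boxes(img).
sample = "a 10 10 22 28 0\nb 30 10 44 30 0\n| 100 5 400 600 0"
small = filter_by_size(parse_boxes(sample), size=30)
print("".join(char for char, _ in small))
```

The threshold has to be tuned per document: set it too low and large headings disappear along with the diagrams, set it too high and diagram fragments slip through.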