ocr python-tesseract

How do I limit the size (in pixels) of characters to be recognized by PyTesseract?


I am trying to use PyTesseract for recognizing text from some scanned documents, which contain simple text as well as complicated diagrams.

PyTesseract misinterprets the diagrams (or parts of them) as characters, which I do not want to happen. One solution would be to limit the maximum size (width or height) of the characters to be recognized and ignore anything larger (i.e., the diagrams).

Is there any way I can limit the maximum size of characters to be recognized?

PyTesseract version - LooseVersion ('5.0.0-alpha.20200328'); Python 3.8.5.


Solution

  • I found a trivial solution to this problem, so I am posting it here. I used pytesseract.image_to_boxes, which returns a bounding box for each recognized character:

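        import pytesseract

        # image_to_boxes returns one line per recognized character:
        # "<char> <left> <bottom> <right> <top> <page>"
        data = pytesseract.image_to_boxes(img)

        letters = []
        coords = []
        for row in data.splitlines():
            # Split from the right so the character itself is never broken up
            parts = row.rsplit(' ', 5)
            if len(parts) != 6:
                continue  # skip blank or malformed lines
            letters.append(parts[0])
            coords.append([int(v) for v in parts[1:5]])

        for letter, box in zip(letters, coords):
            print(letter, box)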
    

    The above code separates the letters and their coordinates into two parallel lists. After this I used the condition

        if (abs(coords[i][2] - coords[i][0]) < size) and (abs(coords[i][3] - coords[i][1]) < size+5):
    

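    to keep only the required characters: those whose bounding box is less than size pixels wide and less than size + 5 pixels tall, where size is the pixel threshold chosen for the document. Anything larger, such as a diagram fragment, is dropped.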
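
    Putting the parsing and the size condition together, a minimal sketch could look like the one below. The helper names parse_boxes and filter_by_size, the height_slack parameter (standing in for the +5 above) and the sample string are hypothetical, introduced only for illustration. In real use, data would come from pytesseract.image_to_boxes(img) rather than from a hard-coded string.

```python
def parse_boxes(data):
    """Parse image_to_boxes-style text into (char, left, bottom, right, top) tuples."""
    boxes = []
    for row in data.splitlines():
        # Each row is "<char> <left> <bottom> <right> <top> <page>"
        parts = row.rsplit(' ', 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(v) for v in parts[1:5])
        boxes.append((parts[0], left, bottom, right, top))
    return boxes


def filter_by_size(boxes, size, height_slack=5):
    """Keep boxes narrower than `size` and shorter than `size + height_slack` pixels."""
    kept = []
    for char, left, bottom, right, top in boxes:
        width = abs(right - left)
        height = abs(top - bottom)
        if width < size and height < size + height_slack:
            kept.append((char, left, bottom, right, top))
    return kept


# Hypothetical sample: two text-sized boxes and one diagram-sized box
sample = "H 10 20 22 40 0\ni 24 20 28 40 0\nO 100 50 400 380 0\n"
small = filter_by_size(parse_boxes(sample), size=30)
print("".join(char for char, *_ in small))
```

    The value of size depends on the scan resolution and font size, so it is worth printing the box dimensions for a page of plain text first and picking a threshold slightly above the largest one.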