pythonimage-processingocrtesseractpython-tesseract

Getting the bounding box of the recognized words using python-tesseract


I am using python-tesseract to extract words from an image. This is a python wrapper for tesseract which is an OCR code.

I am using the following code for getting the words:

import tesseract

api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyz")
api.SetPageSegMode(tesseract.PSM_AUTO)

mImgFile = "test.jpg"
mBuffer=open(mImgFile,"rb").read()
result = tesseract.ProcessPagesBuffer(mBuffer,len(mBuffer),api)
print "result(ProcessPagesBuffer)=",result

This returns only the words and not their location/size/orientation (or in other words a bounding box containing them) in the image. I was wondering if there is any way to get that as well


Solution

  • Use pytesseract.image_to_data()

    import pytesseract
    from pytesseract import Output
    import cv2
    img = cv2.imread('image.jpg')
    
    d = pytesseract.image_to_data(img, output_type=Output.DICT)
    n_boxes = len(d['level'])
    for i in range(n_boxes):
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    
    cv2.imshow('img', img)
    cv2.waitKey(0)
    

    Among the data returned by pytesseract.image_to_data():

    The bounding boxes returned by pytesseract.image_to_boxes() enclose letters so I believe pytesseract.image_to_data() is what you're looking for.