
Pytesseract Improve OCR Accuracy


I want to extract the text from an image in Python, and I chose pytesseract to do it. When I tried it on the image below, the results weren't satisfactory. I also went through this and implemented all the techniques listed there, yet it still doesn't perform well.

Image:

(image not shown)

Code:

import pytesseract
import cv2
import numpy as np

img = cv2.imread('D:\\wordsimg.png')

img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

kernel = np.ones((1,1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)

img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
    
txt = pytesseract.image_to_string(img, lang='eng')

txt = txt[:-1]

txt = txt.replace('\n',' ')

print(txt)

Output:

t hose he large form might light another us should took mountai house n story important went own own thought girl over family look some much ask the under why miss point make mile grow do own school was 

Even one unwanted space could cost me a lot, so I want the results to be 100% accurate. Any help would be appreciated. Thanks!


Solution

  • I changed the resize factor from 1.2 to 2 and removed all the preprocessing except the grayscale conversion. I got good results with psm 11 and psm 12.

    import pytesseract
    import cv2
    import numpy as np
    
    img = cv2.imread('wavy.png')
    
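    # Upscale by 2x instead of 1.2x.
    img = cv2.resize(img, None, fx=2, fy=2)
    
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    
    # The original preprocessing is disabled. (With a 1x1 kernel, dilate and
    # erode leave the image unchanged anyway.)
    #  kernel = np.ones((1,1), np.uint8)
    #  img = cv2.dilate(img, kernel, iterations=1)
    #  img = cv2.erode(img, kernel, iterations=1)
    #  img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    
    # Save the image that is passed to Tesseract so it can be inspected.
    # No thresholding is applied any more, despite the file name.
    cv2.imwrite('thresh.png', img)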
    
    pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'
        
    for psm in range(6,13+1):
        config = '--oem 3 --psm %d' % psm
        txt = pytesseract.image_to_string(img, config = config, lang='eng')
        print('psm ', psm, ':',txt)
    

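    The config = '--oem 3 --psm %d' % psm line uses the string interpolation (%) operator to replace %d with an integer (psm). --oem selects the OCR engine mode; 3 is the default, which uses whatever engine is available (tesseract --help-oem lists the options), so passing it is optional, but I've gotten in the habit of using it. More on psm at the end of this answer.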
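As a small aside, the same option string can be built with an f-string, which some find easier to read on Python 3.6+. The helper name build_config below is hypothetical and not part of pytesseract; it only produces the string that is passed as config.

```python
def build_config(psm, oem=3):
    """Return a Tesseract option string such as '--oem 3 --psm 11'.

    build_config is a hypothetical helper for illustration; pytesseract
    forwards the resulting string to the tesseract command line.
    """
    return f'--oem {oem} --psm {psm}'


# Same range as the loop in the answer: page segmentation modes 6 through 13.
configs = [build_config(psm) for psm in range(6, 13 + 1)]
print(configs[0])   # --oem 3 --psm 6
print(configs[-1])  # --oem 3 --psm 13
```

Each entry can then be passed as pytesseract.image_to_string(img, config=..., lang='eng'), exactly as in the loop.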

    psm  11 : those he large form might light another us should name
    
    took mountain story important went own own thought girl
    
    over family look some much ask the under why miss point
    
    make mile grow do own school was
    
    psm  12 : those he large form might light another us should name
    
    took mountain story important went own own thought girl
    
    over family look some much ask the under why miss point
    
    make mile grow do own school was
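The output above still has blank lines between rows, while the question asks for single spaces between words. Here is a minimal sketch of one way to normalize it, assuming whitespace is the only thing left to clean up. The name normalize_whitespace is hypothetical. It cannot repair words that the OCR itself split or merged.

```python
def normalize_whitespace(txt):
    """Collapse every run of spaces, tabs, and newlines into one space."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops leading and trailing whitespace, so no manual slicing is needed.
    return ' '.join(txt.split())


# Some Tesseract versions end their output with a form feed ('\x0c');
# split() treats it as whitespace too.
sample = 'those he large form\n\ntook mountain story\n\x0c'
print(normalize_whitespace(sample))
# those he large form took mountain story
```

This replaces the txt[:-1] and txt.replace('\n', ' ') steps from the question, which would leave double spaces wherever the output has blank lines.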
    

    psm is short for page segmentation mode, which tells Tesseract what kind of layout to expect in the image. I'm not sure exactly how each mode behaves, but the descriptions give a feel for them. You can get the list from tesseract --help-psm:

    Page segmentation modes:
      0    Orientation and script detection (OSD) only.
      1    Automatic page segmentation with OSD.
      2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
      3    Fully automatic page segmentation, but no OSD. (Default)
      4    Assume a single column of text of variable sizes.
      5    Assume a single uniform block of vertically aligned text.
      6    Assume a single uniform block of text.
      7    Treat the image as a single text line.
      8    Treat the image as a single word.
      9    Treat the image as a single word in a circle.
     10    Treat the image as a single character.
     11    Sparse text. Find as much text as possible in no particular order.
     12    Sparse text with OSD.
     13    Raw line. Treat the image as a single text line,
           bypassing hacks that are Tesseract-specific.
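
Since the right mode depends on the image, one way to choose is to run several modes on a sample whose text is already known and keep the closest match. The sketch below assumes the outputs have already been collected in a dict keyed by psm. The name best_psm and the sample strings are hypothetical, and difflib's ratio is only a rough similarity measure, not a proper OCR accuracy metric.

```python
from difflib import SequenceMatcher


def best_psm(results, expected):
    """Return (psm, score) for the output most similar to `expected`.

    results:  dict mapping a psm number to the text Tesseract returned.
    expected: the known-correct text for the same image.
    """
    def score(text):
        # Compare word lists so layout differences (newlines) are ignored.
        return SequenceMatcher(None, text.split(), expected.split()).ratio()

    psm = max(results, key=lambda k: score(results[k]))
    return psm, score(results[psm])


# Hypothetical outputs, for illustration only.
results = {
    6: 't hose he large form',
    11: 'those he large\n\nform',
}
print(best_psm(results, 'those he large form'))
# (11, 1.0)
```

In practice the dict would be filled inside the psm loop, for example results[psm] = txt.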