python opencv ocr python-tesseract

How to OCR a text with white colour characters on a blue background from a cropped image?


First, I want to crop an image using a mouse event, and then print the text inside the cropped region. I tried several OCR scripts, but none of them works for the image attached below. I think the reason is that the text consists of white characters on a blue background.

Can you help me with this?

Full image:

Cropped image:

An example of what I tried:

import pytesseract
import cv2
import numpy as np

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'

img = cv2.imread('D:/frame/time 0_03_.jpg')

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
adaptiveThresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 35, 30)
inverted_bin=cv2.bitwise_not(adaptiveThresh)

#Some noise reduction
kernel = np.ones((2,2),np.uint8)
processed_img = cv2.erode(inverted_bin, kernel, iterations = 1)
processed_img = cv2.dilate(processed_img, kernel, iterations = 1)
 
#Applying image_to_string method
text = pytesseract.image_to_string(processed_img)
 
print(text)

Solution

  • [EDIT]

    For anyone wondering, the image in the question was updated after I posted my answer. This was the original image:

    Input 1

    Hence the output shown in my original answer below.

    This is the newly posted image:

    Input 2

    The Turkish-specific characters, especially in the last word, are still not detected properly (I still can't use lang='tur' right now), but at least the Ö and Ü are detected with lang='deu', which I have installed:

    text = pytesseract.image_to_string(mask, lang='deu').strip().replace('\n', '').replace('\f', '')
    print(text)
    # GÖKYÜZÜ AVCILARI ILE TEKE TEK KLASIGI
    

    [/EDIT]
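
Since the installed language data differs from machine to machine, a small fallback can pick the first usable language instead of hard-coding 'deu'. This is only a sketch: `pick_language` and `installed_languages` are hypothetical helper names of mine, the preference order is an assumption, and it relies on `pytesseract.get_languages` being available in the installed pytesseract release.

```python
def pick_language(available, preferred=("tur", "deu", "eng")):
    """Return the first preferred language code that is installed, else None."""
    installed = set(available)
    for code in preferred:
        if code in installed:
            return code
    return None


def installed_languages():
    """Ask Tesseract which language packs it can see (needs Tesseract installed)."""
    import pytesseract  # imported lazily so pick_language works without it
    return pytesseract.get_languages(config="")


# Usage (requires Tesseract and the mask from the answer's code):
# lang = pick_language(installed_languages())
# text = pytesseract.image_to_string(mask, lang=lang or "eng")
```

With only 'deu' and 'eng' installed, this would select 'deu', which matches what I did by hand above.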


    I wouldn't use cv2.adaptiveThreshold here, but a simple cv2.threshold with cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV. Since the comma touches the image border, I'd also add a one-pixel-wide border via cv2.copyMakeBorder so that the comma is captured properly. This is the full code (replacing \f is only needed because of my pytesseract version):

    import cv2
    import pytesseract
    
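    img = cv2.imread('n7nET.jpg')
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Otsu picks the threshold automatically; BINARY_INV inverts the result
    mask = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV)[1]
    # One-pixel border; value is passed by keyword (the 7th positional is dst)
    mask = cv2.copyMakeBorder(mask, 1, 1, 1, 1, cv2.BORDER_CONSTANT, value=0)
    # Strip whitespace, line breaks and the trailing form feed
    text = pytesseract.image_to_string(mask).strip().replace('\n', '').replace('\f', '')
    print(text)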
    # 2020'DE SALGINI BILDILER, YA 2021'DE?
    

    The output looks correct to me, except for the special (I assume Turkish) capital İ with the dot above. Unfortunately, I can't run pytesseract.image_to_string(..., lang='tur'), since that language data simply isn't installed here. It may be worth trying that to get the proper characters as well.

    ----------------------------------------
    System information
    ----------------------------------------
    Platform:      Windows-10-10.0.16299-SP0
    Python:        3.9.1
    PyCharm:       2021.1.1
    OpenCV:        4.5.1
    pytesseract:   5.0.0-alpha.20201127
    ----------------------------------------