Text reading with Tesseract in a noisy image

I have these two images:

The first one is clearly of higher quality than the second one (although the second is not that bad). I process the two images with OpenCV in order to read the text with Tesseract, like this:

import pytesseract
import cv2

img = cv2.cvtColor(scr_crop, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(img, 220, 255, cv2.THRESH_BINARY)[1]

# Create custom kernel
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
# Perform closing (dilation followed by erosion)
close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)

# Invert image to use for Tesseract
result = 255 - close

# result = cv2.resize(result, (0, 0), fx=2, fy=2)

text = pytesseract.image_to_string(result, lang="ita")

So I first perform a dilation and then an erosion (a morphological closing) on the thresholded grayscale versions of the two images, obtaining these two results:

As you can see, for the first image I get a great result and Tesseract is able to read the text, while for the second image I get a bad result and Tesseract is not able to read the text. How can I improve the quality of the second image so that Tesseract gives a better result?

Solution

For the second image, apply only thresholding (no morphological closing), with a different threshold type.

Instead of cv2.THRESH_BINARY, use cv2.THRESH_BINARY_INV+cv2.THRESH_OTSU.

The image will become:

and if you read it:

txt = pytesseract.image_to_string(thr)
print(txt)

The result will be:

Esiti Positivi: 57

Esiti Negativi: 1512
Numerosita: 1569

Tasso di Conversione: 3.63%

Now, what do cv2.THRESH_BINARY_INV and cv2.THRESH_OTSU mean?

cv2.THRESH_BINARY_INV is the inverse of cv2.THRESH_BINARY: if the current pixel value is greater than the threshold, it is set to 0; otherwise, it is set to maxval (255 in our case).
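
To make the rule concrete, here is a minimal pure-Python sketch of what the two threshold types do to a single row of gray values. The function name threshold_row and the sample values are made up for illustration; this is not OpenCV code, only the per-pixel rule written out by hand.

```python
def threshold_row(pixels, thresh, maxval=255, inverted=False):
    """Apply the binary (or inverted binary) threshold rule to a list of gray values."""
    out = []
    for p in pixels:
        above = p > thresh  # strict comparison: a pixel equal to thresh is not "above"
        if inverted:
            # THRESH_BINARY_INV rule: above the threshold -> 0, otherwise -> maxval
            out.append(0 if above else maxval)
        else:
            # THRESH_BINARY rule: above the threshold -> maxval, otherwise -> 0
            out.append(maxval if above else 0)
    return out


row = [10, 200, 220, 221, 255]  # hypothetical gray values
binary = threshold_row(row, 220)
binary_inv = threshold_row(row, 220, inverted=True)
print(binary)      # [0, 0, 0, 255, 255]
print(binary_inv)  # [255, 255, 255, 0, 0]
```

With maxval set to 255, the inverted result is the same as computing 255 minus the non-inverted result, which is what the result = 255 - close step in the question does by hand.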

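cv2.THRESH_OTSU finds the optimal threshold value using Otsu's algorithm. [page 3] When this flag is set, the threshold value passed to cv2.threshold is ignored and the computed one is used instead.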
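
As a rough illustration of the idea behind Otsu's method, the sketch below picks the threshold that maximises the between-class variance of an 8-bit histogram. It is a plain-Python approximation written for this answer, not OpenCV's implementation; the name otsu_threshold and the sample pixel values are hypothetical, and OpenCV may differ in details such as tie-breaking.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximises the between-class variance.

    Pixels <= t form the "low" class, pixels > t form the "high" class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    w0 = 0    # number of pixels in the low class so far
    sum0 = 0  # sum of gray values in the low class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # low class still empty
        w1 = total - w0
        if w1 == 0:
            break  # high class empty, nothing left to split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


# Hypothetical bimodal data: a dark group and a bright group
pixels = [40, 50, 60] * 10 + [200, 210, 220] * 10
t = otsu_threshold(pixels)

# Inverted binary rule using the computed threshold
inverted = [0 if p > t else 255 for p in pixels]
print(t)
```

For this sample data the chosen threshold falls between the dark and bright groups, so no hand-tuned value such as 220 is needed.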

Code:

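import cv2
import pytesseract

# Load the image and convert it to grayscale
img = cv2.imread("c7xq9.png")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Inverted binary threshold; with THRESH_OTSU the threshold argument (0 here) is ignored
thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
# Run OCR on the thresholded image
txt = pytesseract.image_to_string(thr)
print(txt)
# Show the thresholded image until a key is pressed
cv2.imshow("thr", thr)
cv2.waitKey(0)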