My task is to identify and extract text with strikethrough symbols from an image. I want to select only the words that have this symbol and place each instance in a list.
Image Containing Strikethrough Text
Code I have tried:
from PIL import Image
import pytesseract
# Open the image file
img_path = 'path/to/image.png'
img = Image.open(img_path)
# Use tesseract to do OCR on the image
text = pytesseract.image_to_string(img)
text
The issue is that the output includes all words, with no sign of a strikethrough symbol. If the string contained an indicator of a strikethrough word or phrase, such as '-', then I could process it further; however, plain pytesseract does not detect the strikethrough in this image.
A better approach will be needed.
Expected output: ['Once upon a time', 'Jack', 'village']
I have had some partial success extracting the words by looking at the per-word confidence scores, though the strikethrough also makes the recognized text inaccurate. This could be improved by using each word's bounding box and something like OpenCV to clean up the strikethrough.
# Open the image file
img_path = 'path/KrDdO.png'
img = Image.open(img_path)
# Use tesseract to do OCR on the image
text = pytesseract.image_to_data(img, output_type='dict')
# Print words whose confidence is positive but below 93
for word, conf in zip(text['text'], text['conf']):
    if 0 < conf < 93:
        print(word, conf)
Output:
Onceupon-atime, 72
Jaek 91
viage 31
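One way the bounding-box idea might be sketched, without relying on confidence at all: for each word box returned by image_to_data (the 'left', 'top', 'width' and 'height' keys), check whether some pixel row in the vertical middle of the box is covered by ink across almost the full width. Ordinary letters leave gaps between strokes, while a strikethrough line does not. The function names, the 0.9 coverage threshold and the 30%-70% middle band are assumptions chosen for illustration, not tuned values, and `ink` is assumed to be a 2D boolean NumPy array that is True where the image is dark (for example `np.array(img.convert('L')) < 128`). This is only a rough sketch and would likely miss slanted or broken lines.

```python
import numpy as np

def is_struck_through(ink, left, top, width, height,
                      min_coverage=0.9, band=(0.3, 0.7)):
    """Return True if a near full-width horizontal ink row crosses the
    middle of the box. `ink` is a 2D bool array (True = dark pixel)."""
    box = ink[top:top + height, left:left + width]
    if box.size == 0:
        return False
    rows = box.shape[0]
    r0 = int(rows * band[0])
    r1 = max(r0 + 1, int(rows * band[1]))
    middle = box[r0:r1]
    # Fraction of the box width covered by ink, for each middle row
    coverage = middle.mean(axis=1)
    return bool((coverage >= min_coverage).any())

def struck_words(data, ink, **kwargs):
    """Collect words from an image_to_data-style dict whose boxes
    appear to be crossed by a horizontal line."""
    found = []
    boxes = zip(data['text'], data['left'], data['top'],
                data['width'], data['height'])
    for word, left, top, width, height in boxes:
        if word.strip() and is_struck_through(ink, left, top,
                                              width, height, **kwargs):
            found.append(word)
    return found
```

The recognized text of a struck word may still be wrong (as with "Jaek" above), so the line would probably need to be erased inside the flagged boxes and OCR rerun on them to recover the correct spelling.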