I am using doctr to perform OCR on a skewed image.
Although the OCR accurately recognizes the words, the text returned is organized based on the coordinates of the skewed image. As a result, when I try to combine the words into a coherent string, the output becomes complete gibberish.
The main problem is in the lines:
World War ml or the
was a global conflict Second World War (1
the world's
between two coalitions: September 1939 - 2
Instead of following the lines as they actually appear in the image, it seems to group the words along the horizontal axis of the page.
I tried sorting in different ways, and using different parameters, but all of them fall short.
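To illustrate the ordering problem, here is a minimal sketch in plain Python of a sort that accounts for the skew. It is not doctr API: the function name, the (text, x, y) word tuples, the tolerance value, and above all the assumption that the skew angle is already known are mine, for illustration only. It rotates the word centres back by the skew angle and only then groups them into lines.

```python
import math


def group_words_into_lines(words, skew_degrees, line_tolerance=0.02):
    """Group (text, x, y) word centres into text lines.

    Hypothetical helper, not part of doctr. Coordinates are relative
    (0..1), x to the right, y downwards. skew_degrees is the rotation
    that was applied to the straight page, using the same rotation
    formula as below; it is assumed to be known.
    """
    # Undo the skew by rotating every centre by the opposite angle.
    theta = math.radians(-skew_degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = []
    for text, x, y in words:
        rx = x * cos_t - y * sin_t
        ry = x * sin_t + y * cos_t
        rotated.append((ry, rx, text))

    # Walk top to bottom; start a new line when y moves past the tolerance
    # measured from the first word of the current line.
    rotated.sort()
    lines, current, anchor_y = [], [], None
    for ry, rx, text in rotated:
        if current and abs(ry - anchor_y) > line_tolerance:
            lines.append(current)
            current = []
        if not current:
            anchor_y = ry
        current.append((rx, text))
    if current:
        lines.append(current)

    # Inside each line, order the words left to right.
    return [" ".join(text for _, text in sorted(line)) for line in lines]
```

Calling it with skew_degrees=0 corresponds to sorting on the raw coordinates: on a skewed page the end of one line can sit lower than the start of the next, so words from neighbouring lines get interleaved.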
What I found to work is to de-skew the image with another library and then run OCR on the result.
And now, I get proper lines:
World War ml or the Second World War (1 September 1939 - 2 September 1945)
was a global conflict between two coalitions: the Allies and the Axis powers. Nearly all
the worid's countres--including all the great powers--particpated, with many investing
all available economic, industrial, and scientific capabilities in pursuit of total war,
This works, but I am certain there must be a way to do this directly in doctr, without de-skewing first.
Code:
from doctr.io import DocumentFile
from doctr.models import ocr_predictor


def read_pdf(file_path):
    model = ocr_predictor(
        det_arch='db_resnet50',
        reco_arch='crnn_vgg16_bn',
        pretrained=True,
        export_as_straight_boxes=True,
        detect_orientation=True
    )
    doc = DocumentFile.from_pdf(file_path)
    result = model(doc)
    full_text = []
    for page in result.pages:
        page_text = []
        for block in page.blocks:
            for line in block.lines:
                # Order the words of each line by the x coordinate of their top-left corner
                line_text = ' '.join([word.value for word in sorted(line.words, key=lambda w: w.geometry[0][0])])
                page_text.append("\n" + line_text)
        full_text.append(' '.join(page_text))
    return ' '.join(full_text)
The answer was setting straighten_pages=True as a parameter to the model.
Example:
from doctr.models import ocr_predictor

model = ocr_predictor(
    det_arch="fast_base",
    reco_arch="parseq",
    pretrained=True,
    straighten_pages=True,  # Deskews each page under the hood
)
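For completeness, a minimal sketch of how this could be wired into the reading function from the question. The helper name lines_from_result is made up for illustration, and the sketch assumes that, once the pages are straightened, the words of each line already come back in reading order, so no manual sorting is done. I have only checked this on my own documents, so treat it as a starting point.

```python
def lines_from_result(result):
    """Return one string per detected text line.

    Hypothetical helper: it only relies on the pages -> blocks -> lines
    -> words hierarchy and on each word having a .value attribute.
    """
    lines = []
    for page in result.pages:
        for block in page.blocks:
            for line in block.lines:
                # Assumption: words are already in reading order.
                lines.append(" ".join(word.value for word in line.words))
    return lines


def read_pdf(file_path):
    # Imported here so that lines_from_result can be used without doctr.
    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor

    model = ocr_predictor(
        det_arch="fast_base",
        reco_arch="parseq",
        pretrained=True,
        straighten_pages=True,  # let doctr deskew the pages itself
    )
    doc = DocumentFile.from_pdf(file_path)
    return "\n".join(lines_from_result(model(doc)))
```

Keeping the line-joining logic separate from the model call makes it easy to check the text assembly on a small hand-built object before running the full OCR pipeline.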