python ocr

How to return properly aligned text from a skewed image?


I am using doctr to perform OCR on a skewed image.

Although the OCR accurately recognizes the words, the text returned is organized based on the coordinates of the skewed image. As a result, when I try to combine the words into a coherent string, the output becomes complete gibberish.

The main problem is in the lines:

World War ml or the
was a global conflict Second World War (1
the world's
between two coalitions: September 1939 - 2

Instead of following the actual lines as they appear in the image, doctr seems to group the words along horizontal bands of the page.
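To illustrate the suspected cause, here is a minimal sketch in plain Python. It does not use doctr: the word coordinates, the 5 degree skew, and the helpers rotate and group_into_lines are all made up for illustration. It groups words into lines by vertical proximity, which works on a straight page but mixes neighbouring lines once the page is rotated, because the vertical drift across a line exceeds the line spacing.

```python
import math


def rotate(words, degrees):
    """Rotate (text, x, y) word centers around the origin by the given angle."""
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(text, x * cos_t - y * sin_t, x * sin_t + y * cos_t)
            for text, x, y in words]


def group_into_lines(words, tolerance):
    """Group words whose y is within `tolerance` of the first word of a line,
    then order each line left to right by x."""
    lines = []
    for text, x, y in sorted(words, key=lambda w: w[2]):
        if lines and abs(y - lines[-1]["y"]) <= tolerance:
            lines[-1]["words"].append((x, text))
        else:
            lines.append({"y": y, "words": [(x, text)]})
    return [" ".join(text for _, text in sorted(line["words"])) for line in lines]


# Hypothetical word centers for two text lines on a straight page.
STRAIGHT = [
    ("World", 0.1, 0.10), ("War", 0.3, 0.10), ("II", 0.5, 0.10), ("or", 0.7, 0.10),
    ("was", 0.1, 0.15), ("a", 0.3, 0.15), ("global", 0.5, 0.15), ("conflict", 0.7, 0.15),
]
EXPECTED = ["World War II or", "was a global conflict"]

SKEWED = rotate(STRAIGHT, 5)       # simulate a page skewed by 5 degrees
DESKEWED = rotate(SKEWED, -5)      # undo the skew before grouping

print(group_into_lines(SKEWED, 0.02))    # lines get mixed together
print(group_into_lines(DESKEWED, 0.02))  # lines are recovered
```

This is only a model of the symptom; doctr's own line-building logic may differ in detail.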

I tried sorting in different ways, and using different parameters, but all of them fall short.

What I found to work is to de-skew the image with another library and then OCR the result.

And now, I get proper lines:

World War ml or the Second World War (1 September 1939 - 2 September 1945)
was a global conflict between two coalitions: the Allies and the Axis powers. Nearly all
the worid's countres--including all the great powers--particpated, with many investing  
all available economic, industrial, and scientific capabilities in pursuit of total war,

This works, but I am certain there must be a way to do this directly in doctr, without de-skewing first.

Code:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor


def read_pdf(file_path):
    model = ocr_predictor(
        det_arch='db_resnet50',
        reco_arch='crnn_vgg16_bn',
        pretrained=True,
        export_as_straight_boxes=True,
        detect_orientation=True
    )

    doc = DocumentFile.from_pdf(file_path)
    result = model(doc)

    full_text = []
    for page in result.pages:
        page_text = []
        for block in page.blocks:
            for line in block.lines:
                # Order the words of each detected line left to right by their x-min
                words = sorted(line.words, key=lambda w: w.geometry[0][0])
                page_text.append(' '.join(word.value for word in words))

        # One detected line per output line
        full_text.append('\n'.join(page_text))

    return '\n'.join(full_text)

Solution

  • The answer was to pass straighten_pages=True to ocr_predictor.

    Example:

    model = ocr_predictor(
        det_arch="fast_base",
        reco_arch="parseq",
        pretrained=True,
        straighten_pages=True,  # Deskews the page under the hood
    )
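
For completeness, here is a minimal sketch of how this predictor could be plugged into the reading loop from the question. The helper names build_predictor and document_to_text are made up for illustration. The sketch assumes the default straight-box geometry ((xmin, ymin), (xmax, ymax)) for each word, as in the question's code, and imports doctr lazily so the text-assembly helper can be tried on its own.

```python
def build_predictor():
    # Imported lazily so document_to_text can be used without doctr installed.
    from doctr.models import ocr_predictor

    return ocr_predictor(
        det_arch="fast_base",
        reco_arch="parseq",
        pretrained=True,
        straighten_pages=True,  # deskew each page before detection
    )


def document_to_text(result):
    """Join the words of a doctr result into one text line per detected line."""
    pages = []
    for page in result.pages:
        lines = []
        for block in page.blocks:
            for line in block.lines:
                # Order words left to right by x-min (assumes straight boxes).
                words = sorted(line.words, key=lambda w: w.geometry[0][0])
                lines.append(" ".join(word.value for word in words))
        pages.append("\n".join(lines))
    # Separate pages with a blank line.
    return "\n\n".join(pages)


def read_pdf(file_path):
    from doctr.io import DocumentFile

    doc = DocumentFile.from_pdf(file_path)
    return document_to_text(build_predictor()(doc))
```

Calling read_pdf("scan.pdf") (a placeholder path) should then return the text line by line, without a separate de-skewing step beforehand.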