pdf, ocr, text-processing, amazon-textract

AWS Textract OCR reading PDF as single line instead of preserving line breaks


Hi, I am new to AWS Textract.

Input PDF file

I am using Amazon Textract to extract text from a PDF file. However, the output is not preserving the line breaks from the original PDF.

For example, in the PDF there are separate lines like:

Seller

Buyer

But in the Textract output, it is reading it as: Seller: Buyer:

Instead of separate lines, the text is concatenated into a single string.

I would like Textract to retain the line breaks and structure from the original PDF. The lines denote different sections so I need to preserve that formatting.

Is there any way to configure Textract to output multi-line strings instead of concatenating everything into a single line? Or does it require post-processing the Textract result to split it based on the line breaks?

Any suggestions on how to properly extract text from a PDF while keeping the original line structure would be appreciated.


Solution

  • You're in luck: Textract just released a new feature that might help with exactly this use case.

    Amazon Textract launches Layout feature to extract paragraphs, titles, and more from documents

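On the post-processing part of the question: a Textract response is a flat list of blocks, and each detected line arrives as its own `LINE` block, so the line structure can usually be rebuilt by joining those blocks with newlines. The sketch below is a minimal illustration, not an official recipe. The helper names (`text_lines`, `layout_sections`, `analyze_file`) are made up for this example. It assumes the synchronous `analyze_document` call with `FeatureTypes=["LAYOUT"]` is sufficient for the document in question; multi-page PDFs generally need the asynchronous `start_document_analysis` API instead.

```python
from typing import Any, Dict, List, Tuple

Block = Dict[str, Any]


def text_lines(blocks: List[Block]) -> List[str]:
    """Return the text of every LINE block, in the order Textract returned them."""
    return [b["Text"] for b in blocks if b.get("BlockType") == "LINE"]


def layout_sections(blocks: List[Block]) -> List[Tuple[str, str]]:
    """Group LINE text under each LAYOUT_* block (title, text, header, ...).

    Each LAYOUT_* block points at its lines through CHILD relationships.
    Layout blocks whose children are not LINE blocks are skipped here.
    """
    by_id = {b["Id"]: b for b in blocks}
    sections = []
    for block in blocks:
        if not block.get("BlockType", "").startswith("LAYOUT_"):
            continue
        child_ids = [
            child_id
            for rel in (block.get("Relationships") or [])
            if rel.get("Type") == "CHILD"
            for child_id in rel.get("Ids", [])
        ]
        lines = [
            by_id[cid]["Text"]
            for cid in child_ids
            if by_id.get(cid, {}).get("BlockType") == "LINE"
        ]
        if lines:
            sections.append((block["BlockType"], "\n".join(lines)))
    return sections


def analyze_file(path: str) -> List[Block]:
    """Hypothetical helper: send a local file to Textract with Layout enabled.

    Needs boto3 and AWS credentials, so the import is kept inside the function.
    """
    import boto3

    client = boto3.client("textract")
    with open(path, "rb") as f:
        response = client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["LAYOUT"],
        )
    return response["Blocks"]
```

With credentials configured, `print("\n".join(text_lines(analyze_file("input.pdf"))))` would print one detected line per output line (the file name is a placeholder). `layout_sections` returns the same text grouped by the layout element it belongs to, which may be closer to the "sections" described in the question.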