Hi, I am new to Amazon Textract.
I am using Amazon Textract to extract text from a PDF file. However, the output is not preserving the line breaks from the original PDF.
For example, in the PDF there are separate lines like:
Seller
Buyer
But in the Textract output, it is reading it as: Seller: Buyer:
Instead of separate lines, the text is concatenated into a single string.
I would like Textract to retain the line breaks and structure from the original PDF. The lines denote different sections so I need to preserve that formatting.
Is there any way to configure Textract to output multi-line strings instead of concatenating everything into a single line? Or does it require post-processing the Textract result to split it based on the line breaks?
Any suggestions on how to properly extract text from a PDF while keeping the original line structure would be appreciated.
You're in luck: Textract just released a new feature that might help with exactly this use case.
Amazon Textract launches Layout feature to extract paragraphs, titles, and more from documents
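
Separately from Layout, the post-processing route you mention is usually small. Textract responses contain a list of blocks, and each detected line of text comes back as its own block with BlockType "LINE", so you can rebuild the multi-line text by joining those yourself. Here is a minimal sketch with boto3, assuming a single-page document passed as bytes to the synchronous detect_document_text call (multi-page PDFs would need the asynchronous S3-based API instead). The function names lines_from_blocks and extract_lines are just illustrative names I made up, not part of any library.

```python
def lines_from_blocks(blocks):
    """Join the text of every LINE block with newlines, in the order returned."""
    line_blocks = [b for b in blocks if b.get("BlockType") == "LINE"]
    return "\n".join(b["Text"] for b in line_blocks)


def extract_lines(document_bytes):
    """Call Textract on a single-page document and return its text line by line.

    Assumes AWS credentials and a region are already configured.
    """
    import boto3  # imported here so lines_from_blocks can be used without AWS

    client = boto3.client("textract")
    response = client.detect_document_text(Document={"Bytes": document_bytes})
    return lines_from_blocks(response["Blocks"])


if __name__ == "__main__":
    # Hand-written sample shaped like a Textract response, for illustration only.
    sample_blocks = [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Seller"},
        {"BlockType": "WORD", "Text": "Seller"},
        {"BlockType": "LINE", "Text": "Buyer"},
        {"BlockType": "WORD", "Text": "Buyer"},
    ]
    print(lines_from_blocks(sample_blocks))
```

If you want the Layout elements (paragraphs, titles and so on) rather than raw lines, my understanding is that you call analyze_document with FeatureTypes=["LAYOUT"] and look at the blocks whose BlockType starts with "LAYOUT_", but check the Layout announcement and docs for the exact block types.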