amazon-web-servicesamazon-textract

AWS textract - UnsupportedDocumentException


While implementing aws textract using boto3 for python.

Code:

import boto3

# Document
documentName = "/home/niranjan/IdeaProjects/amazon-forecast-samples/notebooks/basic/Tutorial/cert.pdf"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

print(type(imageBytes))

# Amazon Textract client
textract = boto3.client('textract', region_name='us-west-2')

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

below are credential and config files of aws

niranjan@niranjan:~$ cat ~/.aws/credentials
[default]
aws_access_key_id=my_access_key_id
aws_secret_access_key=my_secret_access_key

niranjan@niranjan:~$ cat ~/.aws/config 
[default]
region=eu-west-1

I am getting this exception:

---------------------------------------------------------------------------
UnsupportedDocumentException              Traceback (most recent call last)
<ipython-input-11-f52c10e3f3db> in <module>
     14 
     15 # Call Amazon Textract
---> 16 response = textract.detect_document_text(Document={'Bytes': imageBytes})
     17 
     18 #print(response)

~/venv/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    314                     "%s() only accepts keyword arguments." % py_operation_name)
    315             # The "self" in this scope is referring to the BaseClient.
--> 316             return self._make_api_call(operation_name, kwargs)
    317 
    318         _api_call.__name__ = str(py_operation_name)

~/venv/lib/python3.7/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    624             error_code = parsed_response.get("Error", {}).get("Code")
    625             error_class = self.exceptions.from_code(error_code)
--> 626             raise error_class(parsed_response, operation_name)
    627         else:
    628             return parsed_response

UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the DetectDocumentText operation: Request has unsupported document format

I am bit new to AWS textract, any help would be much appreciated.


Solution

  • As DetectDocumentText API of Textract does not support "pdf" type of document, sending pdf you encounter UnsupportedDocumentFormat Exception. Try to send image file instead.

    Incase if you still want to send pdf file then you have to use Asynchronous APIs of Textract. E.g. StartDocumentAnalysis API to start analysis and GetDocumentAnalysis to get analyzed document.

    Detects text in the input document. Amazon Textract can detect lines of text and the words that make up a line of text. The input document must be an image in JPEG or PNG format. DetectDocumentText returns the detected text in an array of Block objects.

    https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html