
Amazon Textract without using Amazon S3


I want to extract information from PDFs using Amazon Textract (as in How to use the Amazon Textract with PDF files). All the answers and the AWS documentation require the input to be Amazon S3 objects.

Can I use Textract without uploading the PDFs to Amazon S3, by passing them directly in the REST call? (I have to store the PDFs locally.)


Solution

  • I will answer this question with the Java API in mind. The short answer is yes.

    If you look at the TextractAsyncClient Javadoc for the analyzeDocument operation:

    https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/textract/TextractAsyncClient.html#analyzeDocument-software.amazon.awssdk.services.textract.model.AnalyzeDocumentRequest-

    It states:

    "Documents for asynchronous operations can also be in PDF format"

    This means you can read a local PDF document and create an AnalyzeDocumentRequest object like this, without pulling from an Amazon S3 bucket:

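    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.List;
    import software.amazon.awssdk.core.SdkBytes;
    import software.amazon.awssdk.services.textract.TextractClient;
    import software.amazon.awssdk.services.textract.model.*;

    public static void analyzeDoc(TextractClient textractClient, String sourceDoc) {

        // try-with-resources closes the file stream when the block exits
        try (InputStream sourceStream = new FileInputStream(sourceDoc)) {
            // Read the local file into memory; no S3 object is involved
            SdkBytes sourceBytes = SdkBytes.fromInputStream(sourceStream);

            // Get the input Document object as bytes
            Document myDoc = Document.builder()
                    .bytes(sourceBytes)
                    .build();

            List<FeatureType> featureTypes = List.of(FeatureType.FORMS, FeatureType.TABLES);

            AnalyzeDocumentRequest analyzeDocumentRequest = AnalyzeDocumentRequest.builder()
                    .featureTypes(featureTypes)
                    .document(myDoc)
                    .build();

            // Pass the request to analyzeDocument; the same request object
            // also works with TextractAsyncClient.analyzeDocument
            AnalyzeDocumentResponse response = textractClient.analyzeDocument(analyzeDocumentRequest);

            // ... process response.blocks() here

        } catch (IOException | TextractException e) {
            System.err.println(e.getMessage());
        }
    }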
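
Since the question asks about the REST call itself, here is a rough sketch of the request body the SDK builds when the document is passed as bytes rather than as an S3 object. It assumes the AnalyzeDocument JSON shape with the file under Document.Bytes as Base64 text. The class and method names (TextractRequestSketch, buildAnalyzeDocumentBody) are made up for illustration and are not part of the AWS SDK. A real call also needs AWS request signing and the service headers, which the SDK handles and this sketch leaves out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class TextractRequestSketch {

    // Hypothetical helper (not an AWS SDK method): builds the JSON body for an
    // AnalyzeDocument call from raw document bytes instead of an S3Object.
    static String buildAnalyzeDocumentBody(byte[] documentBytes) {
        // JSON cannot carry raw binary, so the blob is sent as Base64 text
        String encoded = Base64.getEncoder().encodeToString(documentBytes);
        return "{\"Document\":{\"Bytes\":\"" + encoded + "\"},"
                + "\"FeatureTypes\":[\"FORMS\",\"TABLES\"]}";
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a local PDF, e.g. from Files.readAllBytes
        byte[] sample = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(buildAnalyzeDocumentBody(sample));
        // prints {"Document":{"Bytes":"JVBERi0xLjQ="},"FeatureTypes":["FORMS","TABLES"]}
    }
}
```

In practice the SDK code above is the simpler route, because SdkBytes and the client take care of the encoding and signing for you.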