
Is there a limit on the number of PDF pages that can be OCRed using AWS Textract?


I am OCRing image-based PDFs using AWS Textract.

Each PDF I have has 60+ pages,

but when I try to OCR a PDF file, I only get results for the first 4 pages of each file.

Is there a limit on the number of pages in a PDF file for AWS Textract?

I found this: https://docs.aws.amazon.com/textract/latest/dg/limits.html

but it does not mention any limit on the number of pages.

Does anyone know if there is a limit on the number of PDF pages?

And if so, how can I OCR the whole file of 60+ pages?


Solution

  • The hard limits for Textract are 1000 pages or 500 MB for PDFs.

    I think your problem is related to the way Textract returns its results in batches. Check whether the "NextToken" key in the JSON output is populated; if it is, make another request with that token, and repeat until the response no longer contains a token.
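
    As a rough sketch of that loop in Python: this assumes you started an asynchronous job with boto3's start_document_text_detection and that the job has already finished successfully. The helper names collect_textract_blocks and pages_seen are made up for illustration; only get_document_text_detection and its JobId / NextToken parameters come from the Textract API.

```python
def collect_textract_blocks(client, job_id):
    """Gather all Blocks of a finished async Textract job by following NextToken.

    `client` is assumed to be a boto3 Textract client (or anything exposing
    get_document_text_detection); `job_id` is assumed to come from an earlier
    start_document_text_detection call whose job has already SUCCEEDED.
    """
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            # Ask for the next batch of results instead of the first one.
            kwargs["NextToken"] = next_token
        response = client.get_document_text_detection(**kwargs)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:
            # No token in the response means this was the last batch.
            break
    return blocks


def pages_seen(blocks):
    """Return the sorted page numbers that appear in the collected blocks."""
    return sorted({b["Page"] for b in blocks if "Page" in b})


# Hypothetical usage (needs AWS credentials and a real, completed job):
# import boto3
# client = boto3.client("textract")
# blocks = collect_textract_blocks(client, job_id)
# print(pages_seen(blocks))
```

    If pages_seen still stops at page 4 after following every token, the cause is probably something other than pagination.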