node.js amazon-textract

amazon-textract-response-parser: Unable to construct TextractDocument with multi-page output downloaded from S3


I'm using the Node.js version of the library: "amazon-textract-response-parser": "^0.4.1"

My process is:

  1. StartDocumentAnalysisCommand with params
{
  DocumentLocation: {
    S3Object: {
      Bucket: inputBucket,
      Name: ApplicationPath,
    },
  },
  FeatureTypes: ["TABLES", "FORMS", "LAYOUT"],
  OutputConfig: {
    S3Bucket: outputBucket,
  },
}
  2. Poll for completion with GetDocumentAnalysisCommand (I realize the cost implications here; I'm working on a POC)
async function pollForCompletion({ JobId }: { JobId: string }) {
  const { JobStatus, StatusMessage } = await textractClient.send(
    new GetDocumentAnalysisCommand({
      JobId,
      MaxResults: 1000,
    })
  );

  // 15 second polling
  if (JobStatus === "IN_PROGRESS") {
    console.log("...");
    await delay(15000);
    await pollForCompletion({ JobId });
  } else {
    console.log(`Status: ${JobStatus}`);
    console.log(`Message: ${StatusMessage}`);
  }
}
  3. Download the results from the outputBucket
async function getOutputFromS3({ JobId }: { JobId: string }) {
  const outputDir = `textract_output/${JobId}`;

  const { Contents = [] } = await s3Client.send(
    new ListObjectsCommand({ Bucket: outputBucket, Prefix: outputDir })
  );

  await Promise.all(
    Contents.map(async ({ Key }) => {
      if (!Key?.includes(".s3_access_check")) {
        console.log({ Key });

        const cmd = new GetObjectCommand({ Bucket: outputBucket, Key });
        const { Body } = await s3Client.send(cmd);

        const jsonString = await Body?.transformToString();

        mkdirSync(`./${outputDir}`, { recursive: true });
        if (jsonString?.length) writeFileSync(`./${Key}.json`, jsonString);
      }
    })
  );

  return { outputDir };
}
  4. Parse the downloaded output files and load them into a TextractDocument, typecasting the array passed as per the suggestion
function loadAllOutputFilesIntoTextractDocument({
  outputDir,
}: {
  outputDir: string;
}) {
  const collectedResponses = readdirSync(outputDir).map<
    ApiAsyncJobOuputSucceded[]
  >((filePath) =>
    JSON.parse(readFileSync(`${outputDir}/${filePath}`, { encoding: "utf-8" }))
  );

  return new TextractDocument(collectedResponses as unknown as ApiResponsePage);
}

But I get this error:

loadAllFilesIntoTextractDocument:
Error: Missing parser item for block ID 8cc77710-e530-4208-836e-45043dc93411
    at Page.getItemByBlockId (<removed>/node_modules/amazon-textract-response-parser/src/document.ts:300:13)
    at FieldKeyGeneric.listWords (<removed>/node_modules/amazon-textract-response-parser/src/content.ts:324:38)
    at FieldKeyGeneric.get text [as text] (<removed>/node_modules/amazon-textract-response-parser/src/content.ts:341:19)
    at <removed>/node_modules/amazon-textract-response-parser/src/form.ts:317:34
    at Array.forEach (<anonymous>)
    at new FormGeneric (<removed>/node_modules/amazon-textract-response-parser/src/form.ts:314:15)
    at Page._parse (<removed>/node_modules/amazon-textract-response-parser/src/document.ts:272:18)
    at new Page (<removed>/node_modules/amazon-textract-response-parser/src/document.ts:227:10)
    at <removed>/node_modules/amazon-textract-response-parser/src/document.ts:1495:28
    at Array.forEach (<anonymous>)

What I take this to mean is that the output from the Textract operation hasn't maintained Block ID consistency across all the files created, though I did see this "In most cases" note in the amazon-textract-response-parser README:

In most cases, providing an array of response objects is also supported (for use when a large Amazon Textract response was split/paginated).

Am I missing something in the Textract operation parameters that would fix those IDs? Or is there something else needed when instantiating the TextractDocument? Or do I need to pass it the raw, paginated response from GetDocumentAnalysisCommand for this to work? I thought that would be strange considering there are mutation functions available with amazon-textract-response-parser.

Thanks in advance.

I also opened an issue for the library


Solution

  • I had a very helpful response from a contributor to the library in the issue I created.

    TL;DR: the readdirSync output needs to be sorted numerically by file number. The default order is alphabetical ("1.json", "11.json", "2.json", etc.), which puts the pages out of sequence. You need to instantiate the TextractDocument with an array of responses in the correct order to maintain ID associations across the files.
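
    For illustration, the ordering fix might look something like the sketch below. It is not taken from the library or from the issue thread: sortByFileNumber is a hypothetical helper name, and it assumes the downloaded files are named with a plain number followed by ".json" (as the download step in the question produces). Anything that does not match that pattern is skipped.

```typescript
// Hypothetical helper (not part of amazon-textract-response-parser):
// orders downloaded Textract output files by their numeric file name.
// Assumption: files are named "<number>.json", e.g. "1.json", "2.json", "11.json".
function sortByFileNumber(fileNames: string[]): string[] {
  return fileNames
    // Keep only names that are a number plus ".json"; other entries are ignored.
    .filter((name) => /^\d+\.json$/.test(name))
    // Compare by the leading number rather than alphabetically.
    .sort((a, b) => Number.parseInt(a, 10) - Number.parseInt(b, 10));
}

// Intended use inside the loader from the question (shown as a comment only):
// const collectedResponses = sortByFileNumber(readdirSync(outputDir)).map((fileName) =>
//   JSON.parse(readFileSync(`${outputDir}/${fileName}`, { encoding: "utf-8" }))
// );
```

    With this in place, the array passed to TextractDocument follows the order in which Textract wrote the files, rather than the order in which the directory listing happens to return them.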