I'm using the NodeJS version of the library "amazon-textract-response-parser": "^0.4.1"
My process is:
StartDocumentAnalysisCommand with params:
{
  DocumentLocation: {
    S3Object: {
      Bucket: inputBucket,
      Name: ApplicationPath,
    },
  },
  FeatureTypes: ["TABLES", "FORMS", "LAYOUT"],
  OutputConfig: {
    S3Bucket: outputBucket,
  },
}
GetDocumentAnalysisCommand (I realize the cost implications here; I'm working on a POC):
async function pollForCompletion({ JobId }: { JobId: string }) {
  const { JobStatus, StatusMessage } = await textractClient.send(
    new GetDocumentAnalysisCommand({
      JobId,
      MaxResults: 1000,
    })
  );
  // Poll every 15 seconds
  if (JobStatus === "IN_PROGRESS") {
    console.log("...");
    await delay(15000);
    await pollForCompletion({ JobId });
  } else {
    console.log(`Status: ${JobStatus}`);
    console.log(`Message: ${StatusMessage}`);
  }
}
Download the output files from outputBucket:
async function getOutputFromS3({ JobId }: { JobId: string }) {
  const outputDir = `textract_output/${JobId}`;
  const { Contents = [] } = await s3Client.send(
    new ListObjectsCommand({ Bucket: outputBucket, Prefix: outputDir })
  );
  await Promise.all(
    Contents.map(async ({ Key }) => {
      if (!Key?.includes(".s3_access_check")) {
        console.log({ Key });
        const cmd = new GetObjectCommand({ Bucket: outputBucket, Key });
        const { Body } = await s3Client.send(cmd);
        const jsonString = await Body?.transformToString();
        mkdirSync(`./${outputDir}`, { recursive: true });
        if (jsonString?.length) writeFileSync(`./${Key}.json`, jsonString);
      }
    })
  );
  return { outputDir };
}
function loadAllOutputFilesIntoTextractDocument({
  outputDir,
}: {
  outputDir: string;
}) {
  const collectedResponses = readdirSync(outputDir).map<
    ApiAsyncJobOuputSucceded[]
  >((filePath) =>
    JSON.parse(readFileSync(`${outputDir}/${filePath}`, { encoding: "utf-8" }))
  );
  return new TextractDocument(collectedResponses as unknown as ApiResponsePage);
}
But I get this error:
loadAllFilesIntoTextractDocument:
Error: Missing parser item for block ID 8cc77710-e530-4208-836e-45043dc93411
    at Page.getItemByBlockId (<removed>/node_modules/amazon-textract-response-parser/src/document.ts:300:13)
    at FieldKeyGeneric.listWords (<removed>/node_modules/amazon-textract-response-parser/src/content.ts:324:38)
    at FieldKeyGeneric.get text [as text] (<removed>/node_modules/amazon-textract-response-parser/src/content.ts:341:19)
    at <removed>/node_modules/amazon-textract-response-parser/src/form.ts:317:34
    at Array.forEach (<anonymous>)
    at new FormGeneric (<removed>/node_modules/amazon-textract-response-parser/src/form.ts:314:15)
    at Page._parse (<removed>/node_modules/amazon-textract-response-parser/src/document.ts:272:18)
    at new Page (<removed>/node_modules/amazon-textract-response-parser/src/document.ts:227:10)
    at <removed>/node_modules/amazon-textract-response-parser/src/document.ts:1495:28
    at Array.forEach (<anonymous>)
What I take this to mean is that the output from the Textract operation hasn't maintained Block ID consistency across all the files created, though I did see this "In most cases" message in the amazon-textract-response-parser README:
In most cases, providing an array of response objects is also supported (for use when a large Amazon Textract response was split/paginated).
Am I missing something in the Textract operation parameters that would fix those IDs? Or is there something else needed when instantiating the TextractDocument? Or do I need to pass it the raw, paginated response from GetDocumentAnalysisCommand in order to work? I thought that would be strange, considering there are mutation functions available with amazon-textract-response-parser.
Thanks in advance.
I also opened an issue for the library
I had a very helpful response from a contributor on the library in an issue I created.
TLDR: the readdirSync output needs to be sorted by file number (by default the files come back in alphabetical order: "1.json", "11.json", "2.json", etc.). You need to instantiate the TextractDocument with an array of responses in the correct order to maintain ID associations across the files.
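For anyone hitting the same thing, here is a minimal sketch of the sort, assuming the downloaded files are named by page number with a ".json" suffix as in my code above ("1.json", "2.json", ...). The helper name sortByFileNumber is my own and not part of amazon-textract-response-parser.

```typescript
// Hypothetical helper (not part of any library): orders file names by the
// integer at the start of each name, so "2.json" comes before "11.json".
function sortByFileNumber(fileNames: string[]): string[] {
  const fileNumber = (name: string): number => Number.parseInt(name, 10);
  return fileNames
    // Skip anything that does not start with a number.
    .filter((name) => !Number.isNaN(fileNumber(name)))
    // Numeric comparison instead of the default string comparison.
    .sort((a, b) => fileNumber(a) - fileNumber(b));
}

// Alphabetical order, as in the problem case above.
const sorted = sortByFileNumber(["1.json", "11.json", "2.json", "10.json"]);
console.log(sorted); // [ '1.json', '2.json', '10.json', '11.json' ]
```

In the loader from the question, this would mean calling sortByFileNumber(readdirSync(outputDir)) before mapping over the file names, so that the responses reach TextractDocument in page order.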