amazon-sagemaker amazon-textract amazon-comprehend

Is there a way to show a PDF in its original structure during human review custom entity labelling in AWS SageMaker?


I have modified this sample to read PDFs in tabular format. I would like to keep the tabular structure of the original PDF during the human review process. I notice that the custom worker task template uses the crowd-entity-annotation element, which seems to read only plain text. I am aware that the human review process reads from an S3 key that contains the raw text written by the Textract process.

I have been considering writing to S3 using tabulate but I don't think that is the best solution. I would like to keep the structure and still have the ability to annotate custom entities.


Solution

  • Comprehend now natively supports detecting custom-defined entities in PDF documents. To do so, you can try the following steps:

    1. Follow this GitHub README to start the annotation process for PDF documents.
    2. Once the annotations are produced, use the Comprehend CreateEntityRecognizer API to train a custom entity model for semi-structured documents.
    3. Once the entity recognizer is trained, use the StartEntitiesDetectionJob API to run inference on PDF documents.
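
Steps 2 and 3 could be sketched with boto3 roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names (`build_recognizer_request`, `build_detection_job_request`, `train_and_detect`) are hypothetical, every S3 URI, role ARN, entity type and attribute name is a placeholder you would replace with the output of your own annotation job, and English (`"en"`) is assumed as the document language.

```python
import time
from typing import Any, Dict, List


def build_recognizer_request(name: str, role_arn: str, entity_types: List[str],
                             manifest_s3_uri: str, annotations_s3_uri: str,
                             source_docs_s3_uri: str,
                             attribute_names: List[str]) -> Dict[str, Any]:
    """Keyword arguments for comprehend.create_entity_recognizer (step 2)."""
    return {
        "RecognizerName": name,
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",  # assumption: English documents
        "InputDataConfig": {
            "DataFormat": "AUGMENTED_MANIFEST",
            "EntityTypes": [{"Type": t} for t in entity_types],
            "AugmentedManifests": [{
                "S3Uri": manifest_s3_uri,
                "AnnotationDataS3Uri": annotations_s3_uri,
                "SourceDocumentsS3Uri": source_docs_s3_uri,
                "AttributeNames": attribute_names,
                # Marks the training data as annotated PDFs, not plain text.
                "DocumentType": "SEMI_STRUCTURED_DOCUMENT",
            }],
        },
    }


def build_detection_job_request(job_name: str, role_arn: str,
                                input_s3_uri: str,
                                output_s3_uri: str) -> Dict[str, Any]:
    """Keyword arguments for comprehend.start_entities_detection_job (step 3).

    The recognizer ARN is filled in later, once training has finished.
    """
    return {
        "JobName": job_name,
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",  # assumption: English documents
        "InputDataConfig": {"S3Uri": input_s3_uri,
                            "InputFormat": "ONE_DOC_PER_FILE"},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }


def train_and_detect(comprehend, recognizer_args: Dict[str, Any],
                     job_args: Dict[str, Any], poll_seconds: float = 60) -> str:
    """Train the recognizer, wait for it, then start the detection job.

    `comprehend` is a boto3 Comprehend client. Returns the detection job ID.
    """
    arn = comprehend.create_entity_recognizer(**recognizer_args)["EntityRecognizerArn"]
    # Training is asynchronous, so poll until it reaches a terminal status.
    while True:
        props = comprehend.describe_entity_recognizer(EntityRecognizerArn=arn)
        status = props["EntityRecognizerProperties"]["Status"]
        if status in ("TRAINED", "TRAINED_WITH_WARNING"):
            break
        if status in ("IN_ERROR", "STOPPED"):
            raise RuntimeError(f"Recognizer training ended with status {status}")
        time.sleep(poll_seconds)
    job = comprehend.start_entities_detection_job(
        **dict(job_args, EntityRecognizerArn=arn))
    return job["JobId"]
```

To run it against AWS, create the client with `boto3.client("comprehend")` and pass it as the first argument to `train_and_detect`. Keeping the request builders separate from the API calls lets you inspect the payloads before any training job is started.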