python, parsing, google-api, cloud-document-ai

Document AI form parsing on documents with different formats


We have a client who wants to automatically extract information from different PDF files to fill in their form. The documents all differ in format. For example, the client name is sometimes at the top of the first page of the PDF, in something like "Company name : Google Inc.", but in another document the same information is on page 30, somewhere in a sentence like "The client, Google Inc, want to[...]"

Is it possible to train a processor to parse this data across many different document types? If so, how long would it take?


Solution

  • When creating Custom Document Extractor processors, it's recommended to create a different processor for each document format/structure to produce the most consistent results.

    To extract an entity embedded in a sentence, you can try a Custom Document Extractor and see how the results turn out, provided you have enough training data. Depending on the structure, you might also be able to use the pretrained Contract Parser (allowlist access only).

    Alternatively, send the document to the Document OCR processor to extract the text, then create an AutoML Text model for entity extraction to find the specific entities within the freeform text.
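
As a rough illustration of the points above, here is a minimal Python sketch that routes each document format to its own processor and sends a PDF for online processing with the `google-cloud-documentai` client library. The format names, project ID, and processor IDs are placeholders, and `ProcessorRef`, `PROCESSORS`, `pick_processor`, and `process_pdf` are hypothetical helpers, not part of the library. It assumes the processors already exist and that credentials are configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessorRef:
    """Identifies one Document AI processor (all values below are placeholders)."""
    project_id: str
    location: str  # processor region, e.g. "us" or "eu"
    processor_id: str

    @property
    def endpoint(self) -> str:
        # Regional API endpoint the client should talk to.
        return f"{self.location}-documentai.googleapis.com"

    @property
    def name(self) -> str:
        # Full resource name expected in a ProcessRequest.
        return (
            f"projects/{self.project_id}/locations/{self.location}"
            f"/processors/{self.processor_id}"
        )


# Hypothetical routing table: one processor per document layout.
PROCESSORS = {
    "cover_page_form": ProcessorRef("my-project", "us", "extractor-id-1"),
    "freeform_contract": ProcessorRef("my-project", "us", "ocr-id-1"),
}


def pick_processor(doc_format: str) -> ProcessorRef:
    """Return the processor configured for a given document format."""
    try:
        return PROCESSORS[doc_format]
    except KeyError:
        raise ValueError(f"no processor configured for format {doc_format!r}") from None


def process_pdf(ref: ProcessorRef, pdf_path: str):
    """Send one PDF for online processing and return the resulting Document."""
    # Imported lazily so the helpers above work without the library installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=ref.endpoint)
    )
    with open(pdf_path, "rb") as f:
        raw = documentai.RawDocument(content=f.read(), mime_type="application/pdf")
    # Online requests have a page limit; long PDFs may need batch processing instead.
    result = client.process_document(
        request=documentai.ProcessRequest(name=ref.name, raw_document=raw)
    )
    return result.document
```

For example, `doc = process_pdf(pick_processor("freeform_contract"), "contract.pdf")` would return a document whose `doc.text` holds the OCR output to pass on to an entity extraction model. With a Custom Document Extractor, the extracted fields should instead appear in `doc.entities`, each with a `type_`, `mention_text`, and `confidence`.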