machine-learning google-cloud-platform nlp cloud-document-ai

In GCP's DocumentAI, when importing documents via API, is it possible to add a Document Type label?


I am creating a Custom Document Classification Processor in GCP's DocumentAI platform, and am trying to understand whether it is possible to assign a Document Type label to documents when importing them to train the Processor.

This StackOverflow answer notes that GCP's DocumentAI platform does expose an API to create a Dataset and upload documents to it. With that in mind, I know that it is possible to use the DocumentAI API to create a dataset, and then (as in the code below) to update that Dataset's schema with document types:

from google.cloud import documentai_v1beta3 as documentai

# Processor-level client (get_processor) and dataset-level client (schema and import calls).
document_processor_service_client = documentai.DocumentProcessorServiceClient()
document_service_client = documentai.DocumentServiceClient()

processor_name = 'projects/123456789/locations/us/processors/example123'

processor = document_processor_service_client.get_processor(documentai.GetProcessorRequest(name=processor_name))

# Fetch the current dataset schema so it can be modified and written back.
dataset_schema = document_service_client.get_dataset_schema(documentai.GetDatasetSchemaRequest(name=f'{processor.name}/dataset/datasetSchema'))
dataset_schema

# One entity type per document type; "document" as the base type marks it as a document-level label.
dataset_schema.document_schema.entity_types = [
    {
        "name": "test1",
        "base_types": ["document"],
        "entity_type_metadata": {},
        "display_name": "test1"
    },
    {
        "name": "test2",
        "base_types": ["document"],
        "entity_type_metadata": {},
        "display_name": "test2"
    },
    {
        "name": "test4",
        "base_types": ["document"],
        "entity_type_metadata": {},
        "display_name": "test4"
    }
]

update_schema_request = document_service_client.update_dataset_schema(documentai.UpdateDatasetSchemaRequest(dataset_schema=dataset_schema))
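
With more than a handful of document types, the literal list above gets repetitive. A minimal sketch of generating the same entries from a list of label names follows; `build_entity_types` is a hypothetical helper name, not part of the Document AI client library, and it only builds plain dicts shaped like the ones in the schema update above.

```python
def build_entity_types(labels):
    """Return one schema entry per document-type label, in input order."""
    seen = set()
    entity_types = []
    for label in labels:
        # Reject repeated names instead of silently emitting two identical entries.
        if label in seen:
            raise ValueError(f"duplicate document type: {label!r}")
        seen.add(label)
        entity_types.append(
            {
                "name": label,
                "base_types": ["document"],
                "entity_type_metadata": {},
                "display_name": label,
            }
        )
    return entity_types


entity_types = build_entity_types(["test1", "test2", "test4"])
print([entry["name"] for entry in entity_types])
```

The returned list could then be assigned to `dataset_schema.document_schema.entity_types` in place of the hand-written one.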

I know that the API also allows importing one or more documents, as in this code:

import_documents_request = document_service_client.import_documents(
    documentai.ImportDocumentsRequest(
        dataset=f"{processor.name}/dataset",
        batch_documents_import_configs=[
            documentai.ImportDocumentsRequest.BatchDocumentsImportConfig(
                auto_split_config=documentai.ImportDocumentsRequest.BatchDocumentsImportConfig.AutoSplitConfig(
                    training_split_ratio=0.7
                ),
                batch_input_config=documentai.BatchDocumentsInputConfig(
                    gcs_documents=documentai.GcsDocuments(
                        documents=[
                            documentai.GcsDocument(
                                gcs_uri="gs://path/to/document.pdf",
                                mime_type="application/pdf",
                            )
                        ]
                    )
                ),
            )
        ],
    ),
)
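
Since I have many files to import, the `documents` list in the request above would probably be generated rather than typed out. A rough sketch, under some assumptions: `build_gcs_documents` and `MIME_TYPES_BY_EXTENSION` are hypothetical names, the extension map is illustrative rather than a list of what Document AI accepts, and it assumes the client accepts plain dicts in place of `GcsDocument` objects in the same way it does for `entity_types` above. Note that this only prepares the import entries; it does not attach a Document Type label.

```python
import posixpath

# Illustrative mapping only; extend or trim it to match the files actually being imported.
MIME_TYPES_BY_EXTENSION = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def build_gcs_documents(gcs_uris):
    """Return one {"gcs_uri", "mime_type"} dict per URI, in input order."""
    documents = []
    for uri in gcs_uris:
        # Take the extension from the last path component, case-insensitively.
        extension = posixpath.splitext(uri)[1].lower()
        if extension not in MIME_TYPES_BY_EXTENSION:
            raise ValueError(f"no MIME type known for {uri!r}")
        documents.append(
            {"gcs_uri": uri, "mime_type": MIME_TYPES_BY_EXTENSION[extension]}
        )
    return documents


documents = build_gcs_documents(["gs://path/to/document.pdf"])
print(documents)
```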

When manually uploading documents in Cloud Console, there is an option for applying a Document Type label to all imported documents:

Screenshot of "Import documents" interface in Cloud Console

I can't tell from the DocumentAI documentation: Is it possible to similarly assign a Document Type label to one or more Documents via the API? Whether during upload, or after? I have a lot of documents ready to use in a training set, and just need to give each an overall Document Type label (vs. annotating specific fields in each document), so I am looking for a way to do so programmatically, rather than manually.


Solution

  • The Document AI API does not currently support applying a label on import when using the importDocuments() method. You need to use the Cloud Console to do bulk labeling.

    I would recommend adding more details to the public issue tracker nestor-ceniza-jr@ created so that this can be prioritized by the product development team.

    https://issuetracker.google.com/303285767