I am creating a Custom Document Classification Processor in GCP's DocumentAI platform, and am trying to understand whether it is possible to assign a Document Type label to documents when importing them to train the Processor.
This StackOverflow answer notes that GCP's DocumentAI platform does expose an API to create a Dataset and upload documents to it. With that in mind, I know that it is possible to use the DocumentAI API to create a dataset, and then (as in the code below) to update that Dataset's schema with document types:
from google.cloud import documentai_v1beta3 as documentai
# Processor lookups and dataset operations use two separate clients.
document_processor_service_client = documentai.DocumentProcessorServiceClient()
document_service_client = documentai.DocumentServiceClient()
processor_name = 'projects/123456789/locations/us/processors/example123'
processor = document_processor_service_client.get_processor(documentai.GetProcessorRequest(name=processor_name))
dataset_schema = document_service_client.get_dataset_schema(documentai.GetDatasetSchemaRequest(name=f'{processor.name}/dataset/datasetSchema'))
dataset_schema
# Replace the schema's entity types with one entry per document type.
dataset_schema.document_schema.entity_types = [
    {
        "name": "test1",
        "base_types": ["document"],
        "entity_type_metadata": {},
        "display_name": "test1"
    },
    {
        "name": "test2",
        "base_types": ["document"],
        "entity_type_metadata": {},
        "display_name": "test2"
    },
    {
        "name": "test4",
        "base_types": ["document"],
        "entity_type_metadata": {},
        "display_name": "test4"
    }
]
update_schema_request = document_service_client.update_dataset_schema(documentai.UpdateDatasetSchemaRequest(dataset_schema=dataset_schema))
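As a minimal sketch, the repeated dictionaries in the schema update above could be generated from a plain list of label names. The helper name `build_entity_types` is hypothetical and not part of the Document AI client library. It only builds the same dict shape used in the snippet above, and it assumes the label name and display name are identical.

```python
def build_entity_types(labels):
    """Return one schema entry per document type label.

    Hypothetical helper: mirrors the dict shape used in the snippet above.
    Blank and duplicate labels are skipped.
    """
    entity_types = []
    seen = set()
    for label in labels:
        name = label.strip()
        if not name or name in seen:
            continue
        seen.add(name)
        entity_types.append(
            {
                "name": name,
                "base_types": ["document"],
                "entity_type_metadata": {},
                "display_name": name,
            }
        )
    return entity_types


entity_types = build_entity_types(["test1", "test2", "test4"])
# The result would then be assigned as in the snippet above:
# dataset_schema.document_schema.entity_types = entity_types
```

Keeping the label list in one place makes it easier to keep the schema and any later labeling step in sync.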
I know that the API also allows importing one or more documents, as in this code:
import_documents_request = document_service_client.import_documents(
    documentai.ImportDocumentsRequest(
        dataset=f"{processor.name}/dataset",
        batch_documents_import_configs=[
            documentai.ImportDocumentsRequest.BatchDocumentsImportConfig(
                auto_split_config=documentai.ImportDocumentsRequest.BatchDocumentsImportConfig.AutoSplitConfig(
                    training_split_ratio=0.7
                ),
                batch_input_config=documentai.BatchDocumentsInputConfig(
                    gcs_documents=documentai.GcsDocuments(
                        documents=[
                            documentai.GcsDocument(
                                gcs_uri="gs://path/to/document.pdf",
                                mime_type="application/pdf",
                            )
                        ]
                    )
                ),
            )
        ],
    ),
)
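When importing many files, the `gcs_uri` / `mime_type` pairs passed to `GcsDocument` above could be prepared from a list of URIs. The sketch below is plain Python. The helper name `gcs_document_kwargs` and the `MIME_TYPES` mapping are assumptions made for illustration, and the mapping is not an authoritative list of the types Document AI accepts.

```python
import posixpath

# Illustrative extension-to-MIME mapping (an assumption, not an official list).
MIME_TYPES = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def gcs_document_kwargs(gcs_uris):
    """Return a list of {"gcs_uri", "mime_type"} dicts, one per URI."""
    specs = []
    for uri in gcs_uris:
        if not uri.startswith("gs://"):
            raise ValueError(f"Not a Cloud Storage URI: {uri}")
        extension = posixpath.splitext(uri)[1].lower()
        if extension not in MIME_TYPES:
            raise ValueError(f"No MIME type known for: {uri}")
        specs.append({"gcs_uri": uri, "mime_type": MIME_TYPES[extension]})
    return specs


specs = gcs_document_kwargs(["gs://path/to/document.pdf", "gs://path/to/scan.PNG"])
# Each dict could then be expanded into the constructor used above:
# documents = [documentai.GcsDocument(**spec) for spec in specs]
```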
When manually uploading documents in Cloud Console, there is an option for applying a Document Type label to all imported documents.
I can't tell from the DocumentAI documentation: is it possible to similarly assign a Document Type label to one or more documents via the API, whether during upload or afterwards? I have a lot of documents ready to use in a training set, and just need to give each an overall Document Type label (vs. annotating specific fields in each document), so I am looking for a way to do so programmatically, rather than manually.
The Document AI API does not currently support applying a label on import when using the importDocuments() method. You need to use the Cloud Console to do bulk labeling.
I would recommend adding more details to the public issue tracker nestor-ceniza-jr@ created so that this can be prioritized by the product development team.
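Since the Console option applies one Document Type label to every document in an import, it may help to group the files by intended label first, so that each group can be imported and labeled in a single pass. The following is a minimal sketch of that grouping step only. It assumes you already have (URI, label) pairs, and the function name and example paths are hypothetical. It does not call the Document AI API.

```python
from collections import defaultdict


def group_uris_by_label(labeled_uris):
    """Group (gcs_uri, label) pairs into {label: [gcs_uri, ...]}."""
    batches = defaultdict(list)
    for uri, label in labeled_uris:
        batches[label].append(uri)
    # Sort for a stable, reviewable ordering.
    return {label: sorted(uris) for label, uris in sorted(batches.items())}


batches = group_uris_by_label(
    [
        ("gs://path/to/b.pdf", "test2"),
        ("gs://path/to/a.pdf", "test1"),
        ("gs://path/to/c.pdf", "test1"),
    ]
)
for label, uris in batches.items():
    print(f"{label}: {len(uris)} document(s)")
```

Each group could then be copied to its own Cloud Storage folder and imported through the Console with the matching label.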