The API documentation shows that the DocumentSchema has EntityType children, which should contain details of all fields in a Custom Extractor. I am able to obtain the DocumentSchema as expected. However, the EntityType comes back with no properties instead of listing all of the fields.
You can see all of the fields via the console UI:
Here is the code demonstrating the issue:
import os

from google.cloud import documentai_v1 as documentai
from google.api_core.client_options import ClientOptions

# Set the environment variable for Google Cloud credentials
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = './google-credentials.json'


def print_schema_fields(schema):
    print("\nSchema Fields:")
    for entity_type in schema.entity_types:
        if entity_type.name == "custom_extraction_document_type":
            print(f"Entity Type Name: {entity_type.name}")
            print(f"Base Types: {entity_type.base_types}")
            if hasattr(entity_type, 'properties'):
                print(f"Properties found: {len(entity_type.properties)}")
                for property in entity_type.properties:
                    print(f"Property: {property.name}")
            else:
                print("No properties attribute found")


def get_processor_schema(client, processor_name):
    processor = client.get_processor(name=processor_name)
    versions = client.list_processor_versions(parent=processor.name)
    latest_version = next(iter(versions), None)
    if latest_version:
        print(f"Processor Version: {latest_version.display_name}")
        schema = getattr(latest_version, 'document_schema', None)
        if schema:
            print(f"Schema Name: {schema.display_name}")
            print_schema_fields(schema)
        else:
            print("No schema found")


# Setup
project_id = 'api-pr....25020'
location = 'us'
opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
client = documentai.DocumentProcessorServiceClient(client_options=opts)

# Get custom extractors
parent = f"projects/{project_id}/locations/{location}"
processors = client.list_processors(parent=parent)
custom_extractors = [p for p in processors if p.type_ == "CUSTOM_EXTRACTION_PROCESSOR"]
for extractor in custom_extractors:
    print(f"\nProcessor: {extractor.display_name}")
    get_processor_schema(client, extractor.name)
Here is the output, showing successful retrieval of the DocumentSchema but an EntityType with no properties:
Processor: TaxProcessor
Processor Version: Google Stable
Schema Name: CDE Schema
Schema Fields:
Entity Type Name: custom_extraction_document_type
Base Types: ['document']
Properties found: 0
Am I incorrectly using the API? How can I read the configured fields for a Custom Extractor (and can I create/update them?)
Thanks,
Stu
Your code produces correct output on my end, but only for processor versions that were actually trained on the dataset. It seems that pretrained models (like Google Stable) do not have a populated schema associated with them, only a stub:
Processor Version: Google Stable
display_name: "CDE Schema"
description: "Document Schema for the CDE Processor"
entity_types {
  name: "custom_extraction_document_type"
  base_types: "document"
}
If you instead take a version that you have trained yourself, it should show you all the properties:
Processor Version: few-shot-test
display_name: "CDE Schema"
description: "Document Schema for the CDE Processor"
entity_types {
  name: "custom_extraction_document_type"
  base_types: "document"
  properties {
    name: "manufacturer"
    value_type: "string"
    occurrence_type: OPTIONAL_MULTIPLE
  }
  properties {
    name: "address"
    value_type: "address"
    occurrence_type: OPTIONAL_MULTIPLE
  }
  properties {
    name: "description"
    value_type: "string"
    occurrence_type: OPTIONAL_MULTIPLE
  }
  ...
}
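Since `next(iter(versions), None)` in the question simply takes whichever version is listed first, one option is to scan for a version whose schema actually carries properties. Below is a minimal sketch, assuming only the attribute names visible in the outputs above (`document_schema`, `entity_types`, `properties`). The helper name `first_version_with_fields` is made up for illustration, and the `SimpleNamespace` objects are stand-ins, not real API responses:

```python
from types import SimpleNamespace


def first_version_with_fields(versions):
    """Return the first version whose schema has an entity type with properties, else None."""
    for version in versions:
        schema = getattr(version, "document_schema", None)
        if schema is None:
            continue
        for entity_type in schema.entity_types:
            if len(entity_type.properties) > 0:
                return version
    return None


# Stand-in for a pretrained version that only has the stub schema.
stub = SimpleNamespace(
    display_name="Google Stable",
    document_schema=SimpleNamespace(
        entity_types=[
            SimpleNamespace(name="custom_extraction_document_type", properties=[])
        ]
    ),
)

# Stand-in for a version trained on the dataset, with one configured field.
trained = SimpleNamespace(
    display_name="few-shot-test",
    document_schema=SimpleNamespace(
        entity_types=[
            SimpleNamespace(
                name="custom_extraction_document_type",
                properties=[SimpleNamespace(name="manufacturer")],
            )
        ]
    ),
)

chosen = first_version_with_fields([stub, trained])
print(chosen.display_name if chosen else "No version with a populated schema")
```

With the real client you would presumably pass the result of `client.list_processor_versions(parent=processor.name)` instead of the stand-ins.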
You can use a pretrained model with your own schema by setting the schema_override parameter of ProcessOptions:
from google.cloud import documentai_v1beta3

process_options = documentai_v1beta3.ProcessOptions(
    schema_override=documentai_v1beta3.DocumentSchema.from_json(json_str)
)
The JSON schema should have the following format:
{
  "displayName": "CDE Schema",
  "description": "Document Schema for the CDE Processor",
  "entityTypes": [
    {
      "name": "custom_extraction_document_type",
      "baseTypes": [
        "document"
      ],
      "properties": [
        {
          "name": "description",
          "valueType": "string",
          "occurrenceType": 2,
          "propertyMetadata": {
            "inactive": false
          },
          "description": "",
          "displayName": ""
        }
        ...
      ],
      "entityTypeMetadata": {
        "inactive": false
      },
      "displayName": "",
      "description": ""
    }
  ]
}
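If you would rather not write that JSON by hand, it can be generated from a list of field names and value types. This is only a sketch using the standard library: `build_schema_json` is a hypothetical helper, the key names are copied from the format above, and `occurrence_type=2` is simply the value used in that example:

```python
import json


def build_schema_json(fields, occurrence_type=2):
    """Build a schema JSON string from (name, value_type) pairs, mirroring the format above."""
    properties = [
        {
            "name": name,
            "valueType": value_type,
            "occurrenceType": occurrence_type,
            "propertyMetadata": {"inactive": False},
            "description": "",
            "displayName": "",
        }
        for name, value_type in fields
    ]
    schema = {
        "displayName": "CDE Schema",
        "description": "Document Schema for the CDE Processor",
        "entityTypes": [
            {
                "name": "custom_extraction_document_type",
                "baseTypes": ["document"],
                "properties": properties,
                "entityTypeMetadata": {"inactive": False},
                "displayName": "",
                "description": "",
            }
        ],
    }
    return json.dumps(schema, indent=2)


# Field names and value types taken from the trained-version output earlier in this answer.
json_str = build_schema_json([("manufacturer", "string"), ("address", "address")])
print(json_str)
```

The resulting `json_str` is what would be handed to `DocumentSchema.from_json` for the schema override.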
UPDATE
You can get access to the full schema by using GetDatasetSchemaRequest, as described in this post:
from google.cloud import documentai_v1beta3

client = documentai_v1beta3.DocumentServiceClient()
schema_request = documentai_v1beta3.GetDatasetSchemaRequest(
    name=f"projects/{project_id}/locations/us/processors/{processor_id}/dataset/datasetSchema"
)
schema = client.get_dataset_schema(request=schema_request)
Updating the schema:
updated_schema = client.update_dataset_schema(
    documentai_v1beta3.UpdateDatasetSchemaRequest(dataset_schema=new_schema)
)
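The dataset schema resource name above hard-codes the `us` location and needs a separate `processor_id`. Since the question's code already has the full processor resource name in `extractor.name`, the schema name can be derived from it. A small sketch, where `dataset_schema_name` is a hypothetical helper and the example processor name is a placeholder:

```python
def dataset_schema_name(processor_name):
    """Append the dataset schema suffix to a full processor resource name."""
    return processor_name.rstrip("/") + "/dataset/datasetSchema"


# Placeholder processor name in the projects/.../locations/.../processors/... form.
example = dataset_schema_name("projects/my-project/locations/us/processors/abc123")
print(example)
```

In the question's loop this would be called as `dataset_schema_name(extractor.name)` and the result passed as `name=` to `GetDatasetSchemaRequest`.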