python google-cloud-platform google-cloud-storage google-cloud-datastore google-cloud-vertex-ai

How do I upload unstructured documents with metadata to a Google Cloud Platform data store with the Python SDK?

I am trying to import unstructured data into a Google Cloud Platform (GCP) data store from a Cloud Storage bucket using the Python SDK. I want to use unstructured data with metadata, as described here. The process involves:

1. Creating a GCP data store, which I have done according to this documentation. I have set up all the necessary access and set the content config to CONTENT_REQUIRED.
2. Creating a Cloud Storage bucket that contains 4 PDF documents (for now) and a .jsonl metadata file, all at the root of the bucket.
3. Populating the data store with a document import request that uses the documents in the Cloud Storage bucket.

The code I am attempting to use for point 3 is below. I copied it from Google's documentation (second code snippet under "Import Documents").


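from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine

# PROJECT_ID, LOCATION, DATA_STORE_ID, and GCS_URI must be defined before this point.
client_options = (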
    ClientOptions(api_endpoint=f"{LOCATION}-discoveryengine.googleapis.com")
    if LOCATION != "global"
    else None
)

# Create a client
client = discoveryengine.DocumentServiceClient(client_options=client_options)

parent = client.branch_path(
    project=PROJECT_ID,
    location=LOCATION,
    data_store=DATA_STORE_ID,
    branch="default_branch",
)

request = discoveryengine.ImportDocumentsRequest(
    parent=parent,
    gcs_source=discoveryengine.GcsSource(
        # Multiple URIs are supported
        input_uris=[GCS_URI],
        # Options:
        # - `content` - Unstructured documents (PDF, HTML, DOC, TXT, PPTX)
        # - `custom` - Unstructured documents with custom JSONL metadata
        # - `document` - Structured documents in the discoveryengine.Document format.
        # - `csv` - Unstructured documents with CSV metadata
        data_schema="custom",
    ),
    id_field="id",
    # Options: `FULL`, `INCREMENTAL`
    reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.FULL,
)

# Make the request
operation = client.import_documents(request=request)

print(f"Waiting for operation to complete: {operation.operation.name}")
response = operation.result()
print(response)

The GCS_URI variable holds the gsutil URI of the metadata.jsonl file (gs://meta-data-testing/metadata.jsonl), and that file looks like this:

{"id": "1", "structData": {"title": "Coldsmokesubmittal", "category": "212027"}, "content": {"mimeType": "application/pdf", "uri": "gs://meta-data-testing/ColdSmokeSubmittal.pdf"}}
{"id": "2", "structData": {"title": "Defssubmittal", "category": "212027"}, "content": {"mimeType": "application/pdf", "uri": "gs://meta-data-testing/DEFSSubmittal.pdf"}}
{"id": "3", "structData": {"title": "Cmu Submittal", "category": "222039"}, "content": {"mimeType": "application/pdf", "uri": "gs://meta-data-testing/CMU_Submittal.pdf"}}
{"id": "4", "structData": {"title": "Concrete Mix Submittal", "category": "222039"}, "content": {"mimeType": "application/pdf", "uri": "gs://meta-data-testing/Concrete_Mix_Submittal.pdf"}}
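A metadata file of this shape can be generated rather than typed by hand. The following is a minimal sketch, assuming every record carries only the id, structData (title, category), and content (mimeType, uri) fields shown above; the helper names metadata_line and write_metadata are hypothetical and not part of any Google SDK.

```python
import json


def metadata_line(doc_id, title, category, uri, mime_type="application/pdf"):
    """Return one JSON Lines record in the shape shown in the question."""
    record = {
        "id": doc_id,
        "structData": {"title": title, "category": category},
        "content": {"mimeType": mime_type, "uri": uri},
    }
    return json.dumps(record)


def write_metadata(path, rows):
    """Write one record per line; rows is an iterable of (id, title, category, uri)."""
    with open(path, "w", encoding="utf-8") as handle:
        for doc_id, title, category, uri in rows:
            handle.write(metadata_line(doc_id, title, category, uri) + "\n")
```

The resulting file would still need to be uploaded to the bucket alongside the PDFs it references.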

When I run my code, I get this response:

error_samples {
  code: 3
  message: "To create document without content, content config of data store must be NO_CONTENT."
  details {
    type_url: "type.googleapis.com/google.rpc.ResourceInfo"
    value: "\022\'gs://meta-data-testing/metadata.jsonl:1"
  }
}

The same error repeats 3 more times, once for each remaining line of my .jsonl file.

If anyone has successfully imported unstructured documents with metadata, please tell me where I am going wrong, or describe a method that worked for you.

My Solution Attempts

Change Data Store Config

The error tells me to change the data store config to NO_CONTENT, but when I do that, only the metadata is imported into the data store and I cannot actually search the documents via my Vertex AI app. I think this error might be a side effect of whatever the real issue is.

Upload via GCP

I have also tried uploading manually through the GCP console, but I get this error when I try:

message: "INVALID_FORMAT gcsInputuri"
status: {
  @type: "type.googleapis.com/google.rpc.Status"
  code: 3
  message: "The provided GCS URI has invalid unstructured data format. Please provide a valid GCS path in either NDJSON(.ndjson) or JSON Lines(.jsonl) format."
}

Solution

Fix

Change data_schema to:

data_schema="document"

Remove the id_field parameter, which applies only to the custom and csv schemas.

You still only need to provide the URI of the metadata file. Each of its lines has a content.uri field that points to a document in your Storage bucket (as shown in the question). See clearer definitions of the data_schema parameter here.
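Put together, the corrected import might look like the sketch below. It reuses the calls from the question's snippet with the two changes applied; the function names api_endpoint and import_documents_with_metadata are hypothetical, and the SDK imports are deferred into the function only so the endpoint helper can be used without the google-cloud-discoveryengine package installed.

```python
from typing import Optional


def api_endpoint(location: str) -> Optional[str]:
    """Return the regional endpoint, or None to use the global default."""
    if location == "global":
        return None
    return f"{location}-discoveryengine.googleapis.com"


def import_documents_with_metadata(project_id, location, data_store_id, gcs_uri):
    """Import the documents listed in a metadata .jsonl file stored in GCS."""
    # Deferred imports: these require the google-cloud-discoveryengine package.
    from google.api_core.client_options import ClientOptions
    from google.cloud import discoveryengine

    endpoint = api_endpoint(location)
    client = discoveryengine.DocumentServiceClient(
        client_options=ClientOptions(api_endpoint=endpoint) if endpoint else None
    )
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )
    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        gcs_source=discoveryengine.GcsSource(
            input_uris=[gcs_uri],
            # Each JSONL line is read as a document (id, structData, content).
            data_schema="document",
        ),
        # id_field is intentionally omitted for this schema.
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.FULL,
    )
    operation = client.import_documents(request=request)
    # Blocks until the long-running import operation finishes.
    return operation.result()
```

Calling import_documents_with_metadata(PROJECT_ID, LOCATION, DATA_STORE_ID, "gs://meta-data-testing/metadata.jsonl") would then return the same kind of response object that the question prints.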

Why

The error saying that the data store config must be "NO_CONTENT" is correct: with data_schema="custom", each line is imported as metadata only, with no document content attached, so a data store that requires content rejects it.
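Since a data store that requires content rejects records without it, it can help to check the metadata file locally before starting an import. This is only a sketch of such a pre-flight check, assuming the id, content.mimeType, and content.uri fields used in the question; find_content_problems is a hypothetical helper, and it does not reproduce the validation the service itself performs.

```python
import json


def find_content_problems(jsonl_text):
    """Return human-readable problems for lines lacking an id or content reference."""
    problems = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # Ignore blank lines.
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {line_no}: expected a JSON object")
            continue
        if not record.get("id"):
            problems.append(f"line {line_no}: missing id")
        content = record.get("content") or {}
        for key in ("mimeType", "uri"):
            if not content.get(key):
                problems.append(f"line {line_no}: missing content.{key}")
    return problems
```

An empty list means every line carries an id and a content reference. It says nothing about whether the referenced objects exist in the bucket.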

Results

I made the changes and the import succeeded! I added a project_number field to the metadata, which Vertex AI was able to parse out.

I tested using the API and was able to filter results by the project_number.
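For reference, a filter on a text metadata field can be expressed with the ANY(...) form of the Vertex AI Search filter syntax. The sketch below only builds that string; it assumes project_number is stored as a string and is marked as indexable in the data store schema, and both the helper name any_filter and the value "1001" are hypothetical.

```python
import json


def any_filter(field, values):
    """Build a filter such as: project_number: ANY("1001")."""
    # json.dumps adds the double quotes and escapes any embedded quotes.
    quoted = ", ".join(json.dumps(str(value)) for value in values)
    return f"{field}: ANY({quoted})"


print(any_filter("project_number", ["1001"]))
```

The resulting string would be passed as the filter argument of a discoveryengine.SearchRequest. A numeric field would need a different filter expression.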