I am trying to upload unstructured data to a Google Cloud Platform (GCP) data store from a GCP Storage Bucket using the Python SDK. I want to use unstructured data with meta data, which is mentioned here. The process involves:
1. A data store with CONFIG set to CONTENT REQUIRED
2. The documents and a .jsonl meta data file, which are all at the root of my bucket
3. Importing the documents into the data store
The code I am attempting to use for Point 3 is below. I copied it from Google's documentation (second code snippet under "Import Documents").
from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine

# PROJECT_ID, LOCATION, DATA_STORE_ID and GCS_URI are defined elsewhere.
client_options = (
    ClientOptions(api_endpoint=f"{LOCATION}-discoveryengine.googleapis.com")
    if LOCATION != "global"
    else None
)
# Create a client
client = discoveryengine.DocumentServiceClient(client_options=client_options)
parent = client.branch_path(
    project=PROJECT_ID,
    location=LOCATION,
    data_store=DATA_STORE_ID,
    branch="default_branch",
)
request = discoveryengine.ImportDocumentsRequest(
    parent=parent,
    gcs_source=discoveryengine.GcsSource(
        # Multiple URIs are supported
        input_uris=[GCS_URI],
        # Options:
        # - `content` - Unstructured documents (PDF, HTML, DOC, TXT, PPTX)
        # - `custom` - Unstructured documents with custom JSONL metadata
        # - `document` - Structured documents in the discoveryengine.Document format.
        # - `csv` - Unstructured documents with CSV metadata
        data_schema="custom",
    ),
    id_field="id",
    # Options: `FULL`, `INCREMENTAL`
    reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.FULL,
)
# Make the request
operation = client.import_documents(request=request)
print(f"Waiting for operation to complete: {operation.operation.name}")
response = operation.result()
print(response)
The GCS_URI variable is the gsutil URI of the metadata.jsonl file (gs://meta-data-testing/metadata.jsonl), and that file looks like this:
{"id": "1", "structData": {"title": "Coldsmokesubmittal", "category": "212027"}, "content": {"mimeType": "application/pdf", "uri": "gs://meta-data-testing/ColdSmokeSubmittal.pdf"}}
{"id": "2", "structData": {"title": "Defssubmittal", "category": "212027"}, "content": {"mimeType": "application/pdf", "uri": "gs://meta-data-testing/DEFSSubmittal.pdf"}}
{"id": "3", "structData": {"title": "Cmu Submittal", "category": "222039"}, "content": {"mimeType": "application/pdf", "uri": "gs://meta-data-testing/CMU_Submittal.pdf"}}
{"id": "4", "structData": {"title": "Concrete Mix Submittal", "category": "222039"}, "content": {"mimeType": "application/pdf", "uri": "gs://meta-data-testing/Concrete_Mix_Submittal.pdf"}}
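For anyone building a file like this, here is a minimal sketch of how lines in this shape could be generated with only the Python standard library. The helper names metadata_line and write_metadata are hypothetical and not part of any Google SDK; the field names simply mirror the sample lines above.

```python
import json


def metadata_line(doc_id, title, category, uri, mime_type="application/pdf"):
    """Build one JSON Lines record in the same shape as the sample file."""
    record = {
        "id": doc_id,
        "structData": {"title": title, "category": category},
        "content": {"mimeType": mime_type, "uri": uri},
    }
    return json.dumps(record)


def write_metadata(path, rows):
    """Write one record per line; each row is a dict of metadata_line() arguments."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(metadata_line(**row) + "\n")
```

Calling metadata_line("1", "Coldsmokesubmittal", "212027", "gs://meta-data-testing/ColdSmokeSubmittal.pdf") reproduces the first line of the file shown above.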
When I run my code, I get this response:
error_samples {
  code: 3
  message: "To create document without content, content config of data store must be NO_CONTENT."
  details {
    type_url: "type.googleapis.com/google.rpc.ResourceInfo"
    value: "\022\'gs://meta-data-testing/metadata.jsonl:1"
  }
}
This repeats 3 more times, once for each remaining line of my .jsonl file.
If anyone has tried adding unstructured documents with meta data, please tell me where I am going wrong, or share a method that worked for you.
I see it is telling me to change the data store config to NO_CONTENT, but when I do that, only the meta data is uploaded to the data store and I am not able to actually perform a search on the documents via my Vertex AI app. I think this error might be a secondary output from whatever the real issue is.
I have tried manually uploading on GCP itself, but I get this error when I try:
message: "INVALID_FORMAT gcsInputuri"
status: {
  @type: "type.googleapis.com/google.rpc.Status"
  code: 3
  message: "The provided GCS URI has invalid unstructured data format. Please provide a valid GCS path in either NDJSON(.ndjson) or JSON Lines(.jsonl) format."
}
Two changes to the import request fixed this:
1. Change data_schema to data_schema="document"
2. Remove the id_field input parameter
You still only have to provide the URI to the meta data file, whose lines each contain a content.uri field linking to a document in your Storage Bucket (as shown in the question). See clearer definitions of the data_schema parameter here.
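A minimal sketch of the import with those two changes applied, assuming the google-cloud-discoveryengine package and the same client setup as in the question. The value used is `document`, matching the option listed in the question's code comments, and id_field is left out because, as far as I understand the API reference, it is only accepted for the `custom` and `csv` schemas. The function names gcs_source_args and import_with_metadata are hypothetical wrappers, not SDK functions.

```python
def gcs_source_args(gcs_uri, data_schema="document"):
    """Keyword arguments for discoveryengine.GcsSource (hypothetical helper)."""
    return {"input_uris": [gcs_uri], "data_schema": data_schema}


def import_with_metadata(project_id, location, data_store_id, gcs_uri):
    """Import documents listed in a metadata .jsonl file; returns the operation result."""
    # Imported lazily so gcs_source_args() stays usable without the SDK installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud import discoveryengine

    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )
    client = discoveryengine.DocumentServiceClient(client_options=client_options)
    request = discoveryengine.ImportDocumentsRequest(
        parent=client.branch_path(
            project=project_id,
            location=location,
            data_store=data_store_id,
            branch="default_branch",
        ),
        # gcs_uri points at the metadata file, not at the PDFs themselves.
        gcs_source=discoveryengine.GcsSource(**gcs_source_args(gcs_uri)),
        # No id_field here: each JSONL line already carries its own "id".
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.FULL,
    )
    operation = client.import_documents(request=request)
    return operation.result()
```

With the question's setup this would be called with gcs_uri="gs://meta-data-testing/metadata.jsonl", and the returned response can be printed to check for error_samples.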
The error specifying you must have "NO_CONTENT" in the data store config is correct because using data_schema="custom" means we are only uploading meta data to search through.
I made the changes and uploaded successfully! I added a project_number field to the meta data, which Vertex AI was able to parse out. I tested using the API and was able to filter results by the project_number.
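A rough sketch of what such a filtered search might look like through the Python SDK. Several things here are assumptions rather than details from my setup: that project_number is stored as a string and marked filterable in the data store schema, that the serving config ID is "default_config", and the helper names any_filter and search_by_project, which are hypothetical.

```python
def any_filter(field, values):
    """Build a filter expression of the form: field: ANY("a", "b")."""
    quoted = ", ".join(f'"{v}"' for v in values)
    return f"{field}: ANY({quoted})"


def search_by_project(project_id, location, data_store_id, query, project_number):
    """Run a search restricted to documents with the given project_number."""
    # Imported lazily so any_filter() stays usable without the SDK installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud import discoveryengine

    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )
    client = discoveryengine.SearchServiceClient(client_options=client_options)
    serving_config = client.serving_config_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        serving_config="default_config",  # assumed serving config ID
    )
    request = discoveryengine.SearchRequest(
        serving_config=serving_config,
        query=query,
        filter=any_filter("project_number", [project_number]),
        page_size=10,
    )
    # The pager yields one search result per matching document.
    return list(client.search(request=request))
```

If the filter is rejected, the first thing to check is whether the field is marked as filterable in the data store's schema.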