I am using a GCP workflow with an Eventarc trigger connected to Cloud Storage, so that a document is evaluated by Document AI as soon as the bucket receives it. The issue I'm encountering is that whenever I try to evaluate the document, I get a "memory limit exceeded" error. My test document is only about 120 KB, and from what I've read a workflow can handle a response size of up to 2 MB. Originally I thought it was because I was logging the response just to see what it looked like, so I switched to having it stored in a separate storage bucket instead, but I keep getting the same error. Is it because the response coming from Document AI is too large, so I need to compress it before saving it to the bucket? Below is my current YAML code:
main:
    params: [event]
    steps:
        - start:
            call: sys.log
            args:
                text: ${event}
        - vars:
            assign:
                - file_name: ${event.data.name}
                - mime_type: ${event.data.contentType}
                - input_gcs_bucket: ${event.data.bucket}
        - batch_doc_process:
            call: googleapis.documentai.v1.projects.locations.processors.process
            args:
                name: ${"projects/" + sys.get_env("GOOGLE_CLOUD_PROJECT_ID") + "/locations/" + sys.get_env("LOCATION") + "/processors/" + sys.get_env("PROCESSOR_ID")}
                location: ${sys.get_env("LOCATION")}
                body:
                    gcsDocument:
                        gcsUri: ${"gs://" + input_gcs_bucket + "/" + file_name}
                        mimeType: ${mime_type}
                    skipHumanReview: true
            result: doc_process_resp
        - store_process_resp:
            call: googleapis.storage.v1.objects.insert
            args:
                bucket: ${sys.get_env("OUTPUT_GCS_BUCKET")}
                name: ${file_name}
                body: ${doc_process_resp}
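For reference, the event.data fields read in the vars step come straight from the Cloud Storage object metadata that Eventarc delivers with the storage.object.finalized event. Shown here as YAML for readability, with made-up values, it looks roughly like this:

event:
    data:
        bucket: my-input-bucket          # hypothetical bucket name
        name: test-document.pdf          # hypothetical object name
        contentType: application/pdf
        size: "123456"                   # the object metadata also includes generation, timeCreated, etc.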
I just had to switch from a single-document process request to a batch process request. With batchProcess I can specify the storage bucket the results should go to once processing is done, so Document AI writes the output straight to Cloud Storage instead of returning the full document in the workflow variable, which seems to be what was exceeding the memory limit. So the step goes from this:
- batch_doc_process:
    call: googleapis.documentai.v1.projects.locations.processors.process
    args:
        name: ${"projects/" + sys.get_env("GOOGLE_CLOUD_PROJECT_ID") + "/locations/" + sys.get_env("LOCATION") + "/processors/" + sys.get_env("PROCESSOR_ID")}
        location: ${sys.get_env("LOCATION")}
        body:
            gcsDocument:
                gcsUri: ${"gs://" + input_gcs_bucket + "/" + file_name}
                mimeType: ${mime_type}
            skipHumanReview: true
    result: doc_process_resp
to this one:
- batch_doc_process:
    call: googleapis.documentai.v1.projects.locations.processors.batchProcess
    args:
        name: ${"projects/" + sys.get_env("GOOGLE_CLOUD_PROJECT_ID") + "/locations/" + sys.get_env("LOCATION") + "/processors/" + sys.get_env("PROCESSOR_ID")}
        location: ${sys.get_env("LOCATION")}
        body:
            inputDocuments:
                gcsDocuments:
                    documents:
                        - gcsUri: ${"gs://" + input_gcs_bucket + "/" + file_name}
                          mimeType: ${mime_type}
            documentOutputConfig:
                gcsOutputConfig:
                    gcsUri: ${sys.get_env("OUTPUT_GCS_BUCKET")}
            skipHumanReview: true
    result: doc_process_resp
A small change, but one that actually allows it to work properly.
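In case it helps anyone else: batchProcess runs as a long-running operation (the connector waits for it to complete by default) and writes its results as one or more JSON files under the gcsUri you pass in documentOutputConfig, so that value should be a full gs:// URI. As a rough sketch, a follow-up step could list those output files; the OUTPUT_GCS_BUCKET_NAME environment variable here is hypothetical and assumed to hold just the bucket name, since the Storage connector takes a bare bucket name rather than a gs:// URI:

- list_batch_output:
    call: googleapis.storage.v1.objects.list
    args:
        bucket: ${sys.get_env("OUTPUT_GCS_BUCKET_NAME")}  # hypothetical: bare bucket name, no gs:// prefix
    result: output_objects
- log_batch_output:
    call: sys.log
    args:
        text: ${output_objects}  # the batch results land under an operation-specific folder in this bucket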