I am trying to batch process a set of documents using Document AI and its Java SDK. My code is derived from the batch processing example for Java (seen here), but I have modified it to add more than one document (40 documents of up to 5 pages each).
I wait for the result of the batch processing using the same code as in the example:
// Batch process document using a long-running operation.
// You can wait for now, or get results later.
// Note: first request to the service takes longer than subsequent
// requests.
System.out.println("Waiting for operation to complete...");
future.get();
System.out.println("Document processing complete.");
After a bit less than 5 minutes, I always get the following exception:
feb. 06, 2024 6:34:08 EM com.google.api.gax.longrunning.OperationTimedPollAlgorithm shouldRetry
VARNING: The task has been cancelled. Please refer to https://github.com/googleapis/google-cloud-java#lro-timeouts for more information
java.util.concurrent.CancellationException: Task was cancelled.
at com.google.common.util.concurrent.AbstractFuture.cancellationExceptionWithCause(AbstractFuture.java:1560)
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:590)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:571)
at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:91)
at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:67)
at com.google.api.gax.longrunning.OperationFutureImpl.get(OperationFutureImpl.java:125)
at ...
What can I do to avoid this timeout? I have also tried a smaller number of documents (25), but that times out as well.
From the link listed in the error message:
The polling operations have a default timeout that varies from service to service.
The library will throw a java.util.concurrent.CancellationException
with the message:
Task was cancelled.
if the timeout exceeds the operation. A CancellationException
does not mean that the backend GCP Operation was cancelled. This exception is thrown from the
client library when it has exceeded the total timeout without receiving a successful status from the operation.
Our client libraries respect the configured values set in the OperationTimedPollAlgorithm for each RPC.
Note: The client library handles the Operation's polling mechanism for you. By default, there is no need to manually poll the status yourself.
You don't need to continuously poll a long-running operation, and doing so is not advised when processing a large number of documents, because the operation can take a long time. Instead of polling or waiting for it to complete, you can check the output Google Cloud Storage bucket later, once the operation has finished.
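Since the CancellationException only means the client stopped polling, one option is to catch it and treat the batch as still in progress. The following is a minimal sketch of that idea using only the JDK. The class name `DeferredBatchResult`, the method `waitOrDefer`, and the enum `WaitOutcome` are hypothetical names made up for illustration, and a cancelled `CompletableFuture` stands in for the `OperationFuture` returned by the Document AI client.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class DeferredBatchResult {

    /** Hypothetical outcome type: either the wait finished or the client gave up polling. */
    public enum WaitOutcome { COMPLETED, STILL_RUNNING_CHECK_LATER }

    /**
     * Waits on the future. A CancellationException is treated as
     * "client-side polling timed out", not as a failed backend operation.
     */
    public static WaitOutcome waitOrDefer(Future<?> future)
            throws InterruptedException, ExecutionException {
        try {
            future.get();
            return WaitOutcome.COMPLETED;
        } catch (CancellationException e) {
            // The backend operation may still be running; look for results
            // in the output bucket later instead of failing here.
            return WaitOutcome.STILL_RUNNING_CHECK_LATER;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for an operation future whose polling timed out.
        CompletableFuture<String> timedOut = new CompletableFuture<>();
        timedOut.cancel(true);
        System.out.println(waitOrDefer(timedOut));

        // Stand-in for an operation that finished within the polling window.
        System.out.println(waitOrDefer(CompletableFuture.completedFuture("done")));
    }
}
```

In the question's code, the `future` returned by the batch call could be passed to `waitOrDefer` in place of the bare `future.get()`. Genuine processing failures still surface as an `ExecutionException`.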
If you want your application to block until the operation completes, you can extend the polling timeout as shown in the link.
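Adapted to Document AI, the approach from the linked README might look roughly like the sketch below. It assumes the `com.google.cloud.documentai.v1` client and a gax version whose `RetrySettings` setters accept `org.threeten.bp.Duration` (newer gax releases also offer `java.time.Duration` variants). The class name `LongPollingSettings`, the method `buildSettings`, and the delay values are illustrative choices, not recommendations from the documentation.

```java
import com.google.api.gax.longrunning.OperationTimedPollAlgorithm;
import com.google.api.gax.retrying.RetrySettings;
import com.google.api.gax.retrying.TimedRetryAlgorithm;
import com.google.cloud.documentai.v1.DocumentProcessorServiceSettings;
import java.io.IOException;
import org.threeten.bp.Duration;

public class LongPollingSettings {

    /**
     * Builds client settings whose batch-process polling gives up only after
     * totalTimeout, instead of the service default.
     */
    public static DocumentProcessorServiceSettings buildSettings(Duration totalTimeout)
            throws IOException {
        TimedRetryAlgorithm pollingAlgorithm =
            OperationTimedPollAlgorithm.create(
                RetrySettings.newBuilder()
                    .setInitialRetryDelay(Duration.ofSeconds(5)) // first poll delay (assumed value)
                    .setRetryDelayMultiplier(1.5)
                    .setMaxRetryDelay(Duration.ofSeconds(45)) // cap between polls (assumed value)
                    .setInitialRpcTimeout(Duration.ZERO) // ignored for polling
                    .setRpcTimeoutMultiplier(1.0) // ignored for polling
                    .setMaxRpcTimeout(Duration.ZERO) // ignored for polling
                    .setTotalTimeout(totalTimeout) // overall polling budget
                    .build());

        DocumentProcessorServiceSettings.Builder builder =
            DocumentProcessorServiceSettings.newBuilder();
        // If your code sets a regional endpoint, also call builder.setEndpoint(...) here.
        builder.batchProcessDocumentsOperationSettings().setPollingAlgorithm(pollingAlgorithm);
        return builder.build();
    }
}
```

The resulting settings would then be passed to `DocumentProcessorServiceClient.create(settings)`, for example with `buildSettings(Duration.ofMinutes(30))`. After that, `future.get()` should keep polling for up to the configured total timeout before throwing the CancellationException.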