Tags: java, google-cloud-platform, cloud-document-ai

Document AI batch processing timeout using Java


I am trying to batch process a set of documents using Document AI and its Java SDK. My code is derived from the official batch processing example for Java, but I have modified it to submit more than one document (40 documents of up to 5 pages each).

I wait for the result of the batch processing using the same code as in the example:

      // Batch process document using a long-running operation.
      // You can wait for now, or get results later.
      // Note: first request to the service takes longer than subsequent
      // requests.
      System.out.println("Waiting for operation to complete...");
      future.get();

      System.out.println("Document processing complete.");

After a bit less than 5 minutes, I always get the following exception:

feb. 06, 2024 6:34:08 EM com.google.api.gax.longrunning.OperationTimedPollAlgorithm shouldRetry
VARNING: The task has been cancelled. Please refer to https://github.com/googleapis/google-cloud-java#lro-timeouts for more information
java.util.concurrent.CancellationException: Task was cancelled.
    at com.google.common.util.concurrent.AbstractFuture.cancellationExceptionWithCause(AbstractFuture.java:1560)
    at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:590)
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:571)
    at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:91)
    at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:67)
    at com.google.api.gax.longrunning.OperationFutureImpl.get(OperationFutureImpl.java:125)
    at ...

What can I do to avoid this timeout? I have tried with a smaller number of documents (25), but that times out as well.


Solution

  • From the link listed in the error message:


    LRO Timeouts

    The polling operations have a default timeout that varies from service to service. The library will throw a java.util.concurrent.CancellationException with the message: Task was cancelled. if the timeout exceeds the operation. A CancellationException does not mean that the backend GCP Operation was cancelled. This exception is thrown from the client library when it has exceeded the total timeout without receiving a successful status from the operation. Our client libraries respect the configured values set in the OperationTimedPollAlgorithm for each RPC.

    Note: The client library handles the Operation's polling mechanism for you. By default, there is no need to manually poll the status yourself.


    You don't need to block on a long-running operation, and doing so is not advised when processing a large number of documents, since the operation can take a long time. Because the CancellationException is raised by the client and does not cancel the backend operation, you can instead check the output Google Cloud Storage bucket later, once the operation has completed.
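As a rough illustration of "wait a little, then check later", here is a minimal sketch that uses only java.util.concurrent. The class name BoundedWait and the method completedWithin are hypothetical, and a CompletableFuture stands in for the operation future returned by the Document AI client (which also implements java.util.concurrent.Future). Checking the output bucket itself is not shown.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedWait {

  /**
   * Waits at most the given time for the future to finish.
   * Returns true if it completed successfully within that time, and false if
   * our wait ran out or the client-side future was cancelled. In the false
   * case the caller should look for results later instead of assuming failure.
   */
  public static boolean completedWithin(Future<?> future, long timeout, TimeUnit unit)
      throws InterruptedException, ExecutionException {
    try {
      future.get(timeout, unit);
      return true;
    } catch (TimeoutException e) {
      // Our own bounded wait expired; the work may still be in progress.
      return false;
    } catch (CancellationException e) {
      // The client-side future was cancelled (for example, polling gave up).
      // This says nothing about whether the backend operation was cancelled.
      return false;
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for an operation that has not finished yet.
    CompletableFuture<String> pending = new CompletableFuture<>();
    System.out.println(completedWithin(pending, 100, TimeUnit.MILLISECONDS)); // false

    // Stand-in for a future whose client-side polling was cancelled.
    pending.cancel(true);
    System.out.println(completedWithin(pending, 100, TimeUnit.MILLISECONDS)); // false

    // Stand-in for an operation that has already completed.
    CompletableFuture<String> done = CompletableFuture.completedFuture("ok");
    System.out.println(completedWithin(done, 100, TimeUnit.MILLISECONDS)); // true
  }
}
```

When completedWithin returns false, the application can carry on and inspect the output location later. An ExecutionException is still propagated, since that indicates the operation itself reported an error.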

    If you want your application to block until the operation completes, you can extend the polling timeout as shown in the link.
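For reference, the approach described at that link can be adapted to Document AI roughly as follows. This is a sketch, not a tested configuration: it assumes the v1 client (com.google.cloud.documentai.v1), a gax version whose RetrySettings builder accepts org.threeten.bp.Duration (newer releases also offer java.time-based setters), and a 24-hour total timeout chosen purely for illustration. The class name ExtendedTimeoutClient is hypothetical.

```java
import com.google.api.gax.longrunning.OperationTimedPollAlgorithm;
import com.google.api.gax.retrying.RetrySettings;
import com.google.api.gax.retrying.TimedRetryAlgorithm;
import com.google.cloud.documentai.v1.DocumentProcessorServiceClient;
import com.google.cloud.documentai.v1.DocumentProcessorServiceSettings;
import java.io.IOException;
import org.threeten.bp.Duration;

class ExtendedTimeoutClient {

  /** Creates a client whose batch-processing poll loop may run for up to 24 hours. */
  public static DocumentProcessorServiceClient create() throws IOException {
    DocumentProcessorServiceSettings.Builder settingsBuilder =
        DocumentProcessorServiceSettings.newBuilder();

    // Polling schedule: start at 500 ms between polls, back off by 1.5x up to 5 s,
    // and give up on the client side only after the total timeout has elapsed.
    TimedRetryAlgorithm pollingAlgorithm =
        OperationTimedPollAlgorithm.create(
            RetrySettings.newBuilder()
                .setInitialRetryDelay(Duration.ofMillis(500L))
                .setRetryDelayMultiplier(1.5)
                .setMaxRetryDelay(Duration.ofMillis(5000L))
                .setInitialRpcTimeout(Duration.ZERO) // ignored for polling
                .setRpcTimeoutMultiplier(1.0) // ignored for polling
                .setMaxRpcTimeout(Duration.ZERO) // ignored for polling
                .setTotalTimeout(Duration.ofHours(24L)) // illustrative value
                .build());

    // Apply the schedule to the batchProcessDocuments long-running operation only.
    settingsBuilder.batchProcessDocumentsOperationSettings().setPollingAlgorithm(pollingAlgorithm);

    return DocumentProcessorServiceClient.create(settingsBuilder.build());
  }
}
```

If your code already sets a regional endpoint on the settings builder, as the batch processing sample does, keep that call on the same builder. With a client created this way, future.get() should keep polling until the operation finishes or the configured total timeout is reached.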