[SOLVED] gRPC Client thread hangs on deadline exceeded error

I am using gRPC 1.55.1 (grpc-java) and am seeing an issue similar to the one discussed here:

https://github.com/grpc/grpc-java/issues/9069

I have set keepAliveTime on the client side as shown below:

var channel = ManagedChannelBuilder.forAddress(network.getIp(), network.getPort())
        .keepAliveTime(130, TimeUnit.SECONDS)
        .maxInboundMessageSize(maxInboundMessageSize)
        .maxInboundMetadataSize(maxInboundMetadataSize)
        .enableRetry()
        .build();

var stub = HelloServiceGrpc.newBlockingStub(channel).withDeadline(Deadline.after(115, TimeUnit.SECONDS));
stub.sayHello();
stub.sayHello();

On the server side, keepAliveTime is also set, as suggested in the GitHub issue above:

Grpc.newServerBuilderForPort(port, InsecureServerCredentials.create())
        .addService(new GreeterImpl())
        .keepAliveTime(130, TimeUnit.SECONDS)
        .build()
        .start();

In my case, the client calls server1, and server1 then acts as a client to server2.

When the deadline is exceeded in server2, the client in server1 receives the error below, which is then propagated to the final client as expected:

io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: context timed out
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165)

But in some rare cases, the client thread hangs and never receives the deadline exceeded error from the server. The stack of the hung client thread looks like this:

at jdk.internal.misc.Unsafe.park(java.base@17.0.9/Native Method)
    - parking to wait for  <0x0000000767a53a00> (a io.grpc.stub.ClientCalls$ThreadlessExecutor)
    at java.util.concurrent.locks.LockSupport.park(java.base@17.0.9/LockSupport.java:211)
    at io.grpc.stub.ClientCalls$ThreadlessExecutor.waitAndDrain(ClientCalls.java:748)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:157)

I waited for about 2 hours and it did not recover. The only way to recover is to restart the client application. I have observed this issue 3-4 times in the last 2 months.

Can someone let me know:

What am I doing wrong, or is there a known issue in gRPC 1.55.1, the version I am using?
Is there any timeout config I can set on the gRPC client side so that the client threads do not hang indefinitely?

Solution

Most likely you are hitting this bug: https://github.com/grpc/grpc-java/issues/10838

The fix is planned for 1.63.

As a workaround, disabling retry should help: call disableRetry() on the channel builder instead of enableRetry().
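
On the question about a client-side timeout: independent of the bug above, one generic way to keep a caller from waiting forever is to run the blocking call on a worker thread and wait for it with a bounded Future.get. The sketch below uses only the JDK; BoundedCall is a hypothetical helper name, not part of gRPC, and this is an assumption-laden illustration rather than a fix for the underlying issue.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not a gRPC API): bounds how long the caller waits for a blocking call. */
final class BoundedCall {
    // Daemon worker threads, so a call that never returns cannot keep the JVM alive at shutdown.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-call");
        t.setDaemon(true);
        return t;
    });

    private BoundedCall() {}

    /**
     * Runs the task on a worker thread and waits at most the given timeout for its result.
     * On timeout the worker is interrupted and TimeoutException is rethrown to the caller.
     */
    static <T> T call(Callable<T> task, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = POOL.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Best effort: interrupt the worker. The caller is released either way.
            future.cancel(true);
            throw e;
        }
    }
}
```

With the stub from the question, usage would look roughly like BoundedCall.call(() -> stub.sayHello(request), 120, TimeUnit.SECONDS), where request is an assumed request object and the 120-second guard is an arbitrary value chosen to be slightly longer than the 115-second gRPC deadline. This only releases the calling thread. Whether the interrupt also unblocks the stuck worker depends on the bug, so worker threads may still accumulate until retry is disabled or the library is upgraded.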