I have the following tech stack for a Spring AMQP application that consumes messages from RabbitMQ:
Spring Boot 2.2.6.RELEASE
Reactor Netty 0.9.12.RELEASE
Reactor Core 3.3.10.RELEASE
The application is deployed on a 4-core RHEL host.
Below is some of the configuration used for RabbitMQ:
@Bean
public CachingConnectionFactory connectionFactory() {
    CachingConnectionFactory cachingConnectionFactory = new CachingConnectionFactory();
    cachingConnectionFactory.setHost(<<HOST NAME>>);
    cachingConnectionFactory.setUsername(<<USERNAME>>);
    cachingConnectionFactory.setPassword(<<PASSWORD>>);
    cachingConnectionFactory.setChannelCacheSize(50);
    return cachingConnectionFactory;
}
@Bean
public SimpleRabbitListenerContainerFactory rabbitListenerContainerFactory() {
    SimpleRabbitListenerContainerFactory factory = new SimpleRabbitListenerContainerFactory();
    factory.setConnectionFactory(connectionFactory());
    factory.setMaxConcurrentConsumers(50);
    factory.setMessageConverter(new Jackson2JsonMessageConverter());
    factory.setDefaultRequeueRejected(false); // DLQ is in place
    return factory;
}
The consumers make downstream API calls using Spring WebClient in synchronous mode. Below is the configuration for the WebClient:
@Bean
public WebClient webClient() {
    ConnectionProvider connectionProvider = ConnectionProvider
            .builder("fixed")
            .lifo()
            .pendingAcquireTimeout(Duration.ofMillis(200000))
            .maxConnections(16)
            .pendingAcquireMaxCount(3000)
            .maxIdleTime(Duration.ofMillis(290000))
            .build();
    // HttpClient is immutable: tcpConfiguration(...) returns a new instance,
    // so the result has to be kept or the timeouts are never applied.
    HttpClient client = HttpClient.create(connectionProvider)
            .tcpConfiguration(<<connection timeout, read timeout, write timeout is set here....>>);
    WebClient.Builder builder = WebClient.builder()
            .baseUrl(<<base URL>>)
            .clientConnector(new ReactorClientHttpConnector(client));
    return builder.build();
}
This WebClient is autowired into a @Service class as
@Autowired
private WebClient webClient;
and is used in two places. The first place makes a single call:
public DownstreamStatusEnum downstream(String messageid, String payload, String contentType) {
    return call(messageid, payload, contentType);
}
private DownstreamStatusEnum call(String messageid, String payload, String contentType) {
    DownstreamResponse response = sendRequest(messageid, payload, contentType)
            .block();
    return response;
}
private Mono<DownstreamResponse> sendRequest(String messageid, String payload, String contentType) {
    return webClient
            .method(POST)
            .uri(<<URI>>)
            .contentType(MediaType.valueOf(contentType))
            .body(BodyInserters.fromValue(payload))
            .exchange()
            .flatMap(response -> response.bodyToMono(DownstreamResponse.class));
}
The other place requires parallel downstream calls and is implemented as below:
private Flux<DownstreamResponse> getValues(List<DownstreamRequest> reqList, String messageid) {
    return Flux
            .fromIterable(reqList)
            .parallel()
            .runOn(Schedulers.elastic())
            .flatMap(s -> {
                return webClient
                        .method(POST)
                        .uri(<<downstream url>>)
                        .body(BodyInserters.fromValue(s))
                        .exchange()
                        .flatMap(response -> {
                            if (response.statusCode().isError()) {
                                return Mono.just(new DownstreamResponse());
                            }
                            return response.bodyToMono(DownstreamResponse.class);
                        });
            }).sequential();
}
public List<DownstreamResponse> updateValue(List<DownstreamRequest> reqList, String messageid) {
    return getValues(reqList, messageid)
            .collectList()
            .block();
}
The application has been working fine for the past year or so. Lately, we are seeing an issue where one or more consumers get stuck with the default prefetch count (250) of messages in unacked status. The only way to recover is to restart the application.
We have not made any code changes recently, and there have been no infrastructure changes either.
When this happens, we take thread dumps, and the pattern is always similar: most of the consumer threads are in TIMED_WAITING state, while one or two are in WAITING state with the stacks below:
"org.springframework.amqp.rabbit.RabbitListenerEndpointContainer#0-13" waiting for condition ...
java.lang.Thread.State: WAITING (parking)
- parking to wait for ......
at .......
at .......
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(......
at reactor.core.publisher.Mono.block(....
at .........WebClientServiceImpl.call(...
Also see below -
"org.springframework.amqp.rabbit.RabbitListenerEndpointContainer#0-13" waiting for condition ...
java.lang.Thread.State: WAITING (parking)
- parking to wait for ......
at .......
at .......
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(......
at reactor.core.publisher.Mono.block(....
at .........WebClientServiceImpl.updateValue(...
I am not sure whether this thread dump shows that the consumer threads are actually stuck at the "block" call.
Please advise what the issue could be and what steps are needed to fix it. Earlier we thought it might be an issue with RabbitMQ or Spring AMQP, but based on the thread dump it looks like an issue with the WebClient "block" call.
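As a rough way to check this without reading dumps by hand, here is a minimal sketch that lists the threads currently parked beneath a given stack frame. It uses only the JDK and is not part of our application; the class name BlockedThreadFinder and the method findParked are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BlockedThreadFinder {

    /**
     * Returns the names of live threads that are WAITING or TIMED_WAITING and
     * have a stack frame whose class name contains classFragment and whose
     * method name equals methodName.
     */
    public static List<String> findParked(String classFragment, String methodName) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread.State state = entry.getKey().getState();
            if (state != Thread.State.WAITING && state != Thread.State.TIMED_WAITING) {
                continue; // only parked or sleeping threads are of interest here
            }
            for (StackTraceElement frame : entry.getValue()) {
                if (frame.getClassName().contains(classFragment)
                        && frame.getMethodName().equals(methodName)) {
                    names.add(entry.getKey().getName());
                    break; // one matching frame is enough for this thread
                }
            }
        }
        return names;
    }
}
```

Calling findParked("reactor.core.publisher.Mono", "block") from a diagnostic endpoint or a scheduled task would then return the listener threads that are sitting inside Mono.block() at that moment. The state and the stack are read at slightly different instants, so this is only an approximation of what a real thread dump shows.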
After adding BlockHound, it prints the stack trace below in the log file:
Error has been observed at following site(s)
Checkpoint Request to POST https://....... [DefaultWebClient]
Stack Trace:
at java.lang.Object.wait
......
at java.net.InetAddress.checkLookupTable
at java.net.InetAddress.getAddressFromNameService
......
at io.netty.util.internal.SocketUtils$8.run
......
at io.netty.resolver.DefaultNameResolver.doResolve
Sorry, I just realized that the flatMap in the parallel Flux call was actually like below:
.flatMap(response -> {
    if (response.statusCode().isError()) {
        return Mono.just(new DownstreamResponse());
    }
    return response.bodyToMono(DownstreamResponse.class);
});
So in error scenarios, I think the underlying connection was not being released properly. When I updated it as below, it seemed to fix the issue:
.flatMap(response -> {
    if (response.statusCode().isError()) {
        // Release the unread body so the connection goes back to the pool,
        // then emit the fallback value.
        return response.releaseBody().thenReturn(new DownstreamResponse());
    }
    return response.bodyToMono(DownstreamResponse.class);
});
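To show why I think the missing release matters, here is a toy model of a fixed-size pool. It is a plain-JDK sketch and not how Reactor Netty implements its connection pool: a Semaphore stands in for the 16 connections from my configuration, and the class name PoolLeakDemo and its methods are made up for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class PoolLeakDemo {

    // Stand-in for a fixed-size connection pool (assumption: one permit per connection).
    private final Semaphore pool;

    public PoolLeakDemo(int maxConnections) {
        this.pool = new Semaphore(maxConnections);
    }

    /**
     * Simulates one downstream call. Returns false if no connection could be
     * acquired within the timeout. When releaseOnError is false, an error
     * response keeps its connection forever, mirroring an error branch that
     * neither reads nor releases the response body.
     */
    public boolean call(boolean errorResponse, boolean releaseOnError, long acquireTimeoutMillis) {
        boolean acquired;
        try {
            acquired = pool.tryAcquire(acquireTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (!acquired) {
            return false; // pool exhausted: a real caller would be waiting here
        }
        if (!errorResponse || releaseOnError) {
            pool.release(); // body consumed or released: connection returns to the pool
        }
        return true;
    }

    public int available() {
        return pool.availablePermits();
    }
}
```

With a pool of 16, sixteen error responses handled without a release leave available() at 0, and the next call cannot acquire a connection even if the downstream service has recovered. With the release in place, the pool stays at 16 no matter how many errors occur, which matches what I saw after the change.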