javareact-contextmicrometerserviceloader

Java ServiceLoader sometimes does not find registered services


For our application (microservice) we're using a bunch of libraries, among others we use

Our problem is with the context propagation:

When we upload files to the Azure Storage Account our imperative code uses the Azure Storage driver/libs which provide a synchronous interface but under the hood are reactive.

When the execution steps from imperative code to reactive and back the class io.micrometer.context.ContextRegistry from the io.micrometer:context-propagation library comes into place which handles the propagation of information from ThreadLocals to Contexts and vice versa.

The io.micrometer.context.ContextRegistry is a singleton which loads the ContextAccessors via the Java Service Provider Interface (SPI) which uses the ServiceLoader class to find all ContextAccessors. According to the SPI spec they are located in their libraries (io.projectreactor:reactor-core, ...) and there made be discoverable by the corresponding file META-INF/services/io.micrometer.context.ContextAccessor.

We're running our services on Kubernetes clusters, of which we have separate ones that only differ in the available nodes/nodesize: DEV, SYSTEST, QA, PROD_EU, PROD_US.

So far so good - all works fine on local machines, DEV, SYSTEST and QA environment - meaning uploading files to the Azure Storage Account.

After deploying to PROD_EU and PROD_US we got errors that files could not be uploaded to the Azure Storage Account. The error was:

java.lang.IllegalStateException: No ContextAccessor for contextType: class reactor.util.context.Context1 
at io.micrometer.context.ContextRegistry.getContextAccessorForRead(ContextRegistry.java:206) 
at io.micrometer.context.DefaultContextSnapshotFactory.setAllThreadLocalsFrom(DefaultContextSnapshotFactory.java:119) 
at io.micrometer.context.DefaultContextSnapshotFactory.setThreadLocalsFrom(DefaultContextSnapshotFactory.java:109) 
at reactor.core.publisher.ContextPropagation.lambda$contextRestoreForHandle$2(ContextPropagation.java:162) 
...

In fact the error occurred in the io.micrometer.context.ContextRegistry class where the ContextAccessors that are loaded via the Java Service Provider mechanism are being used.

To track that down we inserted this piece of code at the method where the files get pushed to the Azure Storage Account:

log.info("!!!!! ContextAccessors !!!!!");
ContextRegistry.getInstance().getContextAccessors().forEach(contextAccessor ->
   log.info("!!!!! ContextAccessor: <class: {}, readableType: {}>", contextAccessor.getClass(), contextAccessor.readableType())
);

It turned out that on PROD_EU and PROD_US most of the time no ContextAccessors have been loaded ...

Running the service on different environments gives:

Environment
Local machines ContextAccessors loaded
DEV ContextAccessors loaded
SYSTEST ContextAccessors loaded
QA ContextAccessors loaded
PROD_EU On 1 pod ContextAccessors loaded
On 2 pods no ContextAccessors loaded
After some killing and restarting of pods:
On 2 pod ContextAccessors loaded
On 1 pods no ContextAccessors loaded
PROD_US On 1 pod ContextAccessors loaded
On 2 pods no ContextAccessors loaded

Since the problem "ContextAccessors not loaded by Java Service Provider mechanism" only occurs on two environments and there is not affecting all pods but only some we suspect some ClassLoading/runtime behaviour.

But it's getting even more crazy:

We moved the above mentioned code snipped from the method to the constructor of the class which made it execute at the application startup. This made the problem gone away on all(!) stages including PROD_EU and PROD_US ...

Removing this code snipped from the constructor again let the problem appear again only on PROD_EU and PROD_US.

This makes us even more convinced that the problem is caused by some Classloading/runtime problem.

The class io.micrometer.context.ContextRegistry is a singleton which in fact means it is possible that each classloader has its own singleton instance. It seems by calling ContextRegistry.getInstance() in the constructor the point in time when the singleton is created changes which changes the behaviour of loading the ContextAccessors.

However, neither the point in time when the ContextRegistry singleton instance is being created nor multiple instances of the ContextRegistry (in different classloaders) should cause one of them to not find the ContextAccessors in the libraries which are correct published accoring to the SPI spec.

Any ideas what could cause this weird problem?


UPDATE 1

I have made progress in my research: We're reading 500 messages as batch from Kafka and then processing in parallel:

...
messages.parallelStream().forEach(message -> {
  ...
}

The first message is always processed by the thread that fetched the 500 message batch from the topic (threadName="org.springframework.kafka.KafkaListenerEndpointContainer#5-0-C-1").

The next messages are processed by worker threads (threadName="ForkJoinPool.commonPool-worker-1", ...).

Processing the messages includes pushing data to the Azure Storage Account which uses the reactive Azure code under the hood. The first time this happens the singleton instance of io.micrometer.context.ContextRegistry is initialized which means registering all classes implementing io.micrometer.context.ContextAccessor by using the Java Service Privider Intrerface (SPI).

So far so good ...

IF this is executed by the thread "org.springframework.kafka.KafkaListenerEndpointContainer#5-0-C-1" all is fine - all ContextAccessors are found by the SPI.

BUT IF this is executed by one of the worker threads "ForkJoinPool.commonPool-worker-1", etc. this fails meaning the SPI does not find the ContextsAccessors which leads to java.lang.IllegalStateException: No ContextAccessor for contextType: class reactor.util.context.Context1

So the interesting question is: Why does the SPI not find the ContextAccessors when called by the worker threads?

I've extended the io.micrometer.context.ContextRegistry with some logs printing out the used Classloader, all libs known to that CL, etc. - all looks fine. Even in the failing case the relevant libraries are known to the CL ...

So it looks like a problem with the Java Service Provider Interface itself in our used JRE (java17-debian12). To check this I've upgraded to JAVA 21 (java21-debian12) but the problem stays ...


Solution

  • After some more research it turns out that this is the result of

    I've created another StackOverflow question regarding this since it's not clear to me how to handle that correctly and this current question is too specific: How to handle the Spring Boot classloader 'problem'