We are running Keycloak 21.1.2 on Kubernetes, using DNS_PING with a headless service to discover nodes in the cluster. While building the Keycloak image, we set the cache-config-file location to "cache-ispn.xml" (relative to the /opt/keycloak/conf directory, which is the default location for Keycloak), but we override the number of owners to 1.
During deployment, we update the owners in the cache-ispn.xml file based on the number of nodes in the cluster and mount it at the same location. All nodes in the cluster are discovered and everything seems to work fine. What I am not sure about is whether this is allowed, and how to confirm from metrics and logs that replication is working as configured.
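For reference, a minimal sketch of the build step described above. The base image tag matches our version; the file layout and image details here are assumptions for illustration:

```dockerfile
# Sketch: bake a custom cache config into the image at build time.
# cache-ispn.xml sits next to the Dockerfile and is copied into the
# default config directory before augmentation.
FROM quay.io/keycloak/keycloak:21.1.2
COPY cache-ispn.xml /opt/keycloak/conf/cache-ispn.xml
# The path passed to --cache-config-file is resolved relative to /opt/keycloak/conf.
RUN /opt/keycloak/bin/kc.sh build --cache=ispn --cache-config-file=cache-ispn.xml
```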
I tried enabling metrics and logs for the Infinispan cluster as described in "Configuring distributed caches - Keycloak", but none of the metrics indicate whether replication is working as I configured it during deployment. The logs are overwhelming and I am not sure what to look for. I also tried setting the log level to KC_LOG_LEVEL: info,org.keycloak.connections.infinispan:TRACE,org.keycloak.connections:TRACE.
The following are my observations after some testing.
TL;DR: The values in cache-ispn.xml at build time take effect; any values updated during deployment are ignored unless there is an explicit rebuild during deployment.
Initially I enabled Infinispan logs by setting the Keycloak log level to
INFO,org.keycloak:DEBUG,org.keycloak.connections:TRACE,org.keycloak.connections.infinispan:TRACE,org.infinispan:TRACE
to understand how replication works by tracking session IDs and RPC commands in the logs. But these logs were overwhelming, and it was impossible to understand which node owns an object and where it gets replicated to.
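For completeness, this is how that log-level string can be passed to the pods. KC_LOG_LEVEL is the standard Keycloak environment variable; the surrounding pod-spec structure is a sketch:

```yaml
# Kubernetes container fragment (illustrative): pass the log categories
# above via the environment so Keycloak picks them up at startup.
env:
  - name: KC_LOG_LEVEL
    value: "INFO,org.keycloak:DEBUG,org.keycloak.connections:TRACE,org.keycloak.connections.infinispan:TRACE,org.infinispan:TRACE"
```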
I gave up on this approach and enabled statistics as described at https://www.keycloak.org/server/caching. I had to enable statistics both globally and for each cache to get the full metrics. The corresponding metrics are exposed at the /auth/metrics endpoint. The metrics of interest for my use case follow this naming convention:
vendor_cache_manager_keycloak_cache_<cache-name>_cluster_cache_stats_required_minimum_number_of_nodes
For example, with 2 nodes in the Keycloak cluster and the following build-time cache-ispn.xml baked into the image:
...
<distributed-cache name="sessions" owners="1" statistics="true">
    <expiration lifespan="-1"/>
</distributed-cache>
...
and this updated cache-ispn.xml mounted during deployment:
...
<distributed-cache name="sessions" owners="2" statistics="true">
    <expiration lifespan="-1"/>
</distributed-cache>
...
I got the following metric for the sessions cache:
# HELP vendor_cache_manager_keycloak_cache_sessions_cluster_cache_stats_required_minimum_number_of_nodes Minimum number of nodes to avoid losing data
# TYPE vendor_cache_manager_keycloak_cache_sessions_cluster_cache_stats_required_minimum_number_of_nodes gauge
vendor_cache_manager_keycloak_cache_sessions_cluster_cache_stats_required_minimum_number_of_nodes{cache="sessions",node="keycloak-0-14766",} 2.0
The value here indicates the minimum number of nodes that must remain available to avoid data loss. A value of 2 in a 2-node cluster implies my setup cannot afford any node going down: the build-time configuration with only one owner per object is in effect, not the cache-ispn.xml updated during deployment, which has two replicas and could afford one node going down without data loss.
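To avoid eyeballing the raw metrics output, a small script can compute how many node failures each cache tolerates from this metric. This is a sketch: the sample text below mirrors the metric shown above, and in practice you would fetch the text from the /auth/metrics endpoint (hostname omitted here):

```python
# Parse the Prometheus text output from Keycloak's metrics endpoint and
# compute, per cache, how many nodes can fail without losing data.

METRIC = "cluster_cache_stats_required_minimum_number_of_nodes"
CLUSTER_SIZE = 2  # total nodes in the Keycloak cluster

# Sample copied from the metrics output above; normally fetched via HTTP.
sample = """\
vendor_cache_manager_keycloak_cache_sessions_cluster_cache_stats_required_minimum_number_of_nodes{cache="sessions",node="keycloak-0-14766",} 2.0
"""

def tolerable_failures(metrics_text: str, cluster_size: int) -> dict:
    """Return {cache_name: number of nodes that can fail without data loss}."""
    result = {}
    for line in metrics_text.splitlines():
        # Skip HELP/TYPE comment lines and unrelated metrics.
        if line.startswith("#") or METRIC not in line:
            continue
        # Extract the cache label and the sample value from the exposition line.
        cache = line.split('cache="')[1].split('"')[0]
        required_min = float(line.rsplit(" ", 1)[1])
        result[cache] = cluster_size - int(required_min)
    return result

print(tolerable_failures(sample, CLUSTER_SIZE))  # {'sessions': 0} -> owners=1 is in effect
```

A result of 0 for a cache confirms the single-owner build-time config is active; with owners=2 effective, the same check would report 1 tolerable failure.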
I tried different build and deployment configurations to come to this conclusion. It would have been great if there were more visibility into replication, and more Keycloak documentation to help understand replication issues.
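One workaround suggested by the behaviour above, sketched here as an assumption rather than a documented recipe: starting the container with plain `kc.sh start` (no `--optimized`) re-runs the build/augmentation step at startup, so a cache-ispn.xml mounted over /opt/keycloak/conf should be picked up. All names except the mount path are illustrative:

```yaml
# Kubernetes container fragment (illustrative names). Plain "start"
# re-augments at startup and reads the mounted cache-ispn.xml;
# "start --optimized" would keep the build-time values instead.
containers:
  - name: keycloak
    image: my-registry/keycloak:21.1.2   # assumed image name
    args: ["start"]
    volumeMounts:
      - name: cache-config
        mountPath: /opt/keycloak/conf/cache-ispn.xml
        subPath: cache-ispn.xml          # ConfigMap key holding the updated owners
```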