keycloak, infinispan

Keycloak Infinispan cache configuration change at startup without rebuild


We are running Keycloak 21.1.2 on Kubernetes, using DNS_PING with a headless service to discover the nodes in the cluster. While building the Keycloak image, we set the cache-config-file location to "cache-ispn.xml", which is resolved relative to the /opt/keycloak/conf directory (the default file location for Keycloak), but we override the number of owners to 1.
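
The build stage looks roughly like this (a sketch of our Dockerfile; the kc.sh build options are the standard Keycloak ones):

    FROM quay.io/keycloak/keycloak:21.1.2
    # cache-ispn.xml with owners="1", placed at the default conf location
    COPY cache-ispn.xml /opt/keycloak/conf/cache-ispn.xml
    RUN /opt/keycloak/bin/kc.sh build --cache=ispn --cache-config-file=cache-ispn.xml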

During deployment, we update the owners in the cache-ispn.xml file based on the number of nodes in the cluster and mount it at the same location. All nodes in the cluster are discovered and everything seems to work fine. What I am not sure about is whether this is allowed, and how to confirm from metrics and logs that replication is working as configured.
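
The mount looks roughly like this in the pod spec (a sketch; the ConfigMap name keycloak-cache-config is illustrative):

    containers:
      - name: keycloak
        volumeMounts:
          - name: cache-config
            mountPath: /opt/keycloak/conf/cache-ispn.xml
            subPath: cache-ispn.xml     # overrides the file baked into the image
    volumes:
      - name: cache-config
        configMap:
          name: keycloak-cache-config   # cache-ispn.xml rendered with the desired owners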

I tried enabling metrics and logs for the Infinispan cluster as described in Configuring distributed caches - Keycloak, but none of the metrics indicate whether replication is working as I configured it during deployment, and the logs are overwhelming; I am not sure what to look for. I also tried setting the log level to KC_LOG_LEVEL: info,org.keycloak.connections.infinispan:TRACE,org.keycloak.connections:TRACE.


Solution

  • Following are my observations after some testing.

    TLDR: The values in cache-ispn.xml at build time are what take effect; any values updated during deployment are ignored unless there is an explicit rebuild during deployment.

    From logs

    Initially I enabled Infinispan logs by setting the Keycloak log level to

    INFO,org.keycloak:DEBUG,org.keycloak.connections:TRACE,org.keycloak.connections.infinispan:TRACE,org.infinispan:TRACE

    to understand how replication works, by tracking a session id in the logs and following the RPC commands. But these logs were overwhelming, and it was impossible to tell what exactly was happening: which node owns an object and where it gets replicated to.

    Using Infinispan statistics

    I gave up on this approach and enabled statistics as described at https://www.keycloak.org/server/caching. I had to enable statistics both globally and for each cache to get the full set of metrics (see the configuration sketch below); the corresponding metrics are exposed at the /auth/metrics endpoint. The metrics of interest for my use case follow this naming convention:

    vendor_cache_manager_keycloak_cache_<cache-name>_cluster_cache_stats_required_minimum_number_of_nodes
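
    For reference, enabling statistics at both levels looks roughly like this in cache-ispn.xml (a sketch; the Keycloak metrics endpoint itself also has to be enabled, e.g. with --metrics-enabled=true):

    <cache-container name="keycloak" statistics="true">
        ...
        <distributed-cache name="sessions" owners="2" statistics="true">
            <expiration lifespan="-1"/>
        </distributed-cache>
        ...
    </cache-container>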

    For example, for my deployment with 2 nodes in the Keycloak cluster:

    Cache configuration during Docker image build

    ...
    <distributed-cache name="sessions" owners="1" statistics="true">
        <expiration lifespan="-1"/>
    </distributed-cache>
    ...
    

    Cache configuration during deployment (by mounting a custom cache-ispn.xml file)

    ...
    <distributed-cache name="sessions" owners="2" statistics="true">
        <expiration lifespan="-1"/>
    </distributed-cache>
    ...
    

    I got the following metric for the sessions cache:

    # HELP vendor_cache_manager_keycloak_cache_sessions_cluster_cache_stats_required_minimum_number_of_nodes Minimum number of nodes to avoid losing data
    # TYPE vendor_cache_manager_keycloak_cache_sessions_cluster_cache_stats_required_minimum_number_of_nodes gauge
    vendor_cache_manager_keycloak_cache_sessions_cluster_cache_stats_required_minimum_number_of_nodes{cache="sessions",node="keycloak-0-14766",} 2.0
    

    The value here indicates the minimum number of nodes that must remain available to avoid data loss. It implies my setup cannot afford any node going down: the build-time configuration with only one owner per object is in effect, not the cache-ispn.xml updated during deployment, which has two owners and could tolerate one node going down without any data loss.
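
    These values can be checked directly against the metrics endpoint, for example (the pod host name here is illustrative, from my headless-service setup):

    curl -s http://keycloak-0.keycloak-headless:8080/auth/metrics \
        | grep required_minimum_number_of_nodes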

    I tried different build and deploy configurations to arrive at this conclusion. It would be great if there were more visibility into replication, and more Keycloak documentation for understanding replication issues.
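
    If the owners value really has to differ per environment, one option (an untested sketch on my side) is to force the rebuild in the container entrypoint, so the mounted cache-ispn.xml is actually applied before startup:

    # override the container command: rebuild with the mounted config, then start
    /opt/keycloak/bin/kc.sh build --cache=ispn --cache-config-file=cache-ispn.xml
    exec /opt/keycloak/bin/kc.sh start --optimized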