[SOLVED] GridGain upgrade is failing with IgniteSpiException

Environments used: AWS, Docker Desktop

Existing image: gridgain/community:8.8.27-openjdk11-slim

New image: gridgain/community:8.8.34-openjdk17-slim

In both environments, a 3-node GridGain cluster is running with persistence enabled. I tried upgrading GridGain Community Edition from 8.8.27 to 8.8.34 with the kubectl apply command.

On local Docker Desktop, the upgrade completed without any issues.

When I ran the kubectl command against the AWS Kubernetes cluster, one of the GridGain pods was terminated and a new one was created with the new GridGain image. The new node fails to start with the exception below and restarts in a loop.

Caused by: class org.apache.ignite.spi.IgniteSpiException: Local node and remote node have different version numbers (node will not join, Ignite does not support rolling updates, so versions must be exactly the same) [locBuildVer=8.8.27, rmtBuildVer=8.8.34, locNodeAddrs=[gridgain-cluster-1.gridgain-service.or.svc.cluster.local/0:0:0:0:0:0:0:1%lo, /127.0.0.1], rmtNodeAddrs=[gridgain-cluster-2.gridgain-service.or.svc.cluster.local/0:0:0:0:0:0:0:1%lo, /127.0.0.1], locNodeId=7fa4..., rmtNodeId=a41d...]

I deactivated the GridGain cluster and tried again, but it failed with the same exception.

How do I perform the GridGain upgrade without losing the persisted data?

Solution

You can't run nodes with different GridGain versions in the same cluster unless the Rolling Upgrade feature is enabled, which is available in the GridGain Enterprise and Ultimate editions. With kubectl apply, the StatefulSet replaces its pods one at a time, so the first restarted node tries to join a cluster whose other nodes still run the old version and is rejected.

If Ignite persistence is enabled and the persistent volumes are properly configured in AWS, follow these steps to upgrade:

Stop the entire cluster by scaling the StatefulSet down to 0 replicas
Update the Docker image version in the StatefulSet
Start the cluster again by scaling the StatefulSet back up to 3 replicas
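
The three steps above could be scripted roughly as follows. This is only a sketch, not a tested procedure: the function name upgrade_gridgain is hypothetical, and the StatefulSet name (gridgain-cluster), namespace (or) and container name (gridgain) in the usage line are assumptions guessed from the pod hostnames in the exception, so check them with kubectl get statefulset before running anything.

```shell
#!/bin/sh
# Hypothetical helper: full-stop upgrade of a GridGain StatefulSet.
# Arguments: <statefulset> <namespace> <container> <new-image> <replicas>
upgrade_gridgain() {
  sts="$1"; ns="$2"; container="$3"; image="$4"; replicas="$5"

  # Step 1: stop the entire cluster by scaling down to 0 replicas.
  kubectl scale statefulset "$sts" -n "$ns" --replicas=0 || return 1

  # Wait until the StatefulSet status reports 0 pods, so that no node
  # running the old version is left when the new one starts.
  while [ "$(kubectl get statefulset "$sts" -n "$ns" -o jsonpath='{.status.replicas}')" != "0" ]; do
    sleep 5
  done

  # Step 2: point the pod template at the new image.
  kubectl set image "statefulset/$sts" "$container=$image" -n "$ns" || return 1

  # Step 3: start the cluster again and wait for the pods to become ready.
  kubectl scale statefulset "$sts" -n "$ns" --replicas="$replicas" || return 1
  kubectl rollout status "statefulset/$sts" -n "$ns"
}

# Example call (all names below are assumptions, adjust to your deployment):
# upgrade_gridgain gridgain-cluster or gridgain gridgain/community:8.8.34-openjdk17-slim 3
```

If the StatefulSet is managed from a manifest with kubectl apply, the image tag should also be changed in that manifest; otherwise the next apply would revert the pods to the old version.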

Expect downtime during the upgrade: without the Rolling Upgrade feature, all nodes must be stopped before the new version starts. The persisted data won't be lost, though, and will be available once the cluster is back up.