Service Fabric Replica Stuck


I am upgrading an application on Service Fabric and one of the replicas is showing the following warning:

Unhealthy event: SourceId='System.RAP', Property='IStatefulServiceReplica.ChangeRole(S)Duration', HealthState='Warning', ConsiderWarningAsError=false. The api IStatefulServiceReplica.ChangeRole(S) on node _gtmsf1_0 is stuck. Start Time (UTC): 2018-03-21 15:49:54.326.

After some debugging, I suspect I'm not properly honoring a cancellation token. In the meantime, how do I safely force a restart of this stuck replica to get the service working again?

Partial results of Get-ServiceFabricDeployedReplica:

...
ReplicaRole                : ActiveSecondary
ReplicaStatus              : Ready
ServiceTypeName            : MarketServiceType
...
ServicePackageActivationId : 
CodePackageName            : Code
...
HostProcessId              : 6180
ReconfigurationInformation : {
                             PreviousConfigurationRole            : Primary
                             ReconfigurationPhase                 : Phase0
                             ReconfigurationType                  : SwapPrimary
                             ReconfigurationStartTimeUtc          : 3/21/2018 3:49:54 PM
                             }

Solution

  • You might be able to pipe the output of Get-ServiceFabricDeployedReplica directly to Restart-ServiceFabricReplica. If the replica remains stuck after that, use Get-ServiceFabricDeployedCodePackage to find the code package that hosts it and Restart-ServiceFabricDeployedCodePackage to restart the surrounding host process. Restart-ServiceFabricDeployedCodePackage also has options that pick a random target in order to simulate failures, so be sure to target the specific code package you want to restart.
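
A minimal sketch of both steps, passing the identifiers explicitly rather than relying on pipeline binding. The node name and service type come from the question. The application name and service manifest name are hypothetical placeholders, and the PartitionId and ReplicaId property names are assumed to match what Get-ServiceFabricDeployedReplica returns (check with Get-Member before running):

```powershell
# Hypothetical placeholders: substitute the values from your own cluster.
$nodeName = '_gtmsf1_0'          # node named in the health warning
$appName  = 'fabric:/MyApp'      # ASSUMED application name
$manifest = 'MarketServicePkg'   # ASSUMED service manifest name

# Assumes a local cluster; pass -ConnectionEndpoint etc. for a remote one.
Connect-ServiceFabricCluster

# Step 1: find the stuck replica(s) on the node and restart each one.
$replicas = Get-ServiceFabricDeployedReplica -NodeName $nodeName -ApplicationName $appName |
    Where-Object { $_.ServiceTypeName -eq 'MarketServiceType' }

foreach ($r in $replicas) {
    # PartitionId / ReplicaId are assumed property names on the returned object.
    Restart-ServiceFabricReplica -NodeName $nodeName `
        -PartitionId $r.PartitionId `
        -ReplicaOrInstanceId $r.ReplicaId
}

# Step 2 (only if the replica is still stuck): restart the host process.
# List the deployed code packages first to confirm the exact names.
Get-ServiceFabricDeployedCodePackage -NodeName $nodeName -ApplicationName $appName

# Naming the node, application, manifest, and code package explicitly avoids
# the random-selection parameter sets. 'Code' is the CodePackageName shown
# in the question's output.
Restart-ServiceFabricDeployedCodePackage -NodeName $nodeName `
    -ApplicationName $appName `
    -ServiceManifestName $manifest `
    -CodePackageName 'Code'
```

Restarting the code package terminates the host process (HostProcessId 6180 in the output above), so any other replicas hosted in that process on the node are restarted as well.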