azure-service-fabric service-fabric-stateful

Service Fabric Stateful Service: how to avoid InvalidOperationException on long-running transactions with large values


I have a fairly long-running transaction that updates large values (dictionaries) in multiple reliable collections. I keep running into InvalidOperationException ("Transaction is committing or rolling back"), and retrying the operation just results in the same exception again. Is there anything I can do to mitigate the issue?

I assume this is related to the transaction blocking truncation of the transaction log like it says in the docs. Would making the log larger or smaller help?

https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-reliable-collections-guidelines

Do handle InvalidOperationException. User transactions can be aborted by the system for variety of reasons. For example, when the Reliable State Manager is changing its role out of Primary or when a long-running transaction is blocking truncation of the transactional log. In such cases, user may receive InvalidOperationException indicating that their transaction has already been terminated. Assuming, the termination of the transaction was not requested by the user, best way to handle this exception is to dispose the transaction, check if the cancellation token has been signaled (or the role of the replica has been changed), and if not create a new transaction and retry.


Solution

  • I'm not sure exactly which setting did it, but I got it working with the following:

            // Replicator settings for the Reliable State Manager (values in MB / KB).
            new ReliableStateManagerConfiguration(new ReliableStateManagerReplicatorSettings
            {
                CheckpointThresholdInMB = 4096,
                MaxRecordSizeInKB = 1024 * 1024, // 1 GB, expressed in KB
                MinLogSizeInMB = 4096,
            })
    

    My guess is that bumping up MinLogSizeInMB stopped it from attempting to truncate the log during the transaction, which it had been doing because of the large values.

    Unfortunately, this just allows a single transaction to complete. Eventually the log has to be truncated, and whichever transaction is in progress when that happens fails and has to be retried. It would be nice if there were some way around that. I'm tempted to set MinLogSizeInMB to a huge number to make these errors as few and far between as possible.
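
For the retry itself, here is a minimal sketch of the pattern the quoted guideline describes: dispose the failed transaction, check the cancellation token, then start over with a new transaction. `TransactionRetry`, `maxAttempts` and `delay` are hypothetical names of my own and not part of the Service Fabric SDK. The helper only depends on `System`, and the commented usage assumes the usual `StateManager.CreateTransaction()` / `CommitAsync()` calls inside a stateful service.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of the Service Fabric SDK.
public static class TransactionRetry
{
    // Runs `attempt` until it succeeds or `maxAttempts` is reached.
    // `attempt` is expected to create, use, commit and dispose its OWN
    // transaction, so every retry starts from a fresh one.
    public static async Task RunAsync(
        Func<CancellationToken, Task> attempt,
        int maxAttempts,
        TimeSpan delay,
        CancellationToken cancellationToken)
    {
        for (int i = 1; ; i++)
        {
            // Do not retry if the replica is shutting down or changing role.
            cancellationToken.ThrowIfCancellationRequested();
            try
            {
                await attempt(cancellationToken);
                return;
            }
            catch (InvalidOperationException) when (i < maxAttempts)
            {
                // The transaction was terminated by the system; wait, then retry.
                // On the last attempt the filter is false and the exception propagates.
                await Task.Delay(delay, cancellationToken);
            }
        }
    }
}

// Usage inside a StatefulService (assumes the Service Fabric SDK is referenced
// and `dictionary` is an IReliableDictionary obtained from the StateManager):
//
// await TransactionRetry.RunAsync(async ct =>
// {
//     using (ITransaction tx = StateManager.CreateTransaction())
//     {
//         await dictionary.SetAsync(tx, key, value);
//         await tx.CommitAsync();
//     }
// }, maxAttempts: 5, delay: TimeSpan.FromSeconds(1), cancellationToken);
```

This does not stop the log truncation from aborting a transaction. It only keeps the failure from surfacing until the attempts run out. It also catches every InvalidOperationException, so a real bug that throws the same type would be retried as well.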