msmqmessage-queuenservicebuscluster-computingmsdtc

MSMQ messages bound for clustered MSMQ instance get stuck in outgoing queues


We have clustered MSMQ for a set of NServiceBus services, and everything runs great until it doesn't. Outgoing queues on one server start filling up, and pretty soon the whole system is hung.

More details:

We have a clustered MSMQ between servers N1 and N2. Other clustered resources are only services that operate directly on the clustered queues as local, i.e. NServiceBus distributors.

All of the worker processes live on separate servers, Services3 and Services4.

For those unfamiliar with NServiceBus, work goes into a clustered work queue managed by the distributor. Worker apps on Service3 and Services4 send "I'm Ready for Work" messages to a clustered control queue managed by the same distributor, and the distributor responds by sending a unit of work to the worker process's input queue.

At some point, this process can get completely hung. Here is a picture of the outgoing queues on the clustered MSMQ instance when the system is hung:

Clustered MSMQ Outgoing Queues in Hung State

If I fail over the cluster to the other node, it's like the whole system gets a kick in the pants. Here is a picture of the same clustered MSMQ instance shortly after a failover:

Clustered MSMQ Outgoing Queues After Failover

Can anyone explain this behavior, and what I can do to avoid it, to keep the system running smoothly?


Solution

  • Over a year later, it seems that our issue has been resolved. The key takeaways seem to be:

    When we set up our Windows Failover Cluster, we made the assumption that it would be bad to "waste" resources on the inactive node, and so, having two quasi-related NServiceBus clusters at the time, we made a clustered MSMQ instance for Project1, and another clustered MSMQ instance for Project2. Most of the time, we figured, we would run them on separate nodes, and during maintenance windows they would co-locate on the same node. After all, this was the setup we have for our primary and dev instances of SQL Server 2008, and that has been working quite well.

    At some point I began to grow dubious about this approach, especially since failing over each MSMQ instance once or twice seemed to always get messages moving again.

    I asked Udi Dahan (author of NServiceBus) about this clustered hosting strategy, and he gave me a puzzled expression and asked "Why would you want to do something like that?" In reality, the Distributor is very light-weight, so there's really not much reason to distribute them evenly among the available nodes.

    After that, we decided to take everything we had learned and recreate a new Failover Cluster with only one MSMQ instance. We have not seen the issue since. Of course, making sure this problem is solved would be proving a negative, and thus impossible. It hasn't been an issue for at least 6 months, but who knows, I suppose it could fail tomorrow! Let's hope not.