[SOLVED] What is the correct way to use the timeout manager with the distributor in NServiceBus 3+?

What is the correct way to use the timeout manager with the distributor in NServiceBus 3+?

Version pre-3 the recommendation was to run a timeout manager as a standalone process on your cluster, beside the distributor. (As detailed here: http://support.nservicebus.com/customer/portal/articles/965131-deploying-nservicebus-in-a-windows-failover-cluster).

After the inclusion of the timeout manager as a satellite assembly, what is the correct way to use it when scaling out with the distributor?

Should each worker of Service A run with timeout manager enabled or should only the distributor process for service A be configured to run a timeout manager for service A?

If each worker runs it, do they share the same Raven instance for storing the timeouts? (And if so, how do you make sure that two or more workers don't pick up the same expired timeout at the same time?)

Solution

Allow me to answer this clearly myself.

After a lot of digging and with help from Andreas Öhlund on the NSB team(http://tech.groups.yahoo.com/group/nservicebus/message/17758), the correct anwer to this question is:

Like Udi Dahan mentioned, by design ONLY the distributor/master node should run a timeout manager in a scale out scenario.
Unfortunately in early versions of NServiceBus 3 this is not implemented as designed.

You have the following 3 issues:

1) Running with the Distributor profile does NOT start a timeout manager.

Workaround:

Start the timeout manager on the distributor yourself by including this code on the distributor:

class DistributorProfileHandler : IHandleProfile<Distributor> 
{
   public void ProfileActivated()
   {
       Configure.Instance.RunTimeoutManager();
   }
}

If you run the Master profile this is not an issue as a timeout manager is started on the master node for you automatically.

2) Workers running with the Worker profile DO each start a local timeout manager.

This is not as designed and messes up the polling against the timeout store and dispatching of timeouts. All workers poll the timeout store with "give me the imminent timeouts for MASTERNODE". Notice they ask for timeouts of MASTERNODE, not for W1, W2 etc. So several workers can end up fetching the same timeouts from the timeout store concurrently, leading to conflicts against Raven when deleting the timeouts from it.

The dispatching always happens through the LOCAL .timouts/.timeoutsdispatcher queues, while it SHOULD be through the queues of the timeout manager on the MasterNode/Distributor.

Workaround, you'll need to do both:

a) Disable the timeout manager on the workers. Include this code on your workers

class WorkerProfileHandler:IHandleProfile<Worker>
{
    public void ProfileActivated()
    {
        Configure.Instance.DisableTimeoutManager();
    }
}

b) Reroute NServiceBus on the workers to use the .timeouts queue on the MasterNode/Distributor.

If you don't do this, any call to RequestTimeout or Defer on the worker will die with an exception saying that you have forgotten to configure a timeout manager. Include this in your worker config:

<UnicastBusConfig TimeoutManagerAddress="{endpointname}.Timeouts@{masternode}" />

3) Erroneous "Ready" messages back to the distributor.

Because the timeout manager dispatches the messages directly to the workers input queues without removing an entry from the available workers in the distributor storage queue, the workers send erroneous "Ready" messages back to the distributor after handling a timeout. This happens even if you have fixed 1 and 2, and it makes no difference if the timeout was fetched from a local timeout manager on the worker or on one running on the distributor/MasterNode. The consequence is a build up of an extra entry in the storage queue on the distributor for each timeout handled by a worker.

Workaround: Use NServiceBus 3.3.15 or later.