What is the correct way to use the timeout manager with the distributor in NServiceBus 3+?

Before version 3, the recommendation was to run the timeout manager as a standalone process on your cluster, beside the distributor (as detailed here: http://support.nservicebus.com/customer/portal/articles/965131-deploying-nservicebus-in-a-windows-failover-cluster).

After the inclusion of the timeout manager as a satellite assembly, what is the correct way to use it when scaling out with the distributor?

Should each worker of Service A run with its timeout manager enabled, or should only the distributor process for Service A be configured to run a timeout manager for Service A?

If each worker runs it, do they share the same Raven instance for storing the timeouts? (And if so, how do you make sure that two or more workers don't pick up the same expired timeout at the same time?)

Tourneur answered 5/2, 2013 at 22:19 Comment(0)

Allow me to answer this clearly myself.

After a lot of digging, and with help from Andreas Öhlund on the NSB team (http://tech.groups.yahoo.com/group/nservicebus/message/17758), the correct answer to this question is:

  • As Udi Dahan mentioned, by design ONLY the distributor/master node should run a timeout manager in a scale-out scenario.
  • Unfortunately, in early versions of NServiceBus 3 this is not implemented as designed.

You have the following 3 issues:

1) Running with the Distributor profile does NOT start a timeout manager.

Workaround:

Start the timeout manager on the distributor yourself by including this code on the distributor:

// Picked up automatically by the NServiceBus host; starts the timeout manager
// when the endpoint runs with the Distributor profile.
class DistributorProfileHandler : IHandleProfile<Distributor>
{
    public void ProfileActivated()
    {
        Configure.Instance.RunTimeoutManager();
    }
}

If you run the Master profile, this is not an issue: a timeout manager is started on the master node for you automatically.

2) Workers running with the Worker profile DO each start a local timeout manager.

This is not as designed, and it messes up both the polling against the timeout store and the dispatching of timeouts. All workers poll the timeout store with "give me the imminent timeouts for MASTERNODE". Notice that they ask for the timeouts of MASTERNODE, not for W1, W2, etc. As a result, several workers can end up fetching the same timeouts from the timeout store concurrently, leading to conflicts against Raven when the timeouts are deleted from it.

The dispatching always happens through the LOCAL .timeouts/.timeoutsdispatcher queues, while it SHOULD go through the queues of the timeout manager on the MasterNode/Distributor.

Workaround (you will need to do both a and b):

a) Disable the timeout manager on the workers. Include this code on your workers:

// Picked up automatically by the NServiceBus host; disables the local timeout
// manager when the endpoint runs with the Worker profile.
class WorkerProfileHandler : IHandleProfile<Worker>
{
    public void ProfileActivated()
    {
        Configure.Instance.DisableTimeoutManager();
    }
}

b) Reroute NServiceBus on the workers to use the .timeouts queue on the MasterNode/Distributor.

If you don't do this, any call to RequestTimeout or Defer on the worker will die with an exception saying that you have forgotten to configure a timeout manager. Include this in your worker config:

<UnicastBusConfig TimeoutManagerAddress="{endpointname}.Timeouts@{masternode}" /> 
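
For example, a minimal sketch of a filled-in worker config, assuming a hypothetical endpoint named ServiceA whose distributor/master node runs on a machine called ClusterNodeA (both names are placeholders, not taken from the question):

<!-- Hypothetical names: endpoint "ServiceA", master node machine "ClusterNodeA" -->
<UnicastBusConfig TimeoutManagerAddress="ServiceA.Timeouts@ClusterNodeA" />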

3) Erroneous "Ready" messages back to the distributor.

Because the timeout manager dispatches messages directly to the workers' input queues without removing an entry from the available workers in the distributor's storage queue, the workers send erroneous "Ready" messages back to the distributor after handling a timeout. This happens even if you have fixed issues 1 and 2, and it makes no difference whether the timeout was fetched from a local timeout manager on the worker or from one running on the distributor/MasterNode. The consequence is that an extra entry builds up in the distributor's storage queue for each timeout handled by a worker.

Workaround: Use NServiceBus 3.3.15 or later.

Tourneur answered 10/2, 2013 at 17:41 Comment(4)
+1 Thanks for sharing this awesome and detailed answer. It is a service to the community. - Byproduct
Thanks for this. We have implemented these workarounds, but are still seeing extra 'ready' messages in the storage queue. I believe it is due to your #3. Unfortunately, it looks like Andreas closed issue 954 a few days ago. - Statis
Hi Yobi21! Do the extra ready messages build up over time, or do you just get a couple of extra ones at startup? If it's the latter, I've reported a race condition between the "Worker started" and "Worker ready for a new message" messages that can occur when you start up your endpoint, especially if the endpoint subscribes to any of its own events. Not a big deal, but last I heard it's scheduled to be fixed in 4.1. Check out: github.com/NServiceBus/NServiceBus/issues/978 - Tourneur
Documented the issue in this thread: link - Statis

In version 3+ we created the concept of a master node, which hosts inside it all the satellites: the distributor, timeout manager, gateway, etc.

The master node is very simple to run - you just pass a /master flag to the NServiceBus.Host.exe process and it runs everything for you. So, from a deployment perspective, where you used to deploy the distributor, you now deploy the master node.
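
In practice, with the stock NServiceBus.Host.exe this means passing the Master profile on the command line; a minimal sketch, assuming the standard NServiceBus 3 profile name (check the profiles documentation for your exact version):

NServiceBus.Host.exe NServiceBus.Master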

Syllabary answered 6/2, 2013 at 6:48 Comment(5)
Okay, fair enough. So no timeout manager per worker then. But the Master profile also makes the process behave like a worker: it partakes in the load balancing as a receiver and handler of messages just like the other workers. I'd like to keep that work off the clustered nodes to maximize the resources available for doing the actual distribution to the other workers. Is there a supported way to do that when using the Distributor profile on the distributor, and not the Master profile? Or some way to make sure that the master node does not behave as a worker? - Tourneur
See the "feature related profiles" section of this page: support.nservicebus.com/customer/portal/articles/…Syllabary
I have, and I can't see anything there about whether the workers should run the timeout manager or not. - Tourneur
@UdiDahan I do wish you would elaborate on your answer, as the documentation you are referring to does not answer janovesk's question in a good way. - Chinn
I've checked, and we run the timeout manager by default on each node (master and worker). We'll change that in version 4.0 and have it run only on the master node. - Syllabary
