MSMQ messages bound for clustered MSMQ instance get stuck in outgoing queues
We have clustered MSMQ for a set of NServiceBus services, and everything runs great until it doesn't. Outgoing queues on one server start filling up, and pretty soon the whole system is hung.

More details:

We have a clustered MSMQ instance shared between servers N1 and N2. The only other clustered resources are services that operate directly on the clustered queues as local queues, i.e. the NServiceBus distributors.

All of the worker processes live on separate servers, Services3 and Services4.

For those unfamiliar with NServiceBus: work goes into a clustered work queue managed by the distributor. Worker apps on Services3 and Services4 send "I'm Ready for Work" messages to a clustered control queue managed by the same distributor, and the distributor responds by sending a unit of work to the worker process's input queue.
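
Roughly, the hand-off looks like the sketch below. This is not NServiceBus's actual implementation, just a simplified System.Messaging illustration of the pattern, with hypothetical queue names and both queues assumed transactional:

```csharp
using System;
using System.Messaging;

// Simplified illustration of the distributor hand-shake described above.
// Queue names are hypothetical; NServiceBus's real implementation
// carries more metadata than this.
class WorkerSketch
{
    static void Main()
    {
        // The clustered control queue managed by the distributor.
        var control = new MessageQueue(
            @"FormatName:DIRECT=OS:msmqcluster\private$\distributor.control");

        // This worker's local input queue, where the distributor sends work.
        var input = new MessageQueue(@".\private$\worker.input");

        // Tell the distributor we are ready for a unit of work,
        // and where to send it.
        var ready = new Message("I'm Ready for Work") { ResponseQueue = input };
        control.Send(ready, MessageQueueTransactionType.Single);

        // Block until the distributor hands us a unit of work.
        Message work = input.Receive(MessageQueueTransactionType.Single);
        Console.WriteLine("Received work item {0}", work.Id);
    }
}
```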

At some point, this process can get completely hung. Here is a picture of the outgoing queues on the clustered MSMQ instance when the system is hung:

Clustered MSMQ Outgoing Queues in Hung State
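
For anyone who wants to watch these counts without the MMC snap-in, something like the following should work. It assumes the MSMQ performance counters are exposed through WMI (the Win32_PerfRawData_MSMQ_MSMQQueue class, which also lists outgoing queues) and that it runs on the node hosting the clustered instance:

```csharp
using System;
using System.Management; // add a reference to System.Management.dll

// Dumps MSMQ queue lengths (including outgoing queues) via the
// perf-counter WMI class.
class QueueLengths
{
    static void Main()
    {
        var searcher = new ManagementObjectSearcher(
            "SELECT Name, MessagesinQueue FROM Win32_PerfRawData_MSMQ_MSMQQueue");

        foreach (ManagementObject queue in searcher.Get())
        {
            Console.WriteLine("{0,-60} {1}", queue["Name"], queue["MessagesinQueue"]);
        }
    }
}
```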

If I fail over the cluster to the other node, it's like the whole system gets a kick in the pants. Here is a picture of the same clustered MSMQ instance shortly after a failover:

Clustered MSMQ Outgoing Queues After Failover

Can anyone explain this behavior, and what I can do to avoid it, to keep the system running smoothly?

Expiable asked 6/10, 2010 at 16:03 Comment(6)
Does the secondary node eventually hang? How are the workers acting? Are they actively processing messages?Sprage
It doesn't happen often enough that I can authoritatively say it happens on only one node or both. The workers are behaving - they are actively processing messages when there are messages in their local input queues to process.Expiable
Weird. How often does it happen? How many NIC cards does each node have? I'm wondering if MSMQ is getting confused as to which card to use and therefore is occasionally not completing the ACKs back. There should be a registry setting to lock it in.Sprage
It happens maybe 2-3 times per week. All servers involved (cluster nodes and worker nodes) are virtualized on vSphere. The cluster nodes are each vSphere guests on separate hosts. In their virtual configurations, each server has only one NIC. Of course, with the clustered services there are multiple IP addresses bouncing around.Expiable
Did you ever figure this out? It's almost as if something is taking the node away from the Distributor.Sprage
Not yet. Thought it might have something to do with the clustered instances being unable to bind to the correct IP address. There's a registry key that seems to address that, but it appears to require a hotfix, and the hotfix would not install - it said it did not apply to our OS (Windows Server 2008). It seems to be running OK for the time being, but our 2 MSMQ clusters are running on different nodes in the cluster, i.e. not both on the same node. We're a bit nervous about what happens when we want to add a 3rd MSMQ instance.Expiable
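
For reference, the registry setting mentioned in these comments is, as far as I know, the BindInterfaceIP value under MSMQ's Parameters key; verify the exact name against the hotfix documentation for your OS. A quick way to check whether it is set:

```csharp
using System;
using Microsoft.Win32;

// Checks whether MSMQ is pinned to a specific IP address.
// "BindInterfaceIP" is the value name as I understand it; verify it
// against the hotfix/KB documentation before relying on it.
class CheckMsmqBinding
{
    static void Main()
    {
        using (RegistryKey key = Registry.LocalMachine.OpenSubKey(
            @"SOFTWARE\Microsoft\MSMQ\Parameters"))
        {
            object ip = (key == null) ? null : key.GetValue("BindInterfaceIP");
            Console.WriteLine(ip == null
                ? "BindInterfaceIP not set - MSMQ chooses its own interface."
                : "MSMQ is bound to " + ip);
        }
    }
}
```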

Over a year later, it seems that our issue has been resolved. The key takeaways seem to be:

  • Make sure you have a solid DNS system so when MSMQ needs to resolve a host, it can (a quick check is sketched after this list).
  • Only create one clustered instance of MSMQ on a Windows Failover Cluster.
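
On the DNS point, here is the kind of sanity check we could have run from each worker node; the host names are placeholders for your own cluster and worker names:

```csharp
using System;
using System.Net;

// Verifies that every host name the bus depends on actually resolves.
// The names below are placeholders.
class DnsSanityCheck
{
    static void Main()
    {
        string[] hosts = { "msmqcluster", "N1", "N2", "Services3", "Services4" };

        foreach (string host in hosts)
        {
            try
            {
                IPHostEntry entry = Dns.GetHostEntry(host);
                Console.WriteLine("{0} -> {1}", host,
                    string.Join(", ", Array.ConvertAll(entry.AddressList, a => a.ToString())));
            }
            catch (Exception ex)
            {
                Console.WriteLine("{0} FAILED to resolve: {1}", host, ex.Message);
            }
        }
    }
}
```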

When we set up our Windows Failover Cluster, we made the assumption that it would be bad to "waste" resources on the inactive node, and so, having two quasi-related NServiceBus clusters at the time, we made a clustered MSMQ instance for Project1, and another clustered MSMQ instance for Project2. Most of the time, we figured, we would run them on separate nodes, and during maintenance windows they would co-locate on the same node. After all, this was the setup we have for our primary and dev instances of SQL Server 2008, and that has been working quite well.

At some point I began to grow dubious about this approach, especially since failing over each MSMQ instance once or twice seemed to always get messages moving again.

I asked Udi Dahan (author of NServiceBus) about this clustered hosting strategy, and he gave me a puzzled expression and asked, "Why would you want to do something like that?" In reality, the Distributor is very lightweight, so there's really not much reason to spread the instances evenly among the available nodes.

After that, we decided to take everything we had learned and recreate a new Failover Cluster with only one MSMQ instance. We have not seen the issue since. Of course, proving the problem is truly solved would mean proving a negative, which is impossible. It hasn't been an issue for at least 6 months, but who knows, I suppose it could fail tomorrow! Let's hope not.

Expiable answered 22/12, 2011 at 17:37 Comment(0)

Maybe your servers were cloned and thus share the same Queue Manager ID (QMId).

MSMQ uses the QMId as a hash key for caching the addresses of remote machines. If more than one machine on your network has the same QMId, you can end up with stuck or missing messages.
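
A quick way to compare QMIds across machines is to read them from the registry. The path below is the commonly documented location; treat it as an assumption and verify it on your OS:

```csharp
using System;
using Microsoft.Win32;

// Prints this machine's QMId so it can be compared across servers.
// Two machines reporting the same GUID points to a cloning problem.
class PrintQmId
{
    static void Main()
    {
        using (RegistryKey key = Registry.LocalMachine.OpenSubKey(
            @"SOFTWARE\Microsoft\MSMQ\Parameters\MachineCache"))
        {
            byte[] raw = (key == null) ? null : (byte[])key.GetValue("QMId");
            Console.WriteLine(raw == null
                ? "QMId not found - is MSMQ installed?"
                : "QMId: " + new Guid(raw));
        }
    }
}
```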

Check out the explanation and solution in this blog post: Link

Sherburn answered 8/11, 2010 at 19:11 Comment(1)
This was not the case for me, but very good information. And, as seems to be par for the course with MSMQ, very well hidden. Hopefully it will help someone else. I, on the other hand, will keep searching...Expiable

How are your endpoints configured to persist their subscriptions?

What if one (or more) of your services encounters an error and is restarted by the Failover Cluster Manager? In that case, this service would never again receive the "I'm Ready for Work" messages from the other services.

When you fail over to the other node, I guess that all your services send these messages again, and as a result everything starts working again.

To test this behavior, do the following:

  1. Stop and restart all your services.
  2. Stop only one of the services.
  3. Restart the stopped service.
  4. If your system does not hang, repeat this with each service individually.

If your system now hangs again, check your configuration. In this scenario, at least one of your services (if not all of them) is losing its subscriptions between restarts. If you do not already do so, persist the subscriptions in a database (see the sketch below).
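
As a starting point, here is a hedged sketch of switching to DB-backed subscription storage with the NServiceBus 2.x fluent API. Method names changed between versions (the 2.x method really was spelled "DBSubcriptionStorage"), so check the API of your release; the connection itself is configured in app.config:

```csharp
using NServiceBus;

// Hedged sketch of an endpoint that persists subscriptions in a database,
// assuming the NServiceBus 2.x fluent configuration API. Consult your
// release's documentation for the exact app.config storage section.
class EndpointConfig
{
    static void Main()
    {
        IBus bus = Configure.With()
            .DefaultBuilder()
            .XmlSerializer()
            .MsmqTransport()
                .IsTransactional(true)
            .DBSubcriptionStorage() // subscriptions now survive restarts
            .UnicastBus()
                .LoadMessageHandlers()
            .CreateBus()
            .Start();
    }
}
```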

Marder answered 13/10, 2010 at 15:20 Comment(2)
Subscriptions are already persisted in a shared database. The clustered distributor stores its state in a clustered MSMQ queue. If a worker is restarted by the failover cluster manager, one of the first things it does (on any startup) is to send the ReadyMessage.Expiable
It is true that the worker sends the ReadyMessage on start. I am asking about the persisted subscriptions because I had a similar problem. One of the subscriptions was not correctly saved in the DB, so after a restart, although the service sent its message, the others completely ignored it because they only checked the DB. The only exception was when all services were restarted together; then the messages of the service in question were received again. On a single service restart, the messages failed again.Marder
