TIBCO EMS Failover reconnect for C# (TIBCO.EMS.dll)
A

3

14

We have a TIBCO EMS solution that uses built-in server failover in a 2-4 server environment. If the TIBCO admins fail over services from one EMS server to another, connections are supposed to be transferred to the new server automatically at the EMS service level. For our C# applications using the EMS service, this is not happening: our user connections are not being transferred to the new server after failover, and we're not sure why.

Our application connects to EMS at startup only, so if the TIBCO admins fail over after users have started our application, the users need to restart the app in order to reconnect to the new server. (Our EMS connection uses a server string that includes all 4 production EMS servers. If the first attempt fails, it moves to the next server in the string and tries again.)

I'm looking for an automated approach that will periodically attempt to reconnect to EMS if it detects that the connection is dead, but I'm not sure how best to do that.

Any ideas? We are using TIBCO.EMS.dll version 4.4.2 and .NET 2.x (SmartClient app).

Any help would be appreciated.

Astrea answered 9/10, 2008 at 15:37 Comment(8)
How are you currently implementing fault tolerance? On the server in the 'factories.conf' file? Does your 'url' property contain the comma separated list of URLs specified in the 'tib_ems_dotnet_ref.pdf' on page 134?Saintsimon
Yes, and that, believe it or not, is the problem. Connections should be transferred from one server to another when the EMS server is failed over. This should work when you have a delimited list of EMS servers in your connection string, but I believe there's a bug in the EMS.lib that stops it from working.Astrea
Is an error thrown after failover by the listeners? Is an error thrown after failover by the producers (when sending a message)? Most likely, the library expects you to reconnect. Providing multiple servers in the connection string just lets it round-robin during the initial connect, not later...Overburdensome
I hesitate to dive into a client programming solution when we're not sure exactly what the server is doing. Can you provide more information on how/when you know failover is not working? (note: client failover notification - tib ems user guide pg 292, tib dotnet ref pg 220).Saintsimon
ConnectionAttempts sets the number of times a client will retry connecting to the server. ReconnectAttempts sets the number of times a client will try to reconnect after a network disconnect. Neither of those appears to do what it says. Yes, we get exceptions on Publish.Astrea
ajmastrean our clients are probably not getting a notification. They basically subscribe to messages and infrequently publish (only, in fact, when we tell them to remotely via EMS). The server is what's causing the issue: when the admins fail over to another server, the clients become disconnected.Astrea
Scott, (let's get a clear picture of the env) 1. If it's not done, set up your client to receive failover notification from the server (see the pages noted in my prev comment) 2. Set up your connection to have an ExceptionListener 3. Confirm that the EMSException contains the URL of the backup serverSaintsimon
ajmastrean I'll run a test to make sure it is cycling through the servers.Astrea
S
6

This post should sum up my current comments and explain my approach in more detail...

The TIBCO 'ConnectionFactory' and 'Connection' types are heavyweight, thread-safe types. TIBCO suggests that you maintain the use of one ConnectionFactory (per server configured factory) and one Connection per factory.

The server also appears to be responsible for in-place 'Connection' failover and re-connection, so let's confirm it's doing its job and then lean on that feature.

Creating a client side solution is going to be slightly more involved than fixing a server or client setup problem. All sessions you have created from a failed connection need to be re-created (not to mention producers, consumers, and destinations). There are no "reconnect" or "refresh" methods on either type. The sessions do not maintain a reference to their parent connection either.

You will have to manage a lookup of connection/session objects and go nuts re-initializing everything, or implement some sort of session failure event handler that can get the new connection and reconnect them.

So, for now, let's dig in and see if the client is set up to receive failover notification (tib ems users guide pg 292), and make sure the raised exception is caught, contains the failover URL, and is being handled properly.
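To make the manual approach concrete, here is a minimal sketch of such a lookup. It deliberately avoids the TIBCO types: RebuildRegistry, Register, and Reconnect are hypothetical names, and the connect delegate stands in for whatever call creates your EMS connection. Treat it as an illustration of the bookkeeping involved, not as a recommended implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper, not part of TIBCO.EMS: remembers everything that must
// be rebuilt after a connection is lost and replays it on a new connection.
public class RebuildRegistry
{
    private readonly List<Action> _rebuildSteps = new List<Action>();

    // Register one step, e.g. "create a session, then create its consumer".
    public void Register(Action step)
    {
        _rebuildSteps.Add(step);
    }

    // Tries to reconnect up to maxAttempts times, sleeping delayMs between
    // tries. 'connect' is supplied by the caller and must throw on failure.
    // If a rebuild step throws, the whole attempt is retried from 'connect'.
    public bool Reconnect(Action connect, int maxAttempts, int delayMs)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                connect();
                foreach (Action step in _rebuildSteps)
                {
                    step();
                }
                return true;
            }
            catch (Exception)
            {
                // Wait before the next try, but not after the final one.
                if (attempt < maxAttempts)
                {
                    Thread.Sleep(delayMs);
                }
            }
        }
        return false;
    }
}
```

Each registered step would typically recreate one session together with its producers and consumers, which is why getting the built-in failover to work is the more attractive option.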

Saintsimon answered 16/10, 2008 at 21:19 Comment(6)
I've set up Tibems.SetExceptionOnFTSwitch(true) so I'm now seeing when our connection fails. We have not yet been able to test in a failover setting but I am getting the exceptions when the connection goes away. Built-in reconnect isn't working when the server comes back though.Astrea
We've checked server-client and client-server heartbeats. They were disabled in the test environment and I thought that might be the cause for reconnect not working. Enabled, set to 10s and we still don't get reconnect attempts.Astrea
The fact that I can trap for server connection failure is a good sign but I'd rather have the reconnect logic from the EMS library do its job than have to loop reconnect tries manually.Astrea
I'm in a similar situation. I'm refactoring an older messaging system that was doing infinite loops trying to re-initialize the Connection instance... I'm hoping to get away from that. I don't have much practical knowledge though, I've been working mostly from the TIBCO docs.Saintsimon
ajmastrean meaning of course that you have access to the docs. I basically had to beg to get access. They were locked on a network drive so I couldn't even read them. I've been flying blind in EMS ever since we started implementing.Astrea
I'll post here if TIBCO has anything to say on the subject - our TIB Admin is putting in a problem ticket including some of my code - or if we resolve the issue in some other way. Thanks for your help.Astrea
A
9

First off, yes, I am answering my own question. It's important to note, however, that without ajmastrean, I would be nowhere. Thank you so much!

ONE: ConnectionFactory.SetReconnAttemptCount, SetReconnAttemptDelay, SetReconnAttemptTimeout should be set appropriately. I think the default values re-try too quickly (on the order of 1/2 second between retries). Our EMS servers can take a long time to failover because of network storage, etc - so 5 retries at 1/2s intervals is nowhere near long enough.

TWO: I believe it's important to enable the client-server and server-client heartbeats. I wasn't able to verify this, but without those in place, the client might not get the notification that the server is offline or switching in failover mode. This, of course, is a server-side setting for EMS.

THREE: You can watch for the failover event by setting Tibems.SetExceptionOnFTSwitch(true); and then wiring up an exception event handler. In a single-server environment, you will see a "Connection has been terminated" message. In a fault-tolerant multi-server environment, you will instead see "Connection has performed fault-tolerant switch to <server url>". You don't strictly need this notification, but it can be useful (especially in testing).

FOUR: This is apparently not clear in the EMS documentation, but connection reconnect will NOT work in a single-server environment. You need to be in a multi-server, fault-tolerant environment. There is a trick, however: you can put the same server in the connection list twice. Strange, I know, but it works, and it enables the built-in reconnect logic.

Some code:
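As a small illustration of the trick in point FOUR, the sketch below builds the comma-separated server list with the same URL repeated. EmsUrls and its methods are hypothetical helper names, and the host and port in the usage note are placeholders, not values from our environment.

```csharp
using System;

// Hypothetical helper for building EMS server URL lists.
public static class EmsUrls
{
    // Joins server URLs into the comma-separated form used for a
    // fault-tolerant server list.
    public static string BuildServerList(params string[] urls)
    {
        return string.Join(",", urls);
    }

    // Single-server workaround from point FOUR: list the same URL twice.
    public static string SingleServerWithReconnect(string url)
    {
        return BuildServerList(url, url);
    }
}
```

For example, EmsUrls.SingleServerWithReconnect("tcp://emshost:7222") yields "tcp://emshost:7222,tcp://emshost:7222", which would then be passed to the connection factory in place of the single URL.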

private void initEMS()
{
    // Raise an exception event on the client when a fault-tolerant switch occurs.
    Tibems.SetExceptionOnFTSwitch(true);
    _ConnectionFactory = new TIBCO.EMS.TopicConnectionFactory(<server>);
    _ConnectionFactory.SetReconnAttemptCount(30);       // 30 retries
    _ConnectionFactory.SetReconnAttemptDelay(120000);   // 2 minutes between retries
    _ConnectionFactory.SetReconnAttemptTimeout(2000);   // 2 seconds per attempt
    _Connection = _ConnectionFactory.CreateTopicConnection(<username>, <password>);
    _Connection.ExceptionHandler += new EMSExceptionHandler(_Connection_ExceptionHandler);
}

private void _Connection_ExceptionHandler(object sender, EMSExceptionEventArgs args)
{
    EMSException e = args.Exception;
    // Single-server failure: "Connection has been terminated"
    // Fault-tolerant multi-server: "Connection has performed fault-tolerant switch to <server url>"
    MessageBox.Show(e.ToString());
}
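To compare retry settings, it can help to work out roughly how long the client keeps trying. The sketch below simply multiplies the attempt count by the delay; ReconnectWindow and Estimate are hypothetical names, and the result is only a rough figure because it ignores the per-attempt timeout and however the library counts attempts across the server list.

```csharp
using System;

// Hypothetical helper: rough estimate of the total reconnect window.
public static class ReconnectWindow
{
    // Attempts multiplied by the delay between attempts, as a TimeSpan.
    public static TimeSpan Estimate(int attemptCount, int delayMs)
    {
        return TimeSpan.FromMilliseconds((double)attemptCount * delayMs);
    }
}
```

With the values used above, ReconnectWindow.Estimate(30, 120000) comes to one hour, whereas 5 retries at half-second intervals, Estimate(5, 500), comes to 2.5 seconds, which is why quick retries give up long before a slow failover completes.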
Astrea answered 24/10, 2008 at 19:31 Comment(0)
O
1

Client applications may receive notification of a failover by setting the tibco.tibjms.ft.switch.exception system property.

Perhaps the library needs that to work?

Overburdensome answered 16/10, 2008 at 21:4 Comment(0)
