Weird behavior of Quickfix/j after "Sent test request TEST" happens
Asked Answered
S

2

5

I'm using Quickfix/J to receive the message but "Sent test request TEST" happen. The log file (FIX.4.2-A-B.event.log)shows that:

23:19:05: Sent test request TEST   
23:19:32: Disconnecting: Timed out waiting for heartbeat   
23:19:33: Initiated logon request   
23:19:44: Disconnecting: Timed out waiting for logon response   
23:19:45: Initiated logon request   
23:19:56: Disconnecting: Timed out waiting for logon response  ...

But I find something in another log file(FIX.4.2-A-B.message.log):

8=FIX.4.2|9=68|35=1|34=250|49=A|52=20140224-23:19:05.909|56=B|112=TEST|10=106 
8=FIX.4.2|9=74|35=0|49=B|56=A|43=N|34=1320|52=20140224-23:19:23.381|112=TEST|10=130 

Which apparently indicates that the counter party B has already sent back the heart beat to us.

And the FIX.4.2-A-B.messages.log file is still growing!!!!!!!!!!! The file keep receiving messages but the Quickfix/J process none(nothing happens within onMessage() method)!!!!

Please help to walk me through why this is happening? why the connection is still lost after receiving heartbeat and the log is disconnecting?


Since this problem hasn't been solved yet. Here is the update:

My config:

[default]
FileStorePath=/target/data/fixapplication
ConnectionType=initiator
SenderCompID=A
TargetCompID=B
SocketConnectHost=xxx.xxx.xxx.xx
StartTime=00:00:00
EndTime=00:00:00
HeartBtInt=30
ReconnectInterval=5
FileLogPath=logs/fix/

[session]
BeginString=FIX.4.2
SocketConnectPort=xxx
ValidateFieldsOutOfOrder=N
ResetOnLogon=Y

Here's the codes:

@Override
    public void fromAdmin(Message message, SessionID sessionID) throws FieldNotFound,
            IncorrectDataFormat, IncorrectTagValue, RejectLogon {
        if (adminLog.isInfoEnabled())
            adminLog.info("Inside fromAdmin(): " + message.getHeader().getString(MsgType.FIELD)
                    + " ; " + message);
    }

    @Override
    public void fromApp(Message message, SessionID sessionID) throws FieldNotFound,
            IncorrectDataFormat, IncorrectTagValue, UnsupportedMessageType {

        crack(message, sessionID);
    }

    @Override
    public void onCreate(SessionID arg0) {

    }

    @Override
    public void onLogon(SessionID session) {
        if (adminLog.isInfoEnabled()) {
            adminLog.info("Inside onLogon: Logon completed " + session.toString());
        }
    }

    @Override
    public void onLogout(SessionID sessionId) {
        if (adminLog.isInfoEnabled()) {
            adminLog.info("Inside onLogout: Logout completed " + sessionId.toString());
        }
    }

Basically I let Quickfix/J to handle the connection itself.

The problem is that the app still keeping receiving messages but not processing them, and the log file shows that it's not connecting. All the messages are in the FIX.4.2-A-B.message.log, that's why it keep growing.

I find some similar cases:

http://www.quickfixj.org/jira/browse/QFJ-668

http://www.quickfixj.org/jira/browse/QFJ-624

http://quickfix-j.364392.n2.nabble.com/Timed-out-waiting-for-heartbeat-td365186.html

http://www.quickfixj.org/jira/browse/QFJ-759

Unfortunately none of them provides solution. So Please help me out.


The second update is as follow:

I extract the log files (I'm A and counter party is B):

8=FIX.4.2|9=58|35=0|34=49|49=A|52=23:23:50.075|56=B|10=030|
8=FIX.4.2|9=58|35=0|34=50|49=A|52=23:24:20.074|56=B|10=019|
**8=FIX.4.2|9=67|35=1|34=51|49=A|52=23:24:46.074|56=B|112=TEST|10=047|**
**8=FIX.4.2|9=74|35=0|49=B|56=A|43=N|34=1103|52=23:24:59.060|112=TEST|10=125|**
8=FIX.4.2|9=75|35=A|34=1|49=A|52=23:25:14.076|56=B|98=0|108=30|141=Y|10=059|
8=FIX.4.2|9=74|35=A|49=B|56=A|43=N|34=1|52=23:25:27.076|98=0|108=30|10=004|
8=FIX.4.2|9=86|35=1|49=B|56=A|43=N|34=2|52=23:25:27.077|112=03/03/2014-07:25:27|10=117|
8=FIX.4.2|9=75|35=A|34=1|49=A|52=23:25:26.076|56=B|98=0|108=30|141=Y|10=062|
8=FIX.4.2|9=74|35=A|49=B|56=A|43=N|34=1|52=23:25:39.064|98=0|108=30|10=004|
8=FIX.4.2|9=86|35=1|49=B|56=A|43=N|34=2|52=23:25:39.065|112=03/03/2014-07:25:39|10=120|
8=FIX.4.2|9=75|35=A|34=1|49=A|52=23:25:38.076|56=B|98=0|108=30|141=Y|10=065|
8=FIX.4.2|9=74|35=A|49=B|56=A|43=N|34=1|52=23:25:51.064|98=0|108=30|10=254|
8=FIX.4.2|9=86|35=1|49=B|56=A|43=N|34=2|52=23:25:51.065|112=03/03/2014-07:25:51|10=108|
8=FIX.4.2|9=75|35=A|34=1|49=A|52=23:25:50.076|56=B|98=0|108=30|141=Y|10=059|
8=FIX.4.2|9=74|35=A|49=B|56=A|43=N|34=1|52=23:26:03.064|98=0|108=30|10=252|
8=FIX.4.2|9=86|35=1|49=B|56=A|43=N|34=2|52=23:26:03.065|112=03/03/2014-07:26:03|10=104|
8=FIX.4.2|9=75|35=A|34=1|49=A|52=23:26:02.076|56=B|98=0|108=30|141=Y|10=057|
8=FIX.4.2|9=74|35=A|49=B|56=A|43=N|34=1|52=23:26:15.064|98=0|108=30|10=255|
8=FIX.4.2|9=86|35=1|49=B|56=A|43=N|34=2|52=23:26:15.065|112=03/03/2014-07:26:15|10=110|
8=FIX.4.2|9=75|35=A|34=1|49=A|52=23:26:14.076|56=B|98=0|108=30|141=Y|10=060|
8=FIX.4.2|9=74|35=A|49=B|56=A|43=N|34=1|52=23:26:27.064|98=0|108=30|10=002|
8=FIX.4.2|9=86|35=1|49=B|56=A|43=N|34=2|52=23:26:27.065|112=03/03/2014-07:26:27|10=116|
8=FIX.4.2|9=75|35=A|34=1|49=A|52=23:26:26.076|56=B|98=0|108=30|141=Y|10=063|
8=FIX.4.2|9=74|35=A|49=B|56=A|43=N|34=1|52=23:26:39.065|98=0|108=30|10=006|

Please note that it keep trying to logon the session again but fail.

At the mean time, I am sure that the app keep receiving messages from counter party. The proof is that FIX.4.2-A-B.message.log file keeping growing with valid messages(not just heart beat but other valid message). That mean the connection isn't lost.

Why can't I logon the session?

Please help


The third updates:

My initiate successful logs for the logon is as below:

8=FIX.4.2|9=75|35=A|34=1|49=A|52=23:00:07.095|56=B|98=0|108=30|141=Y|10=055|
8=FIX.4.2|9=74|35=A|49=B|56=A|43=N|34=1|52=23:00:20.221|98=0|108=30|10=238|
8=FIX.4.2|9=86|35=1|49=B|56=A|43=N|34=2|52=23:00:20.222|112=07:00:20|10=081|
8=FIX.4.2|9=81|35=0|34=2|49=A|52=23:00:07.301|56=B|112=07:00:20|10=091|
8=FIX.4.2|9=57|35=0|34=3|49=A|52=23:00:38.075|56=B|10=228|
8=FIX.4.2|9=63|35=0|49=B|56=A|43=N|34=26|52=23:00:51.260|10=00

The third line indicates that counter party send me a TEST request and I didn't respond, which seems ok and the connection established.

Should I explicitly handle the response of the TEST request? It seems quickfix/j would handle for me.

Striker answered 25/2, 2014 at 3:47 Comment(17)
Sounds like a configuration problem. As you say, the messages look right.Cloyd
What is the value of HeartBtInt in your config file ?Selfimmolating
@Selfimmolating The heartbeat interval is 30s, as agreed with counterpartyStriker
@PeterLawrey the main problem I'm worried about is that on one hand the program didn't handle the message (onMessage() is not called) and log file said disconnecting, which made me thinking the program is disconnected with counter party. On another hand, they're actually connecting and receiving data into (FIX.4.2-A-B.messages.log) in the background. It's really strange. How to re-connect to counter party manually? Should re-connection be doing by quickfixj itself?Striker
@user838204 AFAIK QuickFix/J handles the connection/reconnection.Cloyd
@user838204 Can you please show your subclass of MessageCracker?Steak
There are many reasons why this can happen. In general, people are not going to be able to debug your software for you through SO. There is a different between a session logon and a network connection. If QFJ is connected, it will write messages received to the log file but if the session is not logged on it will not forward them to the application callback. The log snipped you included indicates some type of problem processing the logon messages. Possibly your counterparty is rejecting your logon (bad sender/target ID or sequence number, for example) or you are rejecting their response.Folger
@FrankSmith according to you, my quickfix/j program is actually connecting to counter party (NO network connection problem) but at session level, my program could not logon to counter party. Is that right? Thx, it's important to me. And may I ask what codes should I put within the fromAdmin() method or other methods to handle the reconnection problem? Or just let quickfixj handle it itself?I'm initiator.Striker
Are you connecting to a local server or a remote one? If you are connecting locally what are the TargetCompIDs, SenderCompIDs and SenderSubIDs that you are using for each side?Dodeca
@Dodeca I'm connecting to the remote serverStriker
Is this behaviour reproducible? Looks to me like QFJ noticed the heartbeat timeout at the same time as the counterparty and dropped the connection and because of this did not reply back to the TestRequest. As you can see QFJ even did not send out its own TestRequest because the connection dropped beforeSinapism
share your config file and your FromApp/Admin implementationsHurryscurry
@Workforce unfortunately it's not reproducible on my side, please refer to the links above. It may give you some hitsStriker
@alihaider Please find the updatesStriker
Can you confirm if your session is connected and the logon has happened successfully. From your post it doesn't seem the logon has happened, hence the heartbeat isn't being processed.Selfimmolating
@Selfimmolating Plz refer to the second update. I'm wondering why I can't logon to the session? Any explicit work to do to logon? Is is automatically handler by quickfix/jStriker
Was this issue solved? I am facing similar problemVillagomez
A
8

the FIX standard heartbeat behavior is as follows:

1. On Logon, the Initiator requests a heartbeat interval (usual default 30 seconds)
2. From now on every side expects at least one message every 30 seconds (defined heartbeat interval)
3. If there is no application message available, a heartbeat is sent instead
4. If one side does not get neither a heartbeart nor an application message after 30 seconds, a connection issue is suspected.
5. Therefore, a TestRequest is sent ("Are you still there?"): Sent test request TEST
6. This TestRequest is not answered by a Heartbeat with the supplied TestRequestID, the connection is considered dead.
7. Finally, the network connection is dropped: Disconnecting: Timed out waiting for heartbeat

In your third update, your initiator A did respond to the test request with a heartbeat, and supplied the requested id tag 112:

8=FIX.4.2|9=86|35=1|49=B|56=A|43=N|34=2|52=23:00:20.222|112=07:00:20|10=081|
8=FIX.4.2|9=81|35=0|34=2|49=A|52=23:00:07.301|56=B|112=07:00:20|10=091|

So that's good. From then on, A supplies a heartbeat 30 seconds later:

8=FIX.4.2|9=57|35=0|34=3|49=A|52=23:00:38.075|56=B|10=228|

But B does not reply for 13 seconds:

8=FIX.4.2|9=63|35=0|49=B|56=A|43=N|34=26|52=23:00:51.260|10=00

Is there some network issue between A and B? What problem are you experiencing?

Alexine answered 18/3, 2014 at 14:7 Comment(0)
A
0

Your analysis is generally ok, though here's a closer look at the times for each message:

23:19:05: Sent test request TEST  
23:19:05.909: Sent the 35=1 with 112=TEST
23:19:23.381: Received the 35=0 with 112=TEST
23:19:32: Disco timeout "waiting for heartbeat"
23:19:44: Disco timeout "waiting for logon response"

If there's something strange... in your network...

call the network team?

Something is causing your first disconnect. As the message says it's waiting for a heartbeat that probably doesn't include the 35=0 response to the test request. It's probably waiting for a 35=1 from B.

What kind of messages are filling up the FIX.4.2-A-B.messages.log? Can you post some examples?

Alexine answered 12/3, 2014 at 10:16 Comment(3)
So A is trying to logon to B and B is trying to logon to A. These are the "valid messages" that you're referring to. Which one is running as an initiator and which one is running as acceptor?Alexine
A is initiator and B is acceptor. I think it's not A is trying to logon to B and B is trying to logon to A but A is trying to logon to B and B responds to A and it should workStriker
8=FIX.4.2|9=75|35=A|34=1|49=A|52=23:25:14.076|56=B|98=0|108=30|141=Y|10=059| here you make the login request. 8=FIX.4.2|9=74|35=A|49=B|56=A|43=N|34=1|52=23:25:27.076|98=0|108=30|10=004| here B replies to the login request. 8=FIX.4.2|9=86|35=1|49=B|56=A|43=N|34=2|52=23:25:27.077|112=03/03/2014-07:25:27|10=117| here B is asking for a test request with tag 112=03/03/2014-07:25:27 which needs to be responded to quoting the same tag 112. 8=FIX.4.2|9=75|35=A|34=1|49=A|52=23:25:26.076|56=B|98=0|108=30|141=Y|10=062| You make another login request. But you don't respond to test request with tag 112.Alexine

© 2022 - 2024 — McMap. All rights reserved.