What's the most efficient way to parse FIX Protocol messages in .NET?

I came across this very similar question but that question is tagged QuickFIX (which is not relevant to my question) and most of the answers are QuickFIX-related.

My question is broader. I'm looking for the most efficient way to parse a FIX Protocol message using C#. By way of background, a FIX message consists of a series of tag/value pairs separated by the ASCII <SOH> character (0x01). The number of fields in a message is variable.

An example message might look like this:

8=FIX.4.2<SOH>9=175<SOH>35=D<SOH>49=BUY1<SOH>56=SELL1<SOH>34=2482<SOH>50=frg<SOH>
52=20100702-11:12:42<SOH>11=BS01000354924000<SOH>21=3<SOH>100=J<SOH>55=ILA SJ<SOH>
48=YY77<SOH>22=5<SOH>167=CS<SOH>207=J<SOH>54=1<SOH>60=20100702-11:12:42<SOH>
38=500<SOH>40=1<SOH>15=ZAR<SOH>59=0<SOH>10=230<SOH>

For each field, the tag (an integer) and the value (for our purposes, a string) are separated by the '=' character. (The precise semantics of each tag are defined in the protocol, but that isn't particularly germane to this question.)

It's often the case that when doing basic parsing, you are only interested in a handful of specific tags from the FIX header, and not really doing random access to every possible field. Strategies I have considered include:

  • Using String.Split, iterating over every element and putting the tag-to-index mapping in a Hashtable - provides full random access to all fields if needed at some point

  • (Slight optimisation) Using String.Split, scanning the array for tags of interest and putting the tag-to-index mapping into another container (not necessarily a Hashtable, as it may be a fairly small number of items, and the number of items is known prior to parsing)

  • Scanning the message field by field using String.IndexOf and storing the offset and length of fields of interest in an appropriate structure (a rough sketch of this follows the list)
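
For concreteness, the third strategy might be sketched roughly like this. All of the names here are mine, and it assumes values never contain <SOH> (the edit below explains why that isn't always true):

using System.Collections.Generic;

// Rough sketch of the IndexOf-based scan: record the offset and length of
// each field's value instead of allocating per-field strings.
static class FixScan
{
    const char SOH = '\x01';

    public struct FieldSlice
    {
        public int Tag;         // parsed tag number
        public int ValueStart;  // offset of the value within the message
        public int ValueLength; // length of the value
    }

    public static List<FieldSlice> ScanFields(string message)
    {
        var fields = new List<FieldSlice>();
        int pos = 0;
        while (pos < message.Length)
        {
            int eq = message.IndexOf('=', pos);
            if (eq < 0) break;
            int soh = message.IndexOf(SOH, eq + 1);
            if (soh < 0) soh = message.Length;

            int tag = 0;
            for (int i = pos; i < eq; i++)   // parse the tag without Substring
                tag = tag * 10 + (message[i] - '0');

            fields.Add(new FieldSlice { Tag = tag, ValueStart = eq + 1, ValueLength = soh - (eq + 1) });
            pos = soh + 1;
        }
        return fields;
    }
}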

Regarding the first two: although my measurements indicate String.Split is pretty fast, the documentation notes that the method allocates a new String for each element of the resulting array, which can generate a lot of garbage if you're parsing a lot of messages. Can anyone see a better way to tackle this problem in .NET?

EDIT:

Three vital pieces of information I left out:

  1. Tags are not necessarily unique within FIX messages, i.e., duplicate tags can occur under certain circumstances.

  2. Certain types of FIX fields can contain an embedded <SOH> in their data - these tags are referred to as being of type 'data', and a dictionary lists the tag numbers that are of this type.

  3. The eventual requirement is to be able to edit the message (particularly replace values).

Ike answered 5/2, 2011 at 15:52 Comment(8)
What is the bandwidth of the internet connection? Where do the parsing results get written to?Twiddle
How large are your files/how many do you have, so that it even matters?Phytosociology
@steve: So is your main problem that strings are allocated/copied multiple times by going the String.Split() route? If that is the case, will you have many repetitious strings occurring in the data that you do want to keep?Byler
@Hans - not sure where the internet connection comes in. Parsing results are kept in memory for further processing (see point 3 in my edit). @CodeInChaos - processing is 'in-line' - message rates may be high though. @gmagana - my concern is that the temporary string allocation will hurt performance under high throughput. Thx.Ike
It is highly relevant. There's no point in trying to save 10 microseconds (the cost of the Split calls) when it takes 160 microseconds just to receive the string (the time needed on a broadband connection).Twiddle
@Steve: I had not seen major concern about string allocations since the 128K RAM days (i.e., use BCD decimals because you save space, etc.)... What are the dynamics of the stuff you are parsing? How many objects total are we talking about? How disparate is the data - will you have thousands of repeated strings, or is it a very large number of different strings? Those are two fundamental questions to answer before giving you a good solution.Byler
@Hans - I see where you're coming from but my concern (and the thrust of the question) is specifically about processing messages efficiently once I have them. I take your point nevertheless. @gmagana - I'm not worried about memory pressure, I'm worried about latency spikes being caused by the garbage collector. To answer your specific questions, there are typically 50-100 fields in a message (although longer messages are common elsewhere); the data is repetitive but I'm not concerned about saving memory per se, as messages are processed and discarded very quickly.Ike
@SteveWilkinson - What did you end up doing?Cathrin

The assumption is that you are getting these messages either over the wire or loading them from disk. In either case, you can access them as a byte array and read that array in a forward-only manner. If you want/need/require high performance, then parse the byte array yourself (for high performance, don't use a dictionary or hashtable of tags and values, as this is extremely slow by comparison). Parsing the byte array yourself also means that you can skip data you are not interested in and optimise the parsing to reflect this.
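
For illustration, a minimal forward scan over the raw buffer might look like the sketch below. The callback shape and all names are just one way to do it, not any particular library's API:

using System;

// Forward scan over a raw FIX buffer. Each tag/value pair is reported as
// offsets into the buffer via a callback, so no per-field strings are
// allocated and uninteresting tags can simply be ignored.
static class FixScanner
{
    const byte SOH = 0x01;

    // onField(tag, valueOffset, valueLength) is invoked once per field.
    public static void Scan(byte[] buffer, int length, Action<int, int, int> onField)
    {
        int pos = 0;
        while (pos < length)
        {
            int tag = 0;
            while (pos < length && buffer[pos] != (byte)'=')  // parse the integer tag
                tag = tag * 10 + (buffer[pos++] - (byte)'0');
            pos++;                                            // skip '='

            int valueStart = pos;
            while (pos < length && buffer[pos] != SOH)        // value runs to <SOH>
                pos++;

            onField(tag, valueStart, pos - valueStart);
            pos++;                                            // skip <SOH>
        }
    }
}

A caller that only cares about a few header tags can switch on the tag inside the callback and copy out just those values.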

You should be able to avoid most object allocation easily. You can parse FIX float datatypes to doubles quite easily and very quickly without creating objects (you can outperform double.Parse massively with your own version here). The only ones you might need to think about a bit more are tag values that are strings, e.g. symbol values in FIX. To avoid creating strings there, you could come up with a simple method of determining a unique int identifier for each symbol (an int being a value type), and this will again help you avoid allocation on the heap.
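
As a sketch of the float point (assuming FIX decimal fields are sign/digits/point only, and ignoring malformed input):

using System;

// Parse a FIX price/quantity field straight from the wire bytes, with no
// intermediate string. FIX decimal fields have no exponent part, which is
// one reason a hand-rolled parse can beat double.Parse.
static class FixNumbers
{
    public static double ParseFixDouble(byte[] buffer, int offset, int length)
    {
        int end = offset + length;
        bool negative = buffer[offset] == (byte)'-';
        if (negative) offset++;

        long mantissa = 0;
        int fractionDigits = 0;
        bool seenPoint = false;
        for (int i = offset; i < end; i++)
        {
            if (buffer[i] == (byte)'.') { seenPoint = true; continue; }
            mantissa = mantissa * 10 + (buffer[i] - (byte)'0');
            if (seenPoint) fractionDigits++;
        }

        double value = mantissa / Math.Pow(10.0, fractionDigits);
        return negative ? -value : value;
    }
}

A production version would replace Math.Pow with a small lookup table of powers of ten, but the shape is the same.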

Customised, optimised parsing of the message, done properly, should easily outperform QuickFix, and you can do it all with no garbage collection in .NET or Java.

Kisumu answered 4/3, 2011 at 12:25 Comment(0)

I would definitely start by implementing your first approach, because it sounds clear and easy to do.

A Dictionary<int,Field> seems very good to me, maybe wrapped up in a FixMessage class exposing methods like GetFieldHavingTag(int tag) etc...
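
As a minimal sketch of what I mean (using plain string values rather than a Field type for brevity, and assuming unique tags - see the comments below for why that assumption can break):

using System;
using System.Collections.Generic;

// Minimal sketch of the wrapper. Duplicate tags silently overwrite here;
// a Dictionary<int, List<string>> would be needed to handle repeats.
class FixMessage
{
    const char SOH = '\x01';
    readonly Dictionary<int, string> _fields = new Dictionary<int, string>();

    public FixMessage(string raw)
    {
        foreach (var pair in raw.Split(new[] { SOH }, StringSplitOptions.RemoveEmptyEntries))
        {
            int eq = pair.IndexOf('=');
            _fields[int.Parse(pair.Substring(0, eq))] = pair.Substring(eq + 1);
        }
    }

    public string GetFieldHavingTag(int tag)
    {
        string value;
        return _fields.TryGetValue(tag, out value) ? value : null;
    }
}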

I don't know the FIX protocol, but looking at your example it seems that messages are usually short and the fields as well, so memory-allocation pressure shouldn't be a problem.

Of course, the only way to be sure whether an approach works for you is to implement it and test it.

If you notice that the method is slow when handling a lot of messages, then profile it and find out where the problem is.

If you can't solve it easily, then yes, change strategy, but I'd like to reinforce the idea that you need to test it first, then profile it, and only then change it.

So, let's imagine that after your first implementation you've noticed that a lot of string allocations are slowing down your performance when handling many messages.

Then yes, I would take an approach similar to your third one; let's call it the "on-demand/lazy" approach.

I'd build a FixMessage class that takes the message string and does nothing until a field is needed.
In that case I would use IndexOf (or something similar) to search for the requested field(s), perhaps caching results so that repeated requests for the same field are faster.
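
A rough sketch of that lazy variant (names are illustrative only):

using System.Collections.Generic;

// Keep the raw string; search on demand and cache hits so repeated
// lookups of the same tag are cheap.
class LazyFixMessage
{
    const char SOH = '\x01';
    readonly string _raw;
    readonly Dictionary<int, string> _cache = new Dictionary<int, string>();

    public LazyFixMessage(string raw) { _raw = raw; }

    public string GetFieldHavingTag(int tag)
    {
        string value;
        if (_cache.TryGetValue(tag, out value)) return value;

        // Match "tag=" at the start of the message, or "<SOH>tag=" elsewhere,
        // so that e.g. tag 8 never matches inside "38=".
        string needle = tag + "=";
        int pos = _raw.StartsWith(needle) ? 0 : _raw.IndexOf(SOH + needle);
        if (pos < 0) return null;

        int start = (pos == 0 ? 0 : pos + 1) + needle.Length;
        int end = _raw.IndexOf(SOH, start);
        value = _raw.Substring(start, (end < 0 ? _raw.Length : end) - start);
        _cache[tag] = value;
        return value;
    }
}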

Levitan answered 5/2, 2011 at 16:50 Comment(2)
Thanks for the suggestions. I'm conscious of the dangers of micro-optimisation that others have raised, but I equally want to try and pick the best approach to balance performance/efficiency with understandability/maintainability from the get-go. Unfortunately Dictionary<int, Field> isn't ideal because some tags can repeat (as per my later edit); some other (similar) lookup structure is definitely in order though. Appreciate the pointers - Steve.Ike
@SteveWilkinson: yes, I missed your last edit; anyway, something like a Dictionary<int,List<Field>> should work well too. However, even if micro-optimisation is usually bad, asking for opinions before starting a new implementation is not bad at all ;)Levitan

I know this is an answer to an older question - I only recently realized there are a lot of FIX-related questions on SO, so I thought I'd take a shot at answering this one.

The answer to your question may depend on the specific FIX messages you are actually parsing. In some cases, yes - you could just do a 'split' on the string, or what have you. But if you are going to parse all of the messages defined in the protocol, you don't really have a choice but to reference a FIX data dictionary and parse the message byte by byte. This is because, according to the specification, FIX messages can contain length-encoded fields whose data would interfere with any kind of "split" approach you might want to take.

The easiest way to do this is to reference the dictionary and retrieve a message definition based on the type (tag 35) of the message you've received. Then you need to extract the tags, one after the other, referencing the corresponding tag definition in the message definition in order to understand how the data associated with each tag needs to be parsed. This also helps you in the case of "repeating groups" that may exist in the message - you'll only know that a tag represents the start of a repeating group if you have the message definition from the dictionary.
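
To make the length-encoded case concrete, here is a rough sketch of how a dictionary-aware parser treats a 'data' field differently from an ordinary one (the tag numbers are examples - e.g. 96=RawData, whose length comes from 95=RawDataLength - and in practice the set comes from your dictionary):

using System.Collections.Generic;

// 'Data'-typed fields may contain <SOH>, so their length must come from the
// companion length tag that the dictionary says immediately precedes them.
// Ordinary fields simply scan to the next <SOH>.
static class DataFieldDemo
{
    static readonly HashSet<int> DataTags = new HashSet<int> { 89, 91, 96 };

    // declaredLength is the already-parsed value of the companion length tag.
    public static string ReadValue(string raw, int tag, int valueStart, int declaredLength)
    {
        if (DataTags.Contains(tag))
            return raw.Substring(valueStart, declaredLength); // may span <SOH> bytes

        int end = raw.IndexOf('\x01', valueStart);
        return raw.Substring(valueStart, (end < 0 ? raw.Length : end) - valueStart);
    }
}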

I hope this helps. If you'd like a reference example, I wrote the VersaFix open-source FIX engine for .NET, and that has a dictionary-based message parser in it. You can download the source code directly from our Subversion server by pointing your SVN client at:

svn://assimilate.com/VfxEngine/Trunk

Cheers.

Meeks answered 15/11, 2011 at 1:04 Comment(1)
is the SVN repo still available?Cathrin

You are probably better off, in all honesty, using QuickFix and building a Managed C++ wrapper for it. If you are at all concerned with latency, then you cannot perform allocations as part of the parsing, since that can cause the GC to run, which pauses your FIX engine. While paused, you cannot send or receive messages, which, as I am sure you know, is very, very bad.

There was one company that Microsoft highlighted a couple of years ago as having built a FIX engine entirely in C#. They would build a pool of objects to use over the course of the trading day and perform no allocations during the day.
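
For illustration, the pooling idea might look like this sketch (MessagePool is a hypothetical name, not a BCL type):

using System.Collections.Generic;

// Pre-allocated pool: every object is created before trading starts, so the
// steady state performs no heap allocations and gives the GC nothing to do.
class MessagePool<T> where T : new()
{
    readonly Stack<T> _items;

    public MessagePool(int capacity)
    {
        _items = new Stack<T>(capacity);
        for (int i = 0; i < capacity; i++)
            _items.Push(new T());
    }

    // Pop throws if the pool is exhausted - a deliberate signal that the
    // pool was sized too small, rather than silently allocating.
    public T Rent() { return _items.Pop(); }

    public void Return(T item) { _items.Push(item); }
}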

I don't know what your latency requirements are, but for what I am doing we have used code generation and different types of multithreaded heaps to improve performance and reduce latency. We use a mixture of C++ and Haskell.

Depending on your requirements, you could even implement your parser as a kernel-mode driver to allow messages to be constructed as they are received off the wire.

@Hans: 10 microseconds is a very long time. NASDAQ matches orders in 98 microseconds and SGX has announced that it will take 90 microseconds to cross when they roll their new platform this year.

Readymix answered 8/2, 2011 at 16:45 Comment(0)
