Parsing FIX protocol in regex?

I need to parse log files that contain FIX protocol messages.

Each line contains header information (timestamp, logging level, endpoint), followed by a FIX payload.

I've used regex to parse the header information into named groups. E.g.:

 (?P<datetime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) (?P<process_id>\d{4}/\d{1,2})\s*(?P<logging_level>\w*)\s*(?P<endpoint>\w*)\s*
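As a minimal sketch, the header pattern can be compiled and applied in Python like this. The sample log line is made up for illustration (the real header layout may differ), and the decimal point before the microseconds is escaped so that it only matches a literal dot.

```python
import re

# Header pattern with Python's (?P<name>...) named-group syntax.
HEADER_RE = re.compile(
    r"(?P<datetime>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"(?P<process_id>\d{4}/\d{1,2})\s*"
    r"(?P<logging_level>\w*)\s*"
    r"(?P<endpoint>\w*)\s*"
)

# Hypothetical log line: header fields, then a FIX payload separated by \x01.
line = "21/11/11 05:35:01.123456 1234/5 INFO gateway1 8=FIX.4.2\x019=61\x0135=A\x01"

m = HEADER_RE.match(line)
header = m.groupdict()      # named groups as a dict
payload = line[m.end():]    # everything after the header is the FIX payload
```

Slicing at `m.end()` keeps the header regex and the payload handling separate, so the payload can be parsed with whichever approach suits the data.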

I then come to the FIX payload itself (^A is the separator between tags), e.g.:

8=FIX.4.2^A9=61^A35=A...^A11=blahblah...

I need to extract specific tags from this (e.g. "A" from 35=, or "blahblah" from 11=) and ignore everything else: anything before "35=A", anything between that and "11=blahblah", anything after that, and so on.

I do know there are libraries that might be able to parse each and every tag (http://source.kentyde.com/fixlib/overview); however, I was hoping for a simple approach using regex here if possible, since I really only need a couple of tags.

Is there a good way in regex to extract the tags I require?

Cheers, Victor

Smallscale answered 21/11, 2011 at 5:35 Comment(0)

Use a regex tool like Expresso or RegexBuddy.

Why don't you split on ^A and then match ([^=]+)=(.*) against each piece, putting the results into a hash? You could also filter with a switch that by default doesn't add the tags you're uninterested in and that has a fall-through for all the tags you are interested in.
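A rough Python sketch of that split-then-match idea follows. The function name `extract_tags`, the `wanted` set and the sample message (including its 10=000 trailer) are illustrative assumptions, and a set lookup stands in for the switch.

```python
import re

SOH = "\x01"  # the separator that shows up as ^A in the log
FIELD_RE = re.compile(r"([^=]+)=(.*)", re.DOTALL)

def extract_tags(payload, wanted):
    """Split on SOH and keep only the tags listed in `wanted`."""
    fields = {}
    for chunk in payload.split(SOH):
        m = FIELD_RE.fullmatch(chunk)
        if m and m.group(1) in wanted:   # skip uninteresting tags
            fields[m.group(1)] = m.group(2)
    return fields

# Hypothetical message for illustration only.
sample = "8=FIX.4.2\x019=61\x0135=A\x0111=blahblah\x0110=000\x01"
print(extract_tags(sample, {"35", "11"}))
# {'35': 'A', '11': 'blahblah'}
```

Like any split-based approach, this assumes no field value contains the separator itself.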

Lili answered 21/11, 2011 at 5:37 Comment(0)

There is no need to split on "\x01", then regex, then filter. If you want just tags 34, 49 and 56 (MsgSeqNum, SenderCompID and TargetCompID), you can use a single regex:

dict(re.findall("(?:^|\x01)(34|49|56)=(.*?)\x01", raw_msg))

Simple regexes like this will work as long as you know your sender never embeds data that can itself contain the separator, which would break any simple regex. Specifically:

  1. No raw data fields, which come as a length/data pair such as RawDataLength, RawData (95/96) or XmlDataLen, XmlData (212/213)
  2. No encoded fields for Unicode strings, such as EncodedTextLen, EncodedText (354/355)

Handling those cases takes a lot of additional parsing. I use a custom Python parser, but even the fixlib code you referenced gets these cases wrong. If your data is free of these exceptions, the regex above should return a nice dict of your desired fields.
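For illustration only, here is one way a length-aware scan could look. This is not the custom parser mentioned above, just a sketch that assumes the three length/data pairs listed, a data field that immediately follows its length field, and a message in which every field ends with \x01. The name `parse_fix` and the sample bytes are hypothetical.

```python
SOH = b"\x01"
# Length tag -> data tag, for the pairs named in the list above.
LENGTH_TAGS = {b"95": b"96", b"212": b"213", b"354": b"355"}

def parse_fix(raw: bytes):
    """Return (tag, value) pairs, reading data fields by their declared length."""
    fields = []
    pos = 0
    pending = None  # (data_tag, length) announced by the previous field, if any
    while pos < len(raw):
        eq = raw.index(b"=", pos)
        tag = raw[pos:eq]
        if pending and pending[0] == tag:
            end = eq + 1 + pending[1]          # trust the declared byte length
            if raw[end:end + 1] != SOH:
                raise ValueError("declared length does not end at a separator")
        else:
            end = raw.index(SOH, eq + 1)       # ordinary field: read up to SOH
        value = raw[eq + 1:end]
        fields.append((tag, value))
        pending = (LENGTH_TAGS[tag], int(value)) if tag in LENGTH_TAGS else None
        pos = end + 1
    return fields

# Hypothetical message whose RawData (96) contains a separator and an "=".
sample = b"35=B\x0195=5\x0196=a\x01b=c\x0158=hi\x01"
fields = parse_fix(sample)
```

The scan works on bytes because the length fields count bytes, not characters. A malformed message raises `ValueError` rather than being silently misparsed.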

Edit: I've left the above regex as-is, but it should be revised so that the final match element is (?=\x01). The explanation can be found in @tropleee's answer here.
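A small sketch of the difference, using a made-up message in which the three wanted tags sit next to each other. With the consuming version, the trailing \x01 of one match is no longer available as the leading \x01 of the next, so the middle field is skipped. `re.DOTALL` is added here on the assumption that values may contain newlines.

```python
import re

# Original form: consumes the trailing separator.
CONSUMING_RE = re.compile(r"(?:^|\x01)(34|49|56)=(.*?)\x01", re.DOTALL)
# Revised form: only looks ahead at the trailing separator.
LOOKAHEAD_RE = re.compile(r"(?:^|\x01)(34|49|56)=(.*?)(?=\x01)", re.DOTALL)

# Hypothetical message with the wanted tags adjacent to one another.
raw_msg = "8=FIX.4.2\x019=61\x0135=A\x0134=1\x0149=SENDER\x0156=TARGET\x0110=000\x01"

consuming = dict(CONSUMING_RE.findall(raw_msg))
lookahead = dict(LOOKAHEAD_RE.findall(raw_msg))

print(consuming)  # {'34': '1', '56': 'TARGET'}
print(lookahead)  # {'34': '1', '49': 'SENDER', '56': 'TARGET'}
```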

Maramarabel answered 17/1, 2012 at 21:34 Comment(2)
This is a better answer than the accepted one. You need to account for the "len" fields, for sure. Everyone always forgets about these! In addition, FIX messages can contain newline characters (i.e. in tag 58), so you need to use re.DOTALL to be sure. - Hoem
As explained in this question, this solution has a bug -- it will fail when two matches are adjacent. - Garmaise

^A is actually \x{01}; that's just how it shows up in vim. In Perl, I did this with a split on hex 1 and then a split on "=" for each field. After the second split, element [0] of the array is the tag and element [1] is the value.
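The same two-step split could be sketched in Python as below (the answer describes Perl; this is only an equivalent illustration with a made-up message). Splitting on "=" at most once keeps any "=" inside a value intact.

```python
# Hypothetical payload; "\x01" is the byte that vim displays as ^A.
raw_msg = "8=FIX.4.2\x019=61\x0135=A\x0111=blahblah\x01"

# First split on \x01, then split each field once on "=".
pairs = [field.split("=", 1) for field in raw_msg.split("\x01") if "=" in field]

# pair[0] is the tag, pair[1] is the value.
tags = {pair[0]: pair[1] for pair in pairs}
```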

Foofaraw answered 29/12, 2011 at 19:5 Comment(0)