I have a body of text to scan, and each line contains at least two and sometimes four pieces of information. The problem is that each line can be any one of 15-20 different actions.
In Ruby, the current code looks roughly like this:
    text.split("\n").each do |line| # around 20 times
      ..............
      expressions['actions'].each do |pat, reg| # around 20 times
        .................
This is obviously 'THE PROBLEM'. I did manage to make it faster (in C++, by a 50% margin) by combining all the regexen into one, but that is still not the speed I require. I need to parse thousands of these files FAST!
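For readers unfamiliar with the "all regexen in one" approach, here is a minimal Ruby sketch of the idea. It is not the code from this post: the patterns and the group names (`post_player`, `call_amount`, and so on) are assumptions inferred only from the sample lines further down. Each alternative gets uniquely named groups, so a single match both identifies the action and extracts its fields.

```ruby
# Hypothetical combined pattern; the alternatives are guesses based on the
# sample hand history in this post, not the real set of 15-20 actions.
COMBINED = /\A(?:
  (?<post_player>.+)\ posts\ \$(?<post_amount>[\d.]+) |
  (?<fold_player>.+)\ folds |
  (?<call_player>.+)\ calls\ \$(?<call_amount>[\d.]+) |
  (?<raise_player>.+)\ raises\ to\ \$(?<raise_amount>[\d.]+)
)\z/x

# Returns a hash of the named groups that took part in the match,
# or nil when the line is not one of the known actions.
def match_action(line)
  m = COMBINED.match(line.strip)
  return nil unless m

  # Groups belonging to alternatives that did not match are nil; drop them.
  m.named_captures.compact
end
```

Which keys come back tells you which action matched. The regex engine may still try the alternatives one after another, so this mainly saves the per-pattern loop overhead rather than changing the amount of matching work.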
Right now I match them with regexes; however, this is intolerably slow. I started with Ruby and hopped over to C++ in hopes of getting a speed boost, and it just isn't happening.
I've casually read about PEGs and grammar-based parsing, but it looks somewhat difficult to implement. Is this the direction I should head in, or are there different routes?
Basically, I'm parsing poker hand histories, and each line of the hand history usually contains 2-3 bits of information that I need to collect: who the player was, how much money or which cards the action involved, etc.
Sample text that needs to be parsed:
    buriedtens posts $5
    The button is in seat #4
    *** HOLE CARDS ***
    Dealt to Mayhem 31337 [8s Ad]
    Sherwin7 folds
    OneMiKeee folds
    syhg99 calls $5
    buriedtens raises to $10
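One route that avoids regexes entirely is a small hand-written parser that looks for a fixed marker string in each line. The sketch below is only an illustration under stated assumptions: the markers and the result keys (`:action`, `:player`, `:amount`, `:cards`) are hypothetical and cover just the line shapes visible in the sample above. It searches from the right with `rindex` because player names can contain spaces (for example `Mayhem 31337`).

```ruby
# Hypothetical marker table, derived only from the sample lines above.
VERBS = {
  ' raises to $' => :raise,
  ' calls $'     => :call,
  ' posts $'     => :post
}.freeze

DEALT_PREFIX = 'Dealt to '.freeze

# Returns a hash describing the action, or nil for lines it does not know.
def parse_line(line)
  line = line.strip

  if line.end_with?(' folds')
    return { action: :fold, player: line.delete_suffix(' folds') }
  end

  if line.start_with?(DEALT_PREFIX) && line.end_with?(']') && (open = line.rindex(' ['))
    return {
      action: :deal,
      player: line[DEALT_PREFIX.length...open],
      cards:  line[(open + 2)...-1].split(' ')
    }
  end

  VERBS.each do |marker, action|
    i = line.rindex(marker)
    next unless i

    # Everything before the marker is the player, everything after is the amount.
    return { action: action, player: line[0...i], amount: line[(i + marker.length)..-1] }
  end

  nil
end
```

Usage would be along the lines of `text.each_line.map { |l| parse_line(l) }.compact`. Whether this beats a regex in practice would need measuring on real hand histories.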
After I collect this information, each action is turned into an XML node.
Right now my Ruby implementation of this is much faster than my C++ one, but that's probably just because I haven't written any C code in 4-5 years.
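As a rough illustration of that last step, the following sketch turns a hash of collected fields into a single XML element. The element name `action` and the attribute names are assumptions made for the example; the post does not show the real node layout. It escapes attribute values by hand so that it needs no XML library.

```ruby
# Characters that must be escaped inside a double-quoted XML attribute value.
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Builds a self-closing <action .../> element from a hash of fields.
# The element and attribute names are hypothetical, not taken from the post.
def action_to_xml(fields)
  attrs = fields.map do |key, value|
    escaped = value.to_s.gsub(/[&<>"]/, XML_ESCAPES)
    "#{key}=\"#{escaped}\""
  end
  "<action #{attrs.join(' ')}/>"
end
```

For example, `action_to_xml({ type: 'call', player: 'syhg99', amount: '5' })` yields `<action type="call" player="syhg99" amount="5"/>`.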
UPDATE: I don't want to post all the code here but so far my hands/second look like the following:
588 hands/second -- boost::spirit in C++
60 hands/second -- one very long and complicated regex in C++ (all the regexen put together)
33 hands/second -- normal regex style in Ruby
I'm currently testing ANTLR to see if we can go any further, but as of right now I'm very happy with Spirit's results.
Related question: Efficiently querying one string against multiple regexes.