Elegant structured text file parsing

Asked 21/10, 2008 at 23:0 Answered 1/11, 2009 at 16:17

I need to parse a transcript of a live chat conversation. My first thought on seeing the file was to throw regular expressions at the problem but I was wondering what other approaches people have used.

I put elegant in the title as i've previously found that this type of task has a danger of getting hard to maintain just relying on regular expressions.

The transcripts are being generated by www.providesupport.com and emailed to an account, I then extract a plain text transcript attachment from the email.

The reason for parsing the file is to extract the conversation text for later but also to identify visitors and operators names so that the information can be made available via a CRM.

Here is an example of a transcript file:

Chat Transcript

Visitor: Random Website Visitor 
Operator: Milton
Company: Initech
Started: 16 Oct 2008 9:13:58
Finished: 16 Oct 2008 9:45:44

Random Website Visitor: Where do i get the cover sheet for the TPS report?
* There are no operators available at the moment. If you would like to leave a message, please type it in the input field below and click "Send" button
* Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.
Milton: Y-- Excuse me. You-- I believe you have my stapler?
Random Website Visitor: I really just need the cover sheet, okay?
Milton: it's not okay because if they take my stapler then I'll, I'll, I'll set the building on fire...
Random Website Visitor: oh i found it, thanks anyway.
* Random Website Visitor is now off-line and may not reply. Currently in room: Milton.
Milton: Well, Ok. But… that's the last straw.
* Milton has left the conversation. Currently in room:  room is empty.

Visitor Details
---------------
Your Name: Random Website Visitor
Your Question: Where do i get the cover sheet for the TPS report?
IP Address: 255.255.255.255
Host Name: 255.255.255.255
Referrer: Unknown
Browser/OS: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)

Vincennes answered 21/10, 2008 at 23:0 Comment(1)

How much will the format vary? Will the order of the header/footer lines always be the same? Are there optional headers/footers? – Fluor 1/11, 2009 at 16:43

No and in fact, for the specific type of task you describe, I doubt there's a "cleaner" way to do it than regular expressions. It looks like your files have embedded line breaks so typically what we'll do here is make the line your unit of decomposition, applying per-line regexes. Meanwhile, you create a small state machine and use regex matches to trigger transitions in that state machine. This way you know where you are in the file, and what types of character data you can expect. Also, consider using named capture groups and loading the regexes from an external file. That way if the format of your transcript changes, it's a simple matter of tweaking the regex, rather than writing new parse-specific code.

Trephine answered 21/10, 2008 at 23:25 Comment(0)

With Perl, you can use Parse::RecDescent

It is simple, and your grammar will be maintainable later on.

Frivol answered 22/10, 2008 at 3:1 Comment(0)

You might want to consider a full parser generator.

Regular expressions are good for searching text for small substrings but they're woefully under-powered if you're really interested in parsing the entire file into meaningful data.

They are especially insufficient if the context of the substring is important.

Most people throw regexes at everything because that's what they know. They've never learned any parser generating tools and they end up coding a lot of the production rule composition and semantic action handling that you can get for free with a parser generator.

Regexes are great and all, but if you need a parser they're no substitute.

Connubial answered 22/10, 2008 at 0:23 Comment(1)

I agree. For serious parsing you really want to use a full-fledged parser. But the format of the file you specified doesn't seem complex enough to warrant this (could be wrong). – Trephine 22/10, 2008 at 0:31

Here's two parsers based on lepl parser generator library. They both produce the same result.

from pprint import pprint
from lepl import AnyBut, Drop, Eos, Newline, Separator, SkipTo, Space

# field = name , ":" , value
name, value = AnyBut(':\n')[1:,...], AnyBut('\n')[::'n',...]    
with Separator(~Space()[:]):
    field = name & Drop(':') & value & ~(Newline() | Eos()) > tuple

header_start   = SkipTo('Chat Transcript' & Newline()[2])
header         = ~header_start & field[1:] > dict
server_message = Drop('* ') & AnyBut('\n')[:,...] & ~Newline() > 'Server'
conversation   = (server_message | field)[1:] > list
footer_start   = 'Visitor Details' & Newline() & '-'*15 & Newline()
footer         = ~footer_start & field[1:] > dict
chat_log       = header & ~Newline() & conversation & ~Newline() & footer

pprint(chat_log.parse_file(open('chat.log')))

Stricter Parser

from pprint import pprint
from lepl import And, Drop, Newline, Or, Regexp, SkipTo

def Field(name, value=Regexp(r'\s*(.*?)\s*?\n')):
    """'name , ":" , value' matcher"""
    return name & Drop(':') & value > tuple

Fields = lambda names: reduce(And, map(Field, names))

header_start   = SkipTo(Regexp(r'^Chat Transcript$') & Newline()[2])
header_fields  = Fields("Visitor Operator Company Started Finished".split())
server_message = Regexp(r'^\* (.*?)\n') > 'Server'
footer_fields  = Fields(("Your Name, Your Question, IP Address, "
                         "Host Name, Referrer, Browser/OS").split(', '))

with open('chat.log') as f:
    # parse header to find Visitor and Operator's names
    headers, = (~header_start & header_fields > dict).parse_file(f)
    # only Visitor, Operator and Server may take part in the conversation
    message = reduce(Or, [Field(headers[name])
                          for name in "Visitor Operator".split()])
    conversation = (message | server_message)[1:]
    messages, footers = ((conversation > list)
                         & Drop('\nVisitor Details\n---------------\n')
                         & (footer_fields > dict)).parse_file(f)

pprint((headers, messages, footers))

Output:

({'Company': 'Initech',
  'Finished': '16 Oct 2008 9:45:44',
  'Operator': 'Milton',
  'Started': '16 Oct 2008 9:13:58',
  'Visitor': 'Random Website Visitor'},
 [('Random Website Visitor',
   'Where do i get the cover sheet for the TPS report?'),
  ('Server',
   'There are no operators available at the moment. If you would like to leave a message, please type it in the input field below and click "Send" button'),
  ('Server',
   'Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.'),
  ('Milton', 'Y-- Excuse me. You-- I believe you have my stapler?'),
  ('Random Website Visitor', 'I really just need the cover sheet, okay?'),
  ('Milton',
   "it's not okay because if they take my stapler then I'll, I'll, I'll set the building on fire..."),
  ('Random Website Visitor', 'oh i found it, thanks anyway.'),
  ('Server',
   'Random Website Visitor is now off-line and may not reply. Currently in room: Milton.'),
  ('Milton', "Well, Ok. But… that's the last straw."),
  ('Server',
   'Milton has left the conversation. Currently in room:  room is empty.')],
 {'Browser/OS': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)',
  'Host Name': '255.255.255.255',
  'IP Address': '255.255.255.255',
  'Referrer': 'Unknown',
  'Your Name': 'Random Website Visitor',
  'Your Question': 'Where do i get the cover sheet for the TPS report?'})

Hydromagnetics answered 1/11, 2009 at 16:17 Comment(0)

Build a parser? I can't decide if your data is regular enough for that, but it might be worth looking into.

Fruitage answered 22/10, 2008 at 0:3 Comment(0)

Using multiline, commented regexs can mitigate the maintainance problem somewhat. Try and avoid the one line super regex!

Also, consider breaking the regex down into individual tasks, one for each 'thing' you want to get. eg.

visitor = text.find(/Visitor:(.*)/)
operator = text.find(/Operator:(.*)/)
body = text.find(/whatever....)

instead of

text.match(/Visitor:(.*)\nOperator:(.*)...whatever to giant regex/m) do
  visitor = $1
  operator = $2
  etc.
end

Then it makes it easy to change how any particular item is parsed. As far as parsing through a file with many "chat blocks", just have a single simple regex that matches a single chat block, iterate over the text and pass the match data from this to your group of other matchers.

This will obviously affect performance, but unless you processing enormous files i wouldnt worry.

Catenary answered 21/10, 2008 at 23:18 Comment(0)

Consider using Ragel https://www.colm.net/open-source/ragel/

That's what powers mongrel under the hood. Parsing a string multiple times is going to slow things down dramatically.

Thumb answered 22/10, 2008 at 0:36 Comment(0)

I have used Paul McGuire's pyParsing class library and I continue to be impressed by it, in that it's well-documented, easy to get started, and the rules are easy to tweak and maintain. BTW, the rules are expressed in your python code. It certainly appears that the log file has enough regularity to parse each line as a stand-alone unit.

Balkan answered 22/10, 2008 at 17:5 Comment(0)

Just a quick post, I've only glanced at your transcript example but I've recently also had to look into text parsing and hoped to avoid going the route of hand rolled parsing. I did happen across Ragel which I've only started to get my head around but it's looking to be pretty useful.

Durden answered 22/10, 2008 at 0:11 Comment(0)

Stricter Parser

Recommended topics

Hot tags