Flexible text parsing strategies

Asked 28/1, 2011 at 18:42 Answered 28/1, 2011 at 19:13

Problem

I'm trying to find a flexible way to parse email content. Below is a dummy example of the email text I'm working with. I'd like to avoid regular expressions if at all possible, although at this point I'm beginning to think they're inevitable. Note that this is only a small dummy subset of a full email. I need to parse every field (e.g. Ticket No, Cell Phone) into its respective data type. Also, some fields are not guaranteed to be present in the email (my current solution below shows why this is a problem).

Header Code:EMERGENCY                               
Ticket No:   123456789 Seq. No: 2
Update of:             

Original Call Date:     01/02/2011     Time:      11:17:03 AM  OP: 1102
Second Call Date:     01/02/2011     Time:      12:11:00 AM  OP: 

Company:           COMPANY NAME
Contact:      CONTACT NAME          Contact Phone: (111)111-1111
Secondary Contact: SECONDARY CONTACT
Alternate Contact:                       Altern. Phone:                  
Best Time to Call: AFTER 4:30P           Fax No:        (111)111-1111
Cell Phone:                              Pager No:                       
Caller Address: 330 FOO
                FOO AVENUE 123

Current Solution

For this simple example I'm successfully able to parse most fields with the function below.

private T BetweenOperation<T>(string emailBody, string start, string end)
{
 var culture = StringComparison.InvariantCulture;
 int startIndex =
  emailBody.IndexOf(start, culture) + start.Length;
 int endIndex =
  emailBody.IndexOf(end, culture);
 int length = endIndex - startIndex;

 if (length < 0) return default(T);

 return (T)Convert.ChangeType(
  emailBody.Substring(startIndex, length).Trim(), 
  typeof(T));
}

Essentially, my idea was that I could parse the content between two fields. For example, I could get the header code by doing

// returns "EMERGENCY"
BetweenOperation<string>("email content", "Header Code:", "Ticket No:")

This approach, however, has many flaws. One big flaw is that the end field is not always present, in which case I get an unpredictable result. Also, some keys contain other keys, such as "Contact" and "Secondary Contact", and these don't parse quite right: the parser fetches too much information. Lastly, I can extract an entire line first and then pass it to BetweenOperation<T>, using this:

private string LineOperation(string startWithCriteria)
{
    string[] emailLines = EmailBody.Split(new[] { '\n' });

    return 
        emailLines.Where(emailLine => emailLine.StartsWith(startWithCriteria))
        .FirstOrDefault();
}

We would use LineOperation in some cases where the field name is not unique (e.g. Time) and feed the result to BetweenOperation<T>.

Question

How can I parse the content shown above based on keys, for example "Header Code" and "Cell Phone"? Note that I don't think parsing based on spaces or tabs will work, because some fields can span several lines (e.g. Caller Address) or contain no value at all (e.g. Altern. Phone).

Thank you.

Crossley answered 28/1, 2011 at 18:42 Comment(4)

This may be more than you need, but language parsers such as ANTLR are an option. – Gooey 28/1, 2011 at 18:52

Do you know the field names in advance, i.e. there's a fixed set of field names, or can they vary? – Extradite 28/1, 2011 at 19:15

@Eric - I only know what possible field names I can encounter. But I cannot know in advance (short of parsing) which emails will contain which fields. The fields present are pretty consistent and there are only a few cases where maybe 1-2 fields will be missing. – Crossley 28/1, 2011 at 19:38

Right, but you know the set of all possible fields, which makes the parsing much simpler. – Extradite 28/1, 2011 at 19:44

One way to approach the problem would be to first search the entire text for occurrences of your keys. That is, build an array that looks like:

"Header Code:",1
"Contact Phone:",233
"Cell Phone:",-1  // not there

If you sort that array by position, then you know where to look for things. That is, you'll know which field follows each one, and therefore where each value ends.

You'll have to do something with duplicates (e.g. the two "Time:" fields in the call dates). And you'll have to resolve "Contact:" and "Secondary Contact:", although that one should be pretty easy.

If you do this with standard string operations (i.e. IndexOf), it's going to be somewhat inefficient because you'll have to search the entire text for all occurrences of every string. Whether that's a problem for you is hard to say. Depends on how many of these you have to do.
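A minimal sketch of that key-position scan, assuming the set of possible keys is known in advance and that no key is empty. `KeyScanner` and `Scan` are hypothetical names, and the tuple syntax assumes C# 7 or later. It collects every occurrence of every key with `IndexOf`, drops hits that sit inside a longer key (the "Contact:" inside "Secondary Contact:"), sorts the rest by position, and treats the text up to the next key as the value:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the question's code.
public static class KeyScanner
{
    // Returns (key, value) pairs in document order. Duplicate keys such as
    // "Time:" come back as separate entries; absent keys simply do not appear.
    public static List<KeyValuePair<string, string>> Scan(string text, IEnumerable<string> keys)
    {
        var hits = new List<(int Pos, string Key)>();
        foreach (string key in keys)
        {
            if (key.Length == 0) continue; // an empty key would never advance

            for (int pos = text.IndexOf(key, StringComparison.Ordinal);
                 pos >= 0;
                 pos = text.IndexOf(key, pos + key.Length, StringComparison.Ordinal))
            {
                hits.Add((pos, key));
            }
        }

        // Drop a hit that lies entirely inside a longer hit, then sort by position.
        var kept = hits
            .Where(h => !hits.Any(o => o.Key.Length > h.Key.Length
                                    && o.Pos <= h.Pos
                                    && h.Pos + h.Key.Length <= o.Pos + o.Key.Length))
            .OrderBy(h => h.Pos)
            .ToList();

        var fields = new List<KeyValuePair<string, string>>();
        for (int i = 0; i < kept.Count; i++)
        {
            // The value runs from the end of this key to the start of the next one.
            int start = kept[i].Pos + kept[i].Key.Length;
            int end = i + 1 < kept.Count ? kept[i + 1].Pos : text.Length;
            fields.Add(new KeyValuePair<string, string>(
                kept[i].Key, text.Substring(start, end - start).Trim()));
        }
        return fields;
    }
}
```

Because the result is a list rather than a dictionary, the caller decides how to tell the first "Time:" from the second, for instance by looking at which key precedes it.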

If it becomes a problem, you'll probably want to build an Aho-Corasick string matcher, or something similar. Or you could build up a big ol' ugly regex:

"(Header Code:)|(Contact Phone:)|(Cell Phone:)" ... etc. Probably with named captures so you know which key you matched. It should work reasonably well, although it might be difficult to maintain.
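One way to keep such a regex maintainable is to generate it from the key list instead of writing it by hand. The sketch below is only an illustration: `RegexKeyScanner` and `Scan` are hypothetical names, and it assumes a non-empty list of non-empty keys. It uses a single named group, `key`, wrapped around the whole alternation rather than one capture per key:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the question's code.
public static class RegexKeyScanner
{
    public static List<KeyValuePair<string, string>> Scan(string text, IEnumerable<string> keys)
    {
        // Escape each key so ".", "(" and similar characters are matched literally.
        // Longest alternatives go first, so a key that is a prefix of another cannot win.
        string alternation = string.Join("|",
            keys.OrderByDescending(k => k.Length).Select(Regex.Escape));
        MatchCollection matches = Regex.Matches(text, "(?<key>" + alternation + ")");

        var fields = new List<KeyValuePair<string, string>>();
        for (int i = 0; i < matches.Count; i++)
        {
            Match m = matches[i];
            // The value runs from the end of this key to the start of the next match.
            int start = m.Index + m.Length;
            int end = i + 1 < matches.Count ? matches[i + 1].Index : text.Length;
            fields.Add(new KeyValuePair<string, string>(
                m.Groups["key"].Value, text.Substring(start, end - start).Trim()));
        }
        return fields;
    }
}
```

The regex engine scans left to right and matches do not overlap, so once "Secondary Contact:" has matched, the "Contact:" inside it is never reported separately.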

Luge answered 28/1, 2011 at 19:3 Comment(1)

Brute force comes to the rescue again. It ain't pretty, but if it works . . . – Luge 29/1, 2011 at 22:29

I would parse the fields in a specific sequence and, after each field is found, modify the email body accordingly.

Specific sequence

Contact:      CONTACT NAME          Contact Phone: (111)111-1111
Secondary Contact: SECONDARY CONTACT
Alternate Contact:

The sequence in which you search for your fields should start with keywords that are not substrings of any other keyword in your "Fields" (e.g. for contacts, the sequence should be "Secondary Contact:", "Alternate Contact:" and only then "Contact:").

Modify your email body

Once you have found the field information you require, modify the email body to remove it. Parsing in a specific sequence will ensure (I hope) that you won't have the mismatch issue, since the keywords that are substrings of others are removed last.

Now there is also the issue of the end keyword field. Since the end field is not always guaranteed to be there (and I am unsure whether the fields will always be in a specific order), you would have to loop through all your keyword fields, get the index of each, and pick the closest keyword based on those indexes.
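The two steps above could be sketched as follows. This is an illustration under assumptions, not a tested production parser: `SequentialExtractor` and `Extract` are hypothetical names, "longest key first" stands in for the hand-picked sequence, and only the first occurrence of each key is handled, so duplicates like "Time:" would need extra work:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the question's code.
public static class SequentialExtractor
{
    public static Dictionary<string, string> Extract(string body, IReadOnlyList<string> keys)
    {
        var result = new Dictionary<string, string>();
        string working = body;

        // Longest keys first: "Secondary Contact:" is consumed before "Contact:" is searched.
        foreach (string key in keys.OrderByDescending(k => k.Length))
        {
            int keyPos = working.IndexOf(key, StringComparison.Ordinal);
            if (keyPos < 0) continue; // optional field, not present in this email

            int valueStart = keyPos + key.Length;

            // The value ends at the closest key that follows, or at the end of the text.
            int valueEnd = working.Length;
            foreach (string other in keys)
            {
                int p = working.IndexOf(other, valueStart, StringComparison.Ordinal);
                if (p >= 0 && p < valueEnd) valueEnd = p;
            }

            result[key] = working.Substring(valueStart, valueEnd - valueStart).Trim();

            // Remove the key and its value so shorter keys cannot match inside them.
            working = working.Remove(keyPos, valueEnd - keyPos);
        }
        return result;
    }
}
```

Fields missing from the email are simply absent from the returned dictionary, so the caller can distinguish "not present" from "present but empty".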

Trinidad answered 28/1, 2011 at 19:13 Comment(0)

I had to do something similar back in the day, reading reports from a Pick DB. If your fields are position-based you can simply create an XML schema of your e-mail message:

<message>
    <line0>
        <element name="Header Code" start="0" end="MAX" type="string"/> 
        <!-- MAX Indicates whole line -->
    </line0> 
    <line1>
        <element name="Ticket No" start="0" end="20" type="string"/>
        <element name="Seq. No" start="22" end="40" type="int" />
    </line1>
</message>

Then to parse the e-mail you would read all lines of text. For each line (starting from 0) you would find the "line" + index number entity in the schema.

Create a temp string. For each element in the "line" + index entity, take a substring of the line from the start value to the end value defined in the element entity.

Convert the substring based on the element's type. Save the result to an object or something.
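A rough sketch of those steps using `System.Xml.Linq`. `PositionalParser` and `Parse` are hypothetical names, and the sketch assumes that `start` and `end` are zero-based column offsets delimiting the value (end exclusive), that `end="MAX"` means the rest of the line, and that only the `string` and `int` types occur:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the original answer.
public static class PositionalParser
{
    public static Dictionary<string, object> Parse(string schemaXml, string[] lines)
    {
        var result = new Dictionary<string, object>();
        XElement message = XDocument.Parse(schemaXml).Root;

        for (int i = 0; i < lines.Length; i++)
        {
            // Descendants (not Element) so that a lineN nested in a group is still found.
            XElement lineSpec = message.Descendants("line" + i).FirstOrDefault();
            if (lineSpec == null) continue; // no layout defined for this line

            string line = lines[i];
            foreach (XElement el in lineSpec.Elements("element"))
            {
                // Clamp offsets so a short line does not throw.
                int start = Math.Min((int)el.Attribute("start"), line.Length);
                string endAttr = (string)el.Attribute("end");
                int end = endAttr == "MAX"
                    ? line.Length
                    : Math.Min(int.Parse(endAttr, CultureInfo.InvariantCulture), line.Length);

                string raw = line.Substring(start, Math.Max(0, end - start)).Trim();

                // Convert according to the declared type; an unparsable int becomes null.
                object value = raw;
                if ((string)el.Attribute("type") == "int")
                {
                    value = int.TryParse(raw, NumberStyles.Integer,
                                         CultureInfo.InvariantCulture, out int n)
                        ? (object)n
                        : null;
                }
                result[(string)el.Attribute("name")] = value;
            }
        }
        return result;
    }
}
```

This only works while every email keeps the same line numbers and column positions; a missing or extra line shifts everything after it.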

You can even get more creative by grouping the different "line" + index entities in your schema into classes:

<message>
    <header>
        <line0>
        ...
        </line0>
    </header>
</message>

Belligerent answered 28/1, 2011 at 18:55 Comment(0)