Problem
I'm trying to find a flexible way to parse email content. Below is an example of the dummy email text I'm working with; it is only a small subset of a full email. I'd like to avoid regular expressions if at all possible, although at this point I'm beginning to think they're inevitable. What I need is to parse every field (e.g. Ticket No, Cell Phone) into its respective data type. Also, some fields are not guaranteed to be present in the email (my current solution below shows why this is a problem).
Header Code:EMERGENCY
Ticket No: 123456789 Seq. No: 2
Update of:
Original Call Date: 01/02/2011 Time: 11:17:03 AM OP: 1102
Second Call Date: 01/02/2011 Time: 12:11:00 AM OP:
Company: COMPANY NAME
Contact: CONTACT NAME Contact Phone: (111)111-1111
Secondary Contact: SECONDARY CONTACT
Alternate Contact: Altern. Phone:
Best Time to Call: AFTER 4:30P Fax No: (111)111-1111
Cell Phone: Pager No:
Caller Address: 330 FOO
FOO AVENUE 123
Current Solution
For this simple example I can successfully parse most fields with the function below.
    private T BetweenOperation<T>(string emailBody, string start, string end)
    {
        var culture = StringComparison.InvariantCulture;
        int startIndex =
            emailBody.IndexOf(start, culture) + start.Length;
        int endIndex =
            emailBody.IndexOf(end, culture);
        int length = endIndex - startIndex;
        if (length < 0) return default(T);
        return (T)Convert.ChangeType(
            emailBody.Substring(startIndex, length).Trim(),
            typeof(T));
    }
Essentially, my idea was that I could parse the content between two fields. For example, I could get the header code by doing
// returns "EMERGENCY"
BetweenOperation<string>("email content", "Header Code:", "Ticket No:")
This approach, however, has many flaws. One big flaw is that the end field is not always present. As you can see, there are also keys that share keywords, such as "Contact" and "Secondary Contact", which don't parse quite right and cause the parser to fetch too much information. Also, if my end field is not present, I get an unpredictable result. Lastly, I can extract an entire line and then pass it to BetweenOperation<T> using this:
    private string LineOperation(string startWithCriteria)
    {
        string[] emailLines = EmailBody.Split(new[] { '\n' });
        return
            emailLines.Where(emailLine => emailLine.StartsWith(startWithCriteria))
                      .FirstOrDefault();
    }
We would use LineOperation in some cases where the field name is not unique (e.g. Time) and feed the result to BetweenOperation<T>.
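To make that combination concrete, here is a minimal, self-contained sketch of the two helpers working together on the second "Time" field. The class name EmailFieldDemo and the static method names Between and Line are made up for this illustration; the logic mirrors BetweenOperation<T> and LineOperation above, except that the email body is passed in as a parameter.

```csharp
using System;
using System.Linq;

public static class EmailFieldDemo
{
    // Same logic as BetweenOperation<T>, made static for this demo.
    public static T Between<T>(string text, string start, string end)
    {
        var comparison = StringComparison.InvariantCulture;
        int startIndex = text.IndexOf(start, comparison) + start.Length;
        int endIndex = text.IndexOf(end, comparison);
        int length = endIndex - startIndex;
        if (length < 0) return default(T);
        return (T)Convert.ChangeType(
            text.Substring(startIndex, length).Trim(), typeof(T));
    }

    // Same logic as LineOperation, with the body passed in explicitly.
    public static string Line(string emailBody, string startsWith)
    {
        return emailBody.Split('\n')
                        .FirstOrDefault(l => l.StartsWith(startsWith));
    }

    public static void Main()
    {
        string body =
            "Original Call Date: 01/02/2011 Time: 11:17:03 AM OP: 1102\n" +
            "Second Call Date: 01/02/2011 Time: 12:11:00 AM OP:\n";

        // Narrow down to one line first, so "Time:" is unambiguous.
        string line = Line(body, "Second Call Date:");
        string time = Between<string>(line, "Time:", "OP:");

        Console.WriteLine(time); // prints "12:11:00 AM"
    }
}
```

This only works because "OP:" happens to follow "Time:" on the same line; it still depends on knowing the end field in advance.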
Question
How can I parse the content shown above based on keys, such as "Header Code" and "Cell Phone"? Note that I don't think parsing based on spaces or tabs will work, because some fields can span several lines (e.g. Caller Address) or contain no value at all (e.g. Altern. Phone).
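To illustrate the kind of key-driven lookup I have in mind, here is a rough sketch that avoids regular expressions. It assumes the full list of possible keys is known up front and that each key appears at most once. The class KeyedParser and its Parse method are hypothetical names, and this is one possible direction rather than a finished solution. Each value runs from the end of its key to the start of the next key actually found, so missing fields are simply skipped and multi-line values are kept.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeyedParser
{
    // Hypothetical helper: maps each key found in the body to the raw text
    // that follows it, up to the next found key (or the end of the body).
    public static Dictionary<string, string> Parse(
        string body, IEnumerable<string> keys)
    {
        var hits = new List<(int Index, string Key)>();

        // Longer keys first, so "Secondary Contact:" is recorded
        // before the shorter "Contact:" is searched for.
        foreach (string key in keys.OrderByDescending(k => k.Length))
        {
            int from = 0;
            while (true)
            {
                int index = body.IndexOf(key, from, StringComparison.Ordinal);
                if (index < 0) break; // key not present: skip it

                // Ignore a match that lies inside a longer key already found.
                bool inside = hits.Any(h =>
                    index >= h.Index && index < h.Index + h.Key.Length);
                if (!inside)
                {
                    hits.Add((index, key));
                    break; // assumption: only the first occurrence matters
                }
                from = index + 1;
            }
        }

        hits.Sort((a, b) => a.Index.CompareTo(b.Index));

        var result = new Dictionary<string, string>();
        for (int i = 0; i < hits.Count; i++)
        {
            int start = hits[i].Index + hits[i].Key.Length;
            int end = i + 1 < hits.Count ? hits[i + 1].Index : body.Length;
            result[hits[i].Key] = body.Substring(start, end - start).Trim();
        }
        return result;
    }

    public static void Main()
    {
        string body =
            "Header Code:EMERGENCY\n" +
            "Ticket No: 123456789 Seq. No: 2\n" +
            "Contact: CONTACT NAME Contact Phone: (111)111-1111\n" +
            "Secondary Contact: SECONDARY CONTACT\n" +
            "Cell Phone: Pager No:\n" +
            "Caller Address: 330 FOO\nFOO AVENUE 123";

        var keys = new[]
        {
            "Header Code:", "Ticket No:", "Seq. No:", "Contact:",
            "Contact Phone:", "Secondary Contact:", "Cell Phone:",
            "Pager No:", "Caller Address:", "Fax No:" // "Fax No:" is absent
        };

        var fields = Parse(body, keys);
        Console.WriteLine(fields["Contact:"]);           // prints "CONTACT NAME"
        Console.WriteLine(fields["Secondary Contact:"]); // prints "SECONDARY CONTACT"
        Console.WriteLine(fields.ContainsKey("Fax No:")); // prints "False"
    }
}
```

Repeated keys such as "Time:" are not handled by this sketch, and the values still come back as strings, so the conversion to the proper data type would need a separate step.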
Thank you.