parsing email text reply/forward

D

4

13

I am creating a web-based email client using C# and ASP.NET.

What confuses me is that email clients seem to add the original text in a lot of different ways when replying to a message.

Is there some standardized way to tell the new text apart from the quoted original?

Thank you -Theo

Divergence answered 11/3, 2010 at 11:11 Comment(0)

A

2

There isn't a standardized way, but a sensible heuristic will get you a good distance.

Some algorithms classify lines based on their initial character(s) and by comparing the text to a corpus of marked up text, yielding a statistical probability for each line that it is a) part of the same block as the next/previous one and b) quoted text, a signature, new text, etc.

It'd be worth trying out some of the most popular e-mail clients, creating some sample messages, and comparing them to see what the differences are. Usenet newsgroups may also help you build a reasonable corpus of messages to work from. HTML e-mail adds an extra level of complexity, of course, though most compliant mail clients will include the corresponding plain text as well. Different languages also cause issues, as clients which can parse "Paul wrote:" may fall over at "Pablo ha scritto:".
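As a rough illustration of the "classify each line by its initial characters" idea, here is a minimal sketch. The names `LineKind` and `ReplyLineClassifier` are made up for this example, and the rules are assumptions rather than a standard: a line starting with `>` is treated as quoted, a line ending in `:` that contains a digit or an `<address>` is treated as an attribution line, and `-- ` is treated as the start of a signature. A statistical classifier trained on a corpus would replace the hard-coded rules.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical categories for this sketch.
public enum LineKind { NewText, Quoted, Attribution, SignatureStart }

public static class ReplyLineClassifier
{
    // Quoted line: optional leading whitespace, then one or more '>' markers.
    private static readonly Regex Quoted = new Regex(@"^\s*>+");

    // Attribution line (assumed shape): contains an <address> or a digit
    // (for example part of a date) and ends with ':'. This avoids relying
    // on a language-specific word such as "wrote".
    private static readonly Regex Attribution =
        new Regex(@"(<[^<>\s]+@[^<>\s]+>|\d).*:\s*$");

    public static LineKind Classify(string line)
    {
        // "-- " is the conventional plain-text signature separator.
        if (line == "-- " || line == "--") return LineKind.SignatureStart;
        if (Quoted.IsMatch(line)) return LineKind.Quoted;
        if (Attribution.IsMatch(line)) return LineKind.Attribution;
        return LineKind.NewText;
    }

    // Classifies every line of a body, keeping the original order so that
    // in-line replies stay interleaved with the quoted text.
    public static List<KeyValuePair<LineKind, string>> ClassifyAll(string body)
    {
        var result = new List<KeyValuePair<LineKind, string>>();
        foreach (string line in body.Replace("\r\n", "\n").Split('\n'))
            result.Add(new KeyValuePair<LineKind, string>(Classify(line), line));
        return result;
    }
}
```

Note that a line such as "Pablo ha scritto:" with no date or address is still classified as new text here, which is exactly the kind of gap a corpus-based approach is meant to close.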

Aiguillette answered 19/3, 2010 at 21:14 Comment(2)

Not necessarily true, because "Paul wrote:" usually has a date and an <[email protected]>, which is language independent. – Divergence 22/3, 2010 at 11:38

I was referring to the last comment; the first link is rather helpful, and someone needs to get the bounty – Divergence 22/3, 2010 at 12:54

D

3

I was thinking:

// Requires: using System.Text.RegularExpressions;
public string cleanMsgBody(string oBody, out bool isReply)
{
    isReply = false;

    // Markers that usually introduce the quoted original:
    Regex rx1 = new Regex("\n-----");  // separator line, e.g. "-----Original Message-----"
    Regex rx2 = new Regex("\n([^\n]+):([ \t\r\n\v\f]+)>");  // "...:" line followed by a '>' quote
    Regex rx3 = new Regex("([0-9]+)/([0-9]+)/([0-9]+)([^\n]+)<([^\n]+)>");  // numeric date and <address>

    // Normalize CRLF, strip leading spaces, then collapse blank lines and repeated spaces.
    string txtBody = oBody.Replace("\r\n", "\n");
    txtBody = Regex.Replace(txtBody, "\n +", "\n");
    txtBody = Regex.Replace(txtBody, "\n{2,}", "\n");
    txtBody = Regex.Replace(txtBody, " {2,}", " ");

    // Cut the body at the first occurrence of each marker in turn.
    foreach (Regex rx in new[] { rx1, rx2, rx3 })
    {
        Match m = rx.Match(txtBody);
        if (!m.Success) continue;
        isReply = true;
        txtBody = txtBody.Substring(0, m.Index); // keep only the text before the marker
    }

    return txtBody;
}

Divergence answered 22/3, 2010 at 10:7 Comment(0)

F

2

Not really, no.

The RFC for the Internet Message Format (RFC 5322 and its predecessors) describes the In-Reply-To header, but it doesn't specify how the original text should be quoted in the body.

As you've found, different clients add the original text in different ways, which suggests there is no standard. Users will also do things differently:

- Plain text, "rich text", and HTML each have a different way of separating the reply from the original.
- In Outlook I can choose from the following options when replying to a message:
  - Do not include
  - Attach original message
  - Include original message text
  - Include and indent original message text
  - Prefix each line of the original message
- On top of that, I often send and receive replies that state "Responses in-line", where my comments are intermingled with the original message, so the original message no longer exists in its original form anyway.

Florence answered 15/3, 2010 at 13:6 Comment(1)

Hi, I know there is no official way of doing this, but I am sure using enough Regex coupled with email header parsing, a solution can be found. "Don't find fault, find a remedy." "I am looking for a lot of men who have an infinite capacity to not know what can't be done." - Henry Ford x2 – Divergence 15/3, 2010 at 17:32

F

1

Some heuristics you can try are:

- Any number of > characters
- Looking for "wrote: " (be very careful with this one)

You can also try relating the Message-ID field with the In-Reply-To field.
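One possible way to use the header relation, sketched under a few assumptions: you already have the Message-ID values and bodies of earlier messages stored somewhere, the In-Reply-To value holds a single identifier, and header parsing is done elsewhere. `ReplyThreading`, `FindParent`, and `RemoveParentLines` are hypothetical names for this example. Once the parent is found, lines of the reply that also occur in the parent can be dropped regardless of how the client quoted them.

```csharp
using System;
using System.Collections.Generic;

public static class ReplyThreading
{
    // Strips surrounding whitespace and angle brackets from a message identifier.
    private static string Normalize(string id)
    {
        return id == null ? null : id.Trim().Trim('<', '>');
    }

    // Returns the stored Message-ID that matches the reply's In-Reply-To
    // value, or null when the parent is unknown.
    public static string FindParent(string inReplyTo, IEnumerable<string> knownMessageIds)
    {
        string target = Normalize(inReplyTo);
        if (string.IsNullOrEmpty(target)) return null;
        foreach (string id in knownMessageIds)
            if (string.Equals(Normalize(id), target, StringComparison.Ordinal))
                return id;
        return null;
    }

    // Drops reply lines whose text (minus any '>' prefix) also occurs in the
    // parent body. Blank lines and lines not found in the parent are kept.
    public static string RemoveParentLines(string replyBody, string parentBody)
    {
        var parentLines = new HashSet<string>();
        foreach (string l in parentBody.Replace("\r\n", "\n").Split('\n'))
            if (l.Trim().Length > 0) parentLines.Add(l.Trim());

        var kept = new List<string>();
        foreach (string l in replyBody.Replace("\r\n", "\n").Split('\n'))
        {
            string bare = l.TrimStart('>', ' ', '\t').Trim();
            if (bare.Length == 0 || !parentLines.Contains(bare)) kept.Add(l);
        }
        return string.Join("\n", kept.ToArray());
    }
}
```

Exact line matching breaks when a client re-wraps the quoted text, so this is probably best combined with the prefix heuristics above rather than used on its own.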

And finally, if you cannot find a good library to do this, it is time to start this project. No more parsing emails the Cthulhu way :)

Flavorous answered 21/3, 2010 at 20:29 Comment(0)
