How to remove the quoted text from an email and only show the new text
C

6

16

I am parsing emails. When I see a reply to an email, I would like to remove the quoted text so that I can append the text to the previous email (even if it's a reply).

Typically, you'll see this:

1st email (start of conversation)

This is the first email

2nd email (reply to first)

This is the second email

Tim said:
This is the first email

The output of this would be "This is the second email" only. Although different email clients quote text differently, if there were some way to get mostly the new email text only, that would also be acceptable.

Collinsworth answered 5/3, 2010 at 8:12 Comment(0)
A
16

I use the following regex(s) to match the lead in for quoted text (the last one is the one that counts):

  /** general spacers for time and date */
  private static final String spacers = "[\\s,/\\.\\-]";

  /** matches times */
  private static final String timePattern  = "(?:[0-2])?[0-9]:[0-5][0-9](?::[0-5][0-9])?(?:(?:\\s)?[AP]M)?";

  /** matches day of the week */
  private static final String dayPattern   = "(?:(?:Mon(?:day)?)|(?:Tue(?:sday)?)|(?:Wed(?:nesday)?)|(?:Thu(?:rsday)?)|(?:Fri(?:day)?)|(?:Sat(?:urday)?)|(?:Sun(?:day)?))";

  /** matches day of the month (number and st, nd, rd, th) */
  private static final String dayOfMonthPattern = "[0-3]?[0-9]" + spacers + "*(?:(?:th)|(?:st)|(?:nd)|(?:rd))?";

  /** matches months (numeric and text) */
  private static final String monthPattern = "(?:(?:Jan(?:uary)?)|(?:Feb(?:ruary)?)|(?:Mar(?:ch)?)|(?:Apr(?:il)?)|(?:May)|(?:Jun(?:e)?)|(?:Jul(?:y)?)" +
                                              "|(?:Aug(?:ust)?)|(?:Sep(?:tember)?)|(?:Oct(?:ober)?)|(?:Nov(?:ember)?)|(?:Dec(?:ember)?)|(?:[0-1]?[0-9]))";

  /** matches years (only 1000's and 2000's, because we are matching emails) */
  private static final String yearPattern  = "(?:[1-2]?[0-9])[0-9][0-9]";

  /** matches a full date */
  private static final String datePattern     = "(?:" + dayPattern + spacers + "+)?(?:(?:" + dayOfMonthPattern + spacers + "+" + monthPattern + ")|" +
                                                "(?:" + monthPattern + spacers + "+" + dayOfMonthPattern + "))" +
                                                 spacers + "+" + yearPattern;

  /** matches a date and time combo (in either order) */
  private static final String dateTimePattern = "(?:" + datePattern + "[\\s,]*(?:(?:at)|(?:@))?\\s*" + timePattern + ")|" +
                                                "(?:" + timePattern + "[\\s,]*(?:on)?\\s*"+ datePattern + ")";

  /** matches a leading line such as
   * ----Original Message----
   * or simply
   * ------------------------
   */
  private static final String leadInLine    = "-+\\s*(?:Original(?:\\sMessage)?)?\\s*-+\n";

  /** matches a header line indicating the date */
  private static final String dateLine    = "(?:(?:date)|(?:sent)|(?:time)):\\s*"+ dateTimePattern + ".*\n";

  /** matches a subject or address line */
  private static final String subjectOrAddressLine    = "(?:(?:from)|(?:subject)|(?:b?cc)|(?:to)):.*\n";

  /** matches gmail style quoted text beginning, i.e.
   * On Mon Jun 7, 2010 at 8:50 PM, Simon wrote:
   */
  private static final String gmailQuotedTextBeginning = "(On\\s+" + dateTimePattern + ".*wrote:\n)";


  /** matches the start of a quoted section of an email (needs import java.util.regex.Pattern) */
  private static final Pattern QUOTED_TEXT_BEGINNING = Pattern.compile("(?i)(?:(?:" + leadInLine + ")?" +
                                                                        "(?:(?:" +subjectOrAddressLine + ")|(?:" + dateLine + ")){2,6})|(?:" +
                                                                        gmailQuotedTextBeginning + ")"
                                                                      );

I know that in some ways this is overkill (and might be slow!) but it works pretty well. Please let me know if you find anything that doesn't match this so I can improve it!
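
One way to apply a pattern like this is to find the first match and keep only the text before it. The sketch below assumes that approach. `QuoteStripper`, `stripQuotedText`, and the much simpler stand-in pattern `SIMPLE_QUOTE_START` are hypothetical names used for illustration; in practice you would pass `QUOTED_TEXT_BEGINNING` from above instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteStripper {

    /** Simplified stand-in for QUOTED_TEXT_BEGINNING (an assumption, for illustration only). */
    static final Pattern SIMPLE_QUOTE_START =
        Pattern.compile("(?i)(?:-+\\s*Original Message\\s*-+\n)|(?:On\\s.*wrote:\n)");

    /** Returns the text before the first match of quoteStart, or the whole body if there is no match. */
    static String stripQuotedText(Pattern quoteStart, String body) {
        Matcher m = quoteStart.matcher(body);
        return m.find() ? body.substring(0, m.start()).trim() : body.trim();
    }

    public static void main(String[] args) {
        String body = "This is the second email\n\n"
                    + "On Mon Jun 7, 2010 at 8:50 PM, Simon wrote:\n"
                    + "> This is the first email\n";
        System.out.println(stripQuotedText(SIMPLE_QUOTE_START, body));
    }
}
```

Cutting at the first match assumes the reply is written above the quoted text; replies written below or between quoted lines would need a different strategy.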

Autocratic answered 8/7, 2010 at 4:9 Comment(3)
What is the significance of {2,6} in the QUOTED_TEXT_BEGINNING? Can you give an example which it would match?Church
@Church it requires that there are at least 2 and no more than 6 lines from the set of: subject, to, from, bcc, cc, date. A minimum would look like to and subject; at a maximum, all 6. They seemed to occur in any order, so I kept the ordering loose, but wanted to bound it for quality and performance reasons.Autocratic
Thanks for your answer, it's very useful for us. But this pattern does not work for the following line: "On 16-09-2014, Indies Services Test wrote:". Please give a solution as soon as possible.Willwilla
D
7

Check out the google patent on this: http://www.google.com/patents/US7222299

In summary they hash portions of the text (presumably something like sentences) and then look for matches to hashes in the previous messages. Super fast and they probably use this as input to the threading algorithm too. What a great idea!
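
A rough sketch of that idea, not the patent's actual method: hash each normalized line of the earlier messages, then drop any line of the reply whose hash has been seen before. The class and method names are hypothetical. Splitting on lines rather than sentences is an assumption, and so is stripping leading '>' markers before hashing so that quoted lines match their originals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashQuoteFilter {

    /** Strips leading '>' markers and whitespace, then lowercases, so quoted copies match originals. */
    static String normalize(String line) {
        return line.replaceFirst("^[>\\s]+", "").trim().toLowerCase();
    }

    /** Collects hashes of all non-empty normalized lines in the earlier messages. */
    static Set<Integer> hashLines(List<String> previousMessages) {
        Set<Integer> seen = new HashSet<>();
        for (String message : previousMessages) {
            for (String line : message.split("\\R")) {
                String key = normalize(line);
                if (!key.isEmpty()) {
                    seen.add(key.hashCode());
                }
            }
        }
        return seen;
    }

    /** Keeps only the lines of the reply whose hash was not seen in the earlier messages. */
    static String newTextOnly(String reply, List<String> previousMessages) {
        Set<Integer> seen = hashLines(previousMessages);
        List<String> kept = new ArrayList<>();
        for (String line : reply.split("\\R")) {
            if (!seen.contains(normalize(line).hashCode())) {
                kept.add(line);
            }
        }
        return String.join("\n", kept).trim();
    }

    public static void main(String[] args) {
        String first = "This is the first email";
        String reply = "This is the second email\n\nTim said:\n> This is the first email";
        System.out.println(newTextOnly(reply, Arrays.asList(first)));
    }
}
```

Note that the attribution line ("Tim said:") survives, because it never appeared in an earlier message; it would have to be removed separately. `String.hashCode` can also collide, so a stronger hash or storing the normalized strings themselves would be safer in practice.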

Dactylogram answered 9/8, 2013 at 16:16 Comment(0)
S
2

When the previous emails are stored on disk, or are available somehow, you could check all mails sent by a specific recipient to determine which part is the response text.

You could also try to determine the quote character by checking the first character of the last lines. Normally the last lines always start with the same character.

When the last two lines start with different characters, you could try the first lines instead, because sometimes the answer is appended at the end of the text.

Once you have detected this character, you could delete the last lines that start with it, until an empty line or a line starting with another character is found.

NOT TESTED and is more like pseudo code

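    // Returns the lines with the trailing quoted block removed.
    static String[] removeTrailingQuote(String[] lines) {
        // Check the size of the array first, length > 2
        if (lines.length <= 2 || lines[lines.length - 1].isEmpty()) {
            return lines;
        }
        char startingChar = lines[lines.length - 1].charAt(0);
        int foundCounter = 0;
        // Count the run of lines before the last one that start with the same character
        for (int i = lines.length - 2; i >= 0; --i) {
            String line = lines[i];
            // Stop at an empty line or a line starting with another character
            if (line.isEmpty() || startingChar != line.charAt(0)) {
                break;
            }
            ++foundCounter;
        }

        final int YOUR_DECISION = 2; // You can decide
        if (foundCounter > YOUR_DECISION) {
            // Drop the last line plus the foundCounter lines counted above
            return java.util.Arrays.copyOf(lines, lines.length - 1 - foundCounter);
        }
        return lines;
    }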
Sheldon answered 5/3, 2010 at 8:29 Comment(2)
char startingChar = lines[lines.length - 1]; won't compile. Did you mean char startingChar = lines[lines.length - 1].charAt(0);?Basketball
Yes, sorry. As I said, this is more like pseudo code ;). I will update the answer.Sheldon
T
2

RegEx works fine except it matches text that starts from Subject and ignores everything that goes before "Subject"

Text
-------- Original Message -------- 
<TABLE border="0" cellpadding="0" cellspacing="0">
  <TBODY>
    <TR>
      <TH align="right" valign="baseline">
      // the matcher starts working from here
Taiga answered 11/4, 2011 at 15:45 Comment(0)
P
1

From observing Gmail's behavior in this regard, their strategy appears to be:

  1. write the complete 2nd mail.
  2. Append text like: On [timestamp], [first email sender name] <[first email sender email address]> wrote:
  3. Append the complete first email.
     a. If your email is in plain text, then prepend '>' to every line of the first email.
     b. If it's in HTML, then Gmail gives the quoted blockquote a left-side margin like:

    border-left: 1px solid #CCC; margin: 0px 0px 0px 0.8ex; padding-left: 1ex;

    and then appends the first email's text.

You can reverse engineer this when parsing emails from Gmail addresses. I haven't looked into other clients, but they probably behave similarly.
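
For the plain-text case, the steps above could be reversed roughly as follows. This is a sketch under assumptions: the attribution line is taken to start with "On " and end with "wrote:" on a single line, and the names `GmailReplyParser` and `extractNewText` are hypothetical. Attribution lines that wrap across two lines or use a localized phrase are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GmailReplyParser {

    /** Attribution line such as "On <timestamp>, Name <address> wrote:" (assumed to be on one line). */
    private static final Pattern ATTRIBUTION =
        Pattern.compile("(?m)^On\\s.+\\swrote:\\s*$");

    /** Returns the text above the attribution line, with any '>'-quoted lines dropped. */
    static String extractNewText(String plainTextBody) {
        Matcher m = ATTRIBUTION.matcher(plainTextBody);
        String candidate = m.find() ? plainTextBody.substring(0, m.start()) : plainTextBody;

        StringBuilder out = new StringBuilder();
        for (String line : candidate.split("\\R")) {
            if (!line.startsWith(">")) {
                out.append(line).append('\n');
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String body = "This is the second email\n\n"
                    + "On Mon, Jun 7, 2010 at 8:50 PM, Tim <tim@example.com> wrote:\n"
                    + "> This is the first email\n";
        System.out.println(extractNewText(body));
    }
}
```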

Parmer answered 5/3, 2010 at 8:33 Comment(0)
D
1

You'll get it almost right with a couple of lines of code:

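// emailLines holds the email body split into lines
StringBuilder newMessage = new StringBuilder();
for (String line : emailLines) {
  // Keep only lines that are not quoted with a leading '>'
  if (!line.startsWith(">")) {
    newMessage.append(line).append('\n');
  }
}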

If necessary, you could add other regex checks for e-mail clients which leave different quoted text signatures.
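
As a sketch of what such extra checks might look like, the per-line test can be driven by a list of patterns. The patterns below are illustrative assumptions rather than an exhaustive set, and `QuotedLineFilter`, `isQuoted`, and `filter` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class QuotedLineFilter {

    /** Per-line patterns treated as quoted text (illustrative, not exhaustive). */
    private static final List<Pattern> QUOTE_PATTERNS = Arrays.asList(
        Pattern.compile("^\\s*>.*"),                               // '>'-prefixed quote lines
        Pattern.compile("^On\\s.*wrote:\\s*$"),                    // "On ... wrote:" attribution lines
        Pattern.compile("(?i)^-+\\s*Original Message\\s*-+\\s*$")  // "----Original Message----" separators
    );

    /** True if the whole line matches any of the quote patterns. */
    static boolean isQuoted(String line) {
        for (Pattern p : QUOTE_PATTERNS) {
            if (p.matcher(line).matches()) {
                return true;
            }
        }
        return false;
    }

    /** Returns the body with all lines that look quoted removed. */
    static String filter(String body) {
        StringBuilder sb = new StringBuilder();
        for (String line : body.split("\\R")) {
            if (!isQuoted(line)) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(filter("Thanks!\n> old text\nOn Mon wrote:\nBye"));
    }
}
```

A purely line-based filter like this cannot remove unprefixed quoted text that follows a separator line; for that, the text would have to be cut at the separator instead.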

Drape answered 7/3, 2010 at 1:43 Comment(1)
I like the simplistic approach.Corticosterone
