Extracting mail's content

I need to create an app that extracts the VAT numbers our clients send us for verification. Their e-mails contain nothing else. The goal is to build extended statistics.

What I need is the mail's body without any headers preceding the content I care about, which is the VAT number. As simple as that.

This is my script that creates the list of 30 recent e-mails:

<?
if (!function_exists('imap_open')) { die('No function'); }

if ($mbox = imap_open(<confidential>)) {
    $output = "";
    $messageCount = imap_num_msg($mbox);
    $x = 1;     
    for ($i = 0; $i < 30; $i++) {
        $message_id = ($messageCount - $i);
        $fetch_message = imap_header($mbox, $message_id);
        $mail_content = quoted_printable_decode(imap_fetchbody($mbox,$message_id, 1));
        // iconv() returns the converted string, so the result has to be assigned
        $mail_content = iconv(mb_detect_encoding($mail_content, mb_detect_order(), true), "UTF-8", $mail_content);

        $output .= "<tr>
        <td>".$x.".</td>
        <td>
            ".$fetch_message->from[0]->mailbox."@".$fetch_message->from[0]->host."
        </td>
        <td>
            ".$fetch_message->date."
        </td>
        <td>
            ".$fetch_message->subject."
        </td>
        <td>
            <textarea cols=\"40\">".$mail_content."</textarea>
        </td>
        </tr>";
        $x++;
    }
    $smarty->assign("enquiries", $output);
    $smarty->display("module_mail");
    imap_close($mbox);
} else {
    print_r(imap_errors());
}
?>

I've worked with imap_fetchbody, imap_header and so on to retrieve the desired content, but it turns out that most e-mails have something else (like headers) before the content, e.g.

--=-Dbl2eWTUl0Km+Tj46Ww1
Content-Type: text/plain;

------=_NextPart_001_003A_01D14F7A.F25AB3D0
Content-Type: text/plain;

--=-ucRIRGamiKb0Ot1/AkNc
Content-Type: text/plain;

I need to get rid of everything that comes before the VAT number in the mail's message, but I don't know how. Some emails have these headers, some don't. And since we work with clients from all over Europe, it really confuses me and leaves me feeling powerless.

Another problem is that some clients simply copy and paste VAT numbers from various websites, which means the numbers often arrive with their original styling (bold, background, changed colour, et cetera). That might be the reason for my PS below.

I would appreciate any help that leads me to a solution.

Thank you in advance.

PS. Just for the record: with imap_fetchbody($mbox,$message_id, 1) I need to use 1 to get the whole content. Changing 1 to anything else results in NO email content being displayed at all. Literally.

Chiastic answered 15/1, 2016 at 11:32 Comment(3)
You could probably use a regex: safaribooksonline.com/library/view/regular-expressions-cookbook/… But this would still fail if the user copies from a website with HTML inside the number itself, such as &nbsp; instead of a space, or span tags etc. Is there a reason you can't just create a simple form on your company website instead, so you can control the format of the data? - Erythrocyte
What a pain in the **se! Could you do some form of halfway measure, such as creating a form that, instead of submitting, creates and clicks a mailto: link? That way the user's email client opens with preformatted content. - Erythrocyte
Failing that, I would at least create a page on your site that has an input box, takes and validates the number via the above regex, and then spits out a preformatted response into a text area, such as "our VAT number: #######", with instructions to copy and paste the result. With that in place, I would try to capture the numbers from email using the above regex and, if that failed, send an autoreply asking the customer to visit the above-mentioned page. - Erythrocyte

The parts of the email that you consider "noise" are simply part of the email's format (MIME).
In a way, it is as if you were reading the HTML source of a web page.

All those bits are boundaries. They work like tags in HTML: they open a section and they close it.

So in your case:

Content-Type: multipart/alternative; boundary="=-Dbl2eWTUl0Km+Tj46Ww1" // defines the type of email structure and the boundary

--=-Dbl2eWTUl0Km+Tj46Ww1    // starts a section
Content-Type: text/plain;   // defines the content type of the section
// your VAT number is presumably here

--=-Dbl2eWTUl0Km+Tj46Ww1--  // marks the end of the last section

Possible solutions

You have at least two options:
write a custom parser yourself, or use a PECL extension called Mailparse.

Writing a parser manually:

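// explode() takes the delimiter first, then the string
$mail_lines = explode("\n", $mail_content);

foreach ($mail_lines as $key => $line) {
     // skip most of the headers (5 is a rough guess, adjust it to your mails)
     if ($key < 5) {
         continue;
     }

     // skip boundary lines; strpos() returns 0 for a match at the start
     // of the line, so the result must be compared with ===
     if (strpos($line, "--") === 0) {
        continue;
     }

     // skip Content-* header lines
     if (strpos($line, "Content") === 0) {
        continue;
     }

     // skip empty lines
     if (trim($line) === "") {
        continue;
     }

     ////////////////////////////////////////////////////
     // here you have to insert the logic for the parser
     // and extend the guard clauses
     ////////////////////////////////////////////////////
}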
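
If you go the manual route, splitting on the boundary itself is usually more reliable than skipping lines by prefix. A minimal sketch, assuming PHP 7+, a single (non-nested) multipart level, and that you already know the boundary string from the top-level Content-Type header. splitMultipart() is a hypothetical helper written for this example, not part of any library:

```php
<?php
/**
 * Split a raw multipart body into its parts (hypothetical helper).
 * Each returned part is an array with 'headers' and 'body' keys.
 */
function splitMultipart(string $raw, string $boundary): array
{
    // Normalise line endings so the blank-line search works for CRLF and LF.
    $raw = str_replace("\r\n", "\n", $raw);
    $chunks = explode('--' . $boundary, $raw);
    array_shift($chunks); // drop the preamble before the first delimiter

    $parts = [];
    foreach ($chunks as $chunk) {
        // The closing delimiter is "--boundary--", so its chunk starts with "--".
        if (strpos($chunk, '--') === 0) {
            break;
        }
        // The part headers end at the first blank line.
        $split = strpos($chunk, "\n\n");
        if ($split === false) {
            continue; // malformed part without a body
        }
        $parts[] = [
            'headers' => trim(substr($chunk, 0, $split)),
            'body'    => trim(substr($chunk, $split + 2)),
        ];
    }
    return $parts;
}

// Example input modelled on the boundary from the question (made-up VAT number).
$raw = "--=-Dbl2eWTUl0Km+Tj46Ww1\n"
     . "Content-Type: text/plain;\n"
     . "\n"
     . "PL1234567890\n"
     . "--=-Dbl2eWTUl0Km+Tj46Ww1\n"
     . "Content-Type: text/html;\n"
     . "\n"
     . "<b>PL1234567890</b>\n"
     . "--=-Dbl2eWTUl0Km+Tj46Ww1--\n";

$parts = splitMultipart($raw, '=-Dbl2eWTUl0Km+Tj46Ww1');
echo $parts[0]['body'], "\n"; // prints "PL1234567890"
```

The bodies are returned as they were transferred, so a part that is base64 or quoted-printable encoded still has to be decoded afterwards.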

Mailparse:

Install Mailparse: sudo pecl install mailparse

Extract the VAT number:

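// $raw_message must be the complete raw message (headers and body), e.g.
// imap_fetchheader($mbox, $message_id) . imap_body($mbox, $message_id),
// not the output of imap_fetchbody() for a single part.
$mail = mailparse_msg_create();
mailparse_msg_parse($mail, $raw_message);
$struct = mailparse_msg_get_structure($mail); // part ids such as "1", "1.1"

foreach ($struct as $st) {
    $section = mailparse_msg_get_part($mail, $st);
    $info = mailparse_msg_get_part_data($section);
    // only the text/plain part is interesting here
    if ($info['content-type'] !== 'text/plain') {
        continue;
    }
    // cut the part's body out of the raw message (it is still transfer-encoded)
    $length = $info['ending-pos-body'] - $info['starting-pos-body'];
    print_r(substr($raw_message, $info['starting-pos-body'], $length));
}
mailparse_msg_free($mail);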
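
Once you have the text of the part (plain or even HTML), the VAT number itself can be pulled out with a regular expression. A rough sketch, assuming PHP 7.1+ and numbers made of an uppercase two-letter country code followed by 8 to 12 digits. That fits e.g. German or Polish numbers, but not national formats that contain letters, so treat the pattern as a starting point. extractVatNumber() is a hypothetical name:

```php
<?php
/**
 * Find the first VAT-like number in a piece of text (hypothetical helper).
 * Returns the number without separators, or null if nothing matches.
 */
function extractVatNumber(string $text): ?string
{
    // Drop markup pasted from websites and decode entities such as &nbsp;.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Treat non-breaking spaces like ordinary spaces.
    $text = str_replace("\u{00A0}", ' ', $text);

    // Assumed format: country code, then 8 to 12 digits that may be
    // separated by single spaces, dots or dashes.
    $pattern = '/(?<![A-Za-z0-9])([A-Z]{2})[ .-]?(\d(?:[ .-]?\d){7,11})(?![A-Za-z0-9])/';
    if (!preg_match($pattern, $text, $m)) {
        return null;
    }
    // Keep the country code and strip the separators from the digits.
    return $m[1] . preg_replace('/\D/', '', $m[2]);
}

// Made-up number with pasted styling and a non-breaking space.
$vat = extractVatNumber('Our VAT number: <b>PL&nbsp;123-456-78-90</b>');
echo $vat, "\n"; // prints "PL1234567890"
```

If nothing matches, the function returns null, which is a convenient hook for handling the message manually or sending the autoreply suggested in the comments.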
Gulosity answered 21/1, 2016 at 6:41 Comment(0)

You have to use imap_fetchstructure() to find the plain text part of the mail.

The following code can give you the section number of the text/plain subpart (for instance "1.1")

 function getTextPart($struct) {
    if ($struct->type==0) return "1";
    if ($struct->type==1) {
            $num=1;
            foreach ($struct->parts as $part) {
                    // compare with ==; a single = would assign and always be true
                    if (($part->type==0)&&(strtoupper($part->subtype)=="PLAIN")) {
                            return $num;
                    } else if ($part->type==1) {
                            $found=getTextPart($part);
                            if ($found) return "$num.$found";
                    }
                    $num++;
            }
    }
    return NULL;
 }

Example of use:

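if ($imap) {
    $messageCount = imap_num_msg($imap);
    // walk the 30 most recent messages (sequence numbers run from 1 to $messageCount)
    for ($i = $messageCount; $i > max(0, $messageCount - 30); $i--) {
            $struct=imap_fetchstructure($imap, $i);
            $part=getTextPart($struct);
            if ($part === NULL) continue; // no text/plain part in this message
            $body=imap_fetchbody($imap, $i, $part);
            print_r($body);
    }
 }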
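
Note that imap_fetchbody() returns the part as it was transferred, so it may still be base64 or quoted-printable encoded. The encoding is reported in the encoding field of the corresponding part object from imap_fetchstructure() (3 means BASE64, 4 means QUOTED-PRINTABLE). A minimal sketch of the decoding step, assuming PHP 7+ and that you look up that encoding value yourself. decodePartBody() is a hypothetical helper:

```php
<?php
/**
 * Decode a fetched body part according to its transfer encoding
 * (hypothetical helper; $encoding is the integer from imap_fetchstructure()).
 */
function decodePartBody(string $body, int $encoding): string
{
    switch ($encoding) {
        case 3: // BASE64
            $decoded = base64_decode($body);
            return $decoded === false ? $body : $decoded;
        case 4: // QUOTED-PRINTABLE
            return quoted_printable_decode($body);
        default: // 7BIT, 8BIT, BINARY, OTHER: nothing to decode
            return $body;
    }
}

// Made-up sample bodies.
echo decodePartBody(base64_encode('PL1234567890'), 3), "\n"; // prints "PL1234567890"
echo decodePartBody('VAT=3A PL1234567890', 4), "\n";          // prints "VAT: PL1234567890"
```

This only undoes the transfer encoding. Converting the character set to UTF-8 is a separate step.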
Antihistamine answered 23/1, 2016 at 19:19 Comment(0)
