Parsing Email Body with 7BIT Content-Transfer-Encoding - PHP
I've been implementing some PHP/IMAP-based email handling functionality lately, and have most everything working great, except for message body decoding (in some circumstances).

I think that, by now, I've half-memorized RFC 2822 (the 'Internet Message Format' document guidelines), read through email-handling code for half a dozen open source CMSes, and read a bajillion forum posts, blog posts, etc. dealing with handling email in PHP.

I've also forked and completely rewritten a PHP class, Imap, and it handles email respectably well. It has some helpful methods to detect autoresponders (out of office, old addresses, etc.), decode base64 and 8bit messages, and so on.

However, the one thing I simply can't get to work reliably (or, sometimes, at all) is when a message comes in with Content-Transfer-Encoding: 7bit.

It seems that different email clients/services interpret 7BIT to mean different things. I've gotten some emails that are supposedly 7BIT that are actually Base64-encoded. I've gotten some that are actually quoted-printable-encoded. And some that are not encoded in any way whatsoever. And some that are HTML, but aren't indicated as being HTML, and they're also listed as 7BIT...

Here are a few examples (snips) of message bodies received with 7Bit encodings:

1:

A random message=20

Sent from my iPhone

2:

PGh0bWwgeG1sbnM6dj0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTp2bWwi
IHhtbG5zOm89InVybjpzY2hlbWFzLW1pY3Jvc29mdC1jb206b2ZmaWNlOm9m

3:

tangerine apricot pepper.=0A=C2=A0=0ALet me know if you have any availabili=
ty over the next month or so. =0A=C2=A0=0AThank you,=0ANames Withheld=0A908=
-319-5916=0A=C2=A0=0A=C2=A0=0A=C2=A0=0A=0A=0A______________________________=
__=0AFrom: Names Witheld =0ATo: Names Withheld=

These are all sent with '7Bit' encodings (well, at least according to PHP/imap_*), but they're obviously in need of more decoding before I can pass them along as plaintext. Is there any way to reliably convert all messages with supposedly-7Bit encodings to plaintext?

Unscientific answered 1/10, 2012 at 22:48 Comment(3)
If everyone just sent plaintext email, and used a nice, simple client like Mail for iOS, or mail on the command line, the world would be a better place :) - Unscientific
Those are broken messages. 7-bit means plain ASCII: all characters in the message should be plain US-ASCII, with no additional encoding. Something there is lying to you. You can certainly try to do heuristic detection. - Abhenry
Also, you should pull the original MIME message down with a client like Thunderbird or something and look at it to make sure something in PHP isn't lying to you. - Abhenry
U
11

After spending a bit more time, I decided to just write up some heuristic detection, as Max suggested in the comments on my original question.

I've built a more robust decode7Bit() method in Imap.php. It first decodes the message if it looks like it is base64-encoded, and then goes through a bunch of common encoded characters (like =A0) and replaces them with their UTF-8 equivalents:

/**
 * Decodes 7-Bit text.
 *
 * PHP seems to think that most emails are 7BIT-encoded, therefore this
 * decoding method assumes that text passed through may actually be base64-
 * encoded, quoted-printable encoded, or just plain text. Instead of passing
 * the email directly through a particular decoding function, this method
 * runs through a bunch of common encoding schemes to try to decode everything
 * and simply end up with something *resembling* plain text.
 *
 * Results are not guaranteed, but it's pretty good at what it does.
 *
 * @param $text (string)
 *   7-Bit text to convert.
 *
 * @return (string)
 *   Decoded text.
 */
public function decode7Bit($text) {
  // If there are no spaces on the first line, assume that the body is
  // actually base64-encoded. Decode in strict mode and keep the result only
  // if it succeeds, so text with non-base64 characters is left untouched.
  $lines = explode("\r\n", $text);
  if (strpos($lines[0], ' ') === false) {
    $decoded = base64_decode($text, true);
    if ($decoded !== false) {
      $text = $decoded;
    }
  }

  // Manually convert common encoded characters into their UTF-8 equivalents.
  // '=C2=A0' must come before '=A0': str_replace() would otherwise rewrite
  // the '=A0' inside it and leave a stray '=C2' behind.
  $characters = array(
    '=20' => ' ', // space.
    '=E2=80=99' => "'", // single quote.
    '=0A' => "\r\n", // line break.
    '=C2=A0' => ' ', // non-breaking space.
    '=A0' => ' ', // non-breaking space.
    "=\r\n" => '', // joined line.
    '=E2=80=A6' => '…', // ellipsis.
    '=E2=80=A2' => '•', // bullet.
  );

  // Loop through the encoded characters and replace any that are found.
  foreach ($characters as $key => $value) {
    $text = str_replace($key, $value, $text);
  }

  return $text;
}

This was taken from version 1.0-beta2 of the Imap class for PHP that I have on GitHub.

If you have any ideas for making this more efficient, let me know. I originally tried running everything through quoted_printable_decode(), but sometimes PHP would throw exceptions that were vague and unhelpful, so I gave up on that approach.
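
For comparison, the same heuristic idea can be sketched with PHP's built-in decoders instead of a replacement table. This is only a rough sketch under stated assumptions: guess_and_decode() is a hypothetical name that is not part of the Imap class, and the 16-character minimum for the base64 check is an arbitrary threshold chosen for illustration.

```php
<?php
// Hypothetical helper (not part of the Imap class): guess the real transfer
// encoding of a body that was labelled 7BIT, then decode it.
function guess_and_decode(string $text): string {
    $trimmed = trim($text);

    // Looks like base64: only base64 alphabet characters and line breaks,
    // optional trailing padding, and long enough (assumed threshold of 16)
    // that a short plain word is unlikely to match.
    if (strlen($trimmed) >= 16
        && preg_match('/^[A-Za-z0-9+\/\r\n]+={0,2}$/', $trimmed)) {
        // Strict mode returns false instead of silently skipping bad input.
        $decoded = base64_decode($trimmed, true);
        if ($decoded !== false) {
            return $decoded;
        }
    }

    // Looks like quoted-printable: an "=XX" hex escape or a soft line break.
    if (preg_match('/=[0-9A-F]{2}|=\r?\n/', $text)) {
        return quoted_printable_decode($text);
    }

    // Otherwise treat the body as plain text and leave it alone.
    return $text;
}
```

Like any heuristic, this can guess wrong: a long single word made only of letters and digits would pass the base64 check, and plain text that happens to contain something like "=AB" would be treated as quoted-printable.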

Unscientific answered 3/10, 2012 at 0:48 Comment(2)
Thank you very much for posting this. Good explanation and well commented. I appreciate that. - Captor
I love it-- This is a very simple str_replace() solution to this problem in PHP, thank you. - Reginiaregiomontanus
V
5

I know this is an old question, but I am running into this issue now, and it seems that PHP has a solution.

The imap_fetchstructure() function will give you the type of encoding:

0   7BIT
1   8BIT
2   BINARY
3   BASE64
4   QUOTED-PRINTABLE
5   OTHER

From there you should be able to create a function like this to decode the message:

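// Decodes a message body according to the encoding number reported by
// imap_fetchstructure(). Despite its name, this function decodes.
function _encodeMessage($msg, $type) {
    if ($type == 0) {
        // 7BIT: no transfer encoding; normalize the charset to UTF-8.
        return mb_convert_encoding($msg, "UTF-8", "auto");
    } elseif ($type == 1 || $type == 2) {
        // 8BIT and BINARY: there is no transfer encoding to undo.
        // imap_8bit() and imap_binary() *encode* (to quoted-printable and
        // base64 respectively), so they are the wrong tools here.
        return $msg;
    } elseif ($type == 3) {
        // BASE64.
        return imap_base64($msg);
    } elseif ($type == 4) {
        // QUOTED-PRINTABLE; quoted_printable_decode($msg) also works.
        return imap_qprint($msg);
    } else {
        // OTHER or unknown: return the body untouched.
        return $msg;
    }
}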

You can call this function like so:

$struct = imap_fetchstructure($conn, $messageNumber, 0);
$message = imap_fetchbody($conn, $messageNumber, 1);
$message = _encodeMessage($message, $struct->encoding);
echo $message;

I hope this helps someone :)
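
When debugging which encoding PHP reports, it can help to print a name rather than a bare number. The sketch below mirrors the table above; transfer_encoding_name() is a hypothetical helper, and the comments note the ext/imap constant that corresponds to each value.

```php
<?php
// Hypothetical helper: map the numeric "encoding" field of an
// imap_fetchstructure() result to a readable name.
function transfer_encoding_name(int $encoding): string {
    $names = [
        0 => '7BIT',             // ENC7BIT
        1 => '8BIT',             // ENC8BIT
        2 => 'BINARY',           // ENCBINARY
        3 => 'BASE64',           // ENCBASE64
        4 => 'QUOTED-PRINTABLE', // ENCQUOTEDPRINTABLE
        5 => 'OTHER',            // ENCOTHER
    ];
    // Anything outside the table is reported as UNKNOWN.
    return $names[$encoding] ?? 'UNKNOWN';
}
```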

Vaporetto answered 16/3, 2015 at 15:58 Comment(1)
Note that this is the technique I'm using in the Imap library mentioned in my answer; however, PHP almost always says a message is 7BIT encoded, even if it's not, so it's often necessary to do the manual decoding mentioned in my answer :( - Unscientific
F
0

With $structure taken from imap_fetchstructure(), read the encoding from the individual part, not from the top level: use $encoding = $structure->parts[ $p ]->encoding rather than $encoding = $structure->encoding.

I think I had the same problem, and now it's solved (7bit didn't convert to UTF-8; I kept getting ASCII). I thought I had 7bit, but after switching to the per-part form I got $encoding = 4, not $encoding = 0, which means that I have to run imap_qprint($body) and then mb_convert_encoding($body, 'UTF-8', $charset) to get what I wanted.

In any case, check the encoding number! (In my case it should have been 4, not 0.)
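
A minimal sketch of that per-part lookup, assuming $p is a zero-based index into the parts array (the answer does not say where $p comes from). part_encoding_and_charset() is a hypothetical helper, and the US-ASCII fallback is an assumption for parts that carry no charset parameter. It only reads fields that imap_fetchstructure() results provide (parts, encoding, ifparameters, parameters), so it can be tried on a hand-built object without a mailbox.

```php
<?php
// Hypothetical helper: return [encoding, charset] for part index $p of an
// object shaped like the result of imap_fetchstructure().
function part_encoding_and_charset(object $structure, int $p): array {
    // Single-part messages have no parts array; fall back to the top level.
    $part = isset($structure->parts[$p]) ? $structure->parts[$p] : $structure;

    // Assumed default when the part carries no charset parameter.
    $charset = 'US-ASCII';
    if (!empty($part->ifparameters)) {
        foreach ($part->parameters as $param) {
            if (strtolower($param->attribute) === 'charset') {
                $charset = $param->value;
            }
        }
    }

    return [(int) $part->encoding, $charset];
}
```

Note that the parts array is zero-based while imap_fetchbody() section numbers start at 1, so parts[0] would pair with section "1". Nested multipart messages would need the same lookup applied recursively, which this sketch does not attempt.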

Ferdinand answered 14/6, 2017 at 20:57 Comment(1)
This is clearly incomplete code... what is $p supposed to represent? $encoding = $structure->parts[ $p ]->encoding will return $p as undefined. - Ineradicable
