Decode the utf8 to ISO-8859-1 mail subject to text in .procmailrc file

I set out to write a simple procmail recipe that would forward the mail if it found the text "ABC Store: New Order" in the subject.

:0
* ^(To|From).*[email protected]
* ^Subject:.*ABC Store: New Order*
{

Unfortunately the subject field in the mail message coming from the mail server was in MIME encoded-word syntax.

Subject: =?UTF-8?B?QUJDIFN0b3JlOiBOZXcgT3JkZXI=?=

The above subject is in the UTF-8/ISO-8859-1 charset, so I was wondering whether there are any mechanisms, scripts, or utilities to parse this and convert it to a plain string, so that I could apply my procmail filter.

Gabrielgabriela answered 18/4, 2015 at 8:51 Comment(2)
What you are looking at is an RFC 2047-encoded header. As the charset part says, it is in UTF-8, base64-encoded. There is no ISO-8859-1 here (that's a different encoding; it can't be in ISO-8859-1, aka Latin-1, if it's in UTF-8). - Avril
In the general case, the repertoire of UTF-8 is much larger than the repertoire of ISO-8859-1, so you will not always be able to translate UTF-8 to ISO-8859-1. If you only care about unwrapping the RFC 2047 encoding and recovering the UTF-8 text, that's always possible (and perhaps a better thing to do). - Avril
S
20

You can use a Perl one-liner to decode Subject: before assigning it to a procmail variable.

# Store "may be encoded" Subject: into $SUBJECT after conversion to ISO-8859-1
:0 h
* ^Subject:.*=\?
SUBJECT=| formail -cXSubject: | perl -MEncode=from_to -pe 'from_to $_, "MIME-Header", "iso-8859-1"'

# Store all remaining cases of Subject: into $SUBJECT
:0 hE
SUBJECT=| formail -cXSubject:

# trigger recipe based also on $SUBJECT content
:0
* ^(To|From).*[email protected]
* SUBJECT ?? ^Subject:.*ABC Store: New Order
{
....
}

Comment (2020-03-07): It may be better to convert to UTF-8 charset instead of ISO-8859-*.

Samaria answered 18/4, 2015 at 10:25 Comment(5)
Nice. I had no idea that MIME-Header was an available encoding. - Amadoamador
Though the r* in the regex New Order* is kind of silly, and arguably wrong. - Avril
Why is the command for the "remaining cases" like this: SUBJECT=| formail -cXSubject without a colon, unlike the command for the first case: SUBJECT=| formail -cXSubject: |? - Heiney
I have fixed the example to follow the syntax used in the man formail examples. A basic test of formail -cXSubject seems to produce correct results too. - Samaria
The argument to formail -x is just a string prefix; without the colon you will extract every header which starts with Subject; of course, in practice, unless you are running a fuzz tester or something, only Subject: will actually match. - Avril
A
1

You should use MIME::EncWords.

Like this

use strict;
use warnings;
use 5.010;

use MIME::EncWords 'decode_mimewords';

my $subject = '=?UTF-8?B?QUJDIFN0b3JlOiBOZXcgT3JkZXI=?=';
my $decoded = decode_mimewords($subject);
say $decoded;

output

ABC Store: New Order
Amadoamador answered 18/4, 2015 at 15:4 Comment(1)
This only unwraps the RFC2047 encoding; the result is still in UTF-8. Because the OP's regex doesn't contain any characters where the encoding differs between ISO-8859-1 and UTF-8, it doesn't seem to matter; but if you want to match text which is not pure ASCII, the encoding does matter, and you should know which encoding you are using. (Like I argue in another comment, I would actually suggest to keep everything in UTF-8; but that is perhaps not what the OP is requesting. Though the question is unclear on this part.) - Avril
