C#: Class for decoding Quoted-Printable encoding?

15

29

Is there an existing class in C# that can convert Quoted-Printable encoding to String? Click on the above link to get more information on the encoding.

The following is quoted from the above link for your convenience.

Any 8-bit byte value may be encoded with 3 characters, an "=" followed by two hexadecimal digits (0–9 or A–F) representing the byte's numeric value. For example, a US-ASCII form feed character (decimal value 12) can be represented by "=0C", and a US-ASCII equal sign (decimal value 61) is represented by "=3D". All characters except printable ASCII characters or end of line characters must be encoded in this fashion.

All printable ASCII characters (decimal values between 33 and 126) may be represented by themselves, except "=" (decimal 61).

ASCII tab and space characters, decimal values 9 and 32, may be represented by themselves, except if these characters appear at the end of a line. If one of these characters appears at the end of a line it must be encoded as "=09" (tab) or "=20" (space).

If the data being encoded contains meaningful line breaks, they must be encoded as an ASCII CR LF sequence, not as their original byte values. Conversely if byte values 13 and 10 have meanings other than end of line then they must be encoded as =0D and =0A.

Lines of quoted-printable encoded data must not be longer than 76 characters. To satisfy this requirement without altering the encoded text, soft line breaks may be added as desired. A soft line break consists of an "=" at the end of an encoded line, and does not cause a line break in the decoded text.

Marcionism answered 9/2, 2010 at 3:27 Comment(1)
I just posted a simple answer to UTF8 decoding here: #37540744Survivor
B
22

There is functionality in the framework libraries to do this, but it doesn't appear to be cleanly exposed. The implementation is in the internal class System.Net.Mime.QuotedPrintableStream. This class defines a method called DecodeBytes which does what you want. The method appears to be used by only one method which is used to decode MIME headers. This method is also internal, but is called fairly directly in a couple of places, e.g., the Attachment.Name setter. A demonstration:

using System;
using System.Net.Mail;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Attachment attachment = Attachment.CreateAttachmentFromString("", "=?iso-8859-1?Q?=A1Hola,_se=F1or!?=");
            Console.WriteLine(attachment.Name);
        }
    }
}

Produces the output:

¡Hola,_señor!

You may have to do some testing to ensure carriage returns, etc are treated correctly although in a quick test I did they seem to be. However, it may not be wise to rely on this functionality unless your use-case is close enough to decoding of a MIME header string that you don't think it will be broken by any changes made to the library. You might be better off writing your own quoted-printable decoder.

Bidden answered 10/2, 2010 at 10:55 Comment(4)
This doesn't convert =3D etc for me and the codeproject version fails on decode.Roadblock
There is a bug in .NET Framework 2: it cannot handle an underscore in the encoded string, where the underscore represents a space. This was fixed in version 4. This approach has another small issue: if the character set information is not at the beginning of the string, it cannot decode it, for example an encoded string like this **ID: 12345 Arpège **Steffi
The Attachment class can also encode in this format. Create an object with empty strings, set the Name, then use ToString() to get the encoded name. Unfortunately, the output also has the MIME type, and charset, e.g. "text/plain; name=\"=?utf-8?B?SWN ... nUg?=\"; charset=us-ascii" which would need to be stripped off. Also, it follows RFC 2047 and breaks the input into pieces such that the encoded words are not longer than 75 characters.Liaotung
Doesn't take into account the 76 character continuation character.Jitter
D
19

I extended the solution of Martin Murphy and I hope it will work in every case.

private static string DecodeQuotedPrintables(string input, string charSet)
{           
    if (string.IsNullOrEmpty(charSet))
    {
        var charSetOccurences = new Regex(@"=\?.*\?Q\?", RegexOptions.IgnoreCase);
        var charSetMatches = charSetOccurences.Matches(input);
        foreach (Match match in charSetMatches)
        {
            charSet = match.Groups[0].Value.Replace("=?", "").Replace("?Q?", "");
            input = input.Replace(match.Groups[0].Value, "").Replace("?=", "");
        }
    }

    Encoding enc = new ASCIIEncoding();
    if (!string.IsNullOrEmpty(charSet))
    {
        try
        {
            enc = Encoding.GetEncoding(charSet);
        }
        catch
        {
            enc = new ASCIIEncoding();
        }
    }

    //decode iso-8859-[0-9]
    var occurences = new Regex(@"=[0-9A-Z]{2}", RegexOptions.Multiline);
    var matches = occurences.Matches(input);
    foreach (Match match in matches)
    {
        try
        {
            byte[] b = new byte[] { byte.Parse(match.Groups[0].Value.Substring(1), System.Globalization.NumberStyles.AllowHexSpecifier) };
            char[] hexChar = enc.GetChars(b);
            input = input.Replace(match.Groups[0].Value, hexChar[0].ToString());
        }
        catch { }
    }

    //decode base64String (utf-8?B?)
    occurences = new Regex(@"\?utf-8\?B\?.*\?", RegexOptions.IgnoreCase);
    matches = occurences.Matches(input);
    foreach (Match match in matches)
    {
        byte[] b = Convert.FromBase64String(match.Groups[0].Value.Replace("?utf-8?B?", "").Replace("?UTF-8?B?", "").Replace("?", ""));
        string temp = Encoding.UTF8.GetString(b);
        input = input.Replace(match.Groups[0].Value, temp);
    }

    input = input.Replace("=\r\n", "");
    return input;
}
Dahle answered 29/11, 2011 at 8:15 Comment(7)
Thank you so much... It solved my problem; this is what I had been looking for for many hours.Alula
Doesn't support the = and carriage return in this line "If you believe that truth=3Dbeauty, then surely =\nmathematics is the most beautiful branch of philosophy."Mulligatawny
This worked for my limited requirements, thanks. BTW, why the empty catch block?Halm
This method must be refactored. I added a working copy, but it is not the final version. The pattern "=[0-9A-Z]{2}" cannot always be decoded correctly, for example when the input is "3+7=10".Dahle
Hey, thanks for your code! I'm having trouble now and I'd appreciate if you could take a look hereCortex
This code doesn't work for UTF-8 characters because it tries to decode each byte individually rather than as a set. See Demented Devil's comment (posted as a 'solution').Rung
This doesn't take into account the continuation character.Jitter
R
9

I wrote this up real quick.

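    // Requires: using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
    public static string DecodeQuotedPrintables(string input)
    {
        // Remove soft line breaks first so that a decoded "=" is never mistaken for one.
        input = input.Replace("=\r\n", "");

        // Hex digits run from 0-9 and A-F.
        var occurences = new Regex(@"=[0-9A-F]{2}", RegexOptions.Multiline);
        var matches = occurences.Matches(input);
        // MatchCollection is not an IEnumerable<string>, so project each match to its text.
        var uniqueMatches = new HashSet<string>(matches.Cast<Match>().Select(m => m.Value));
        foreach (string match in uniqueMatches)
        {
            // Each escape becomes one character, so this only suits single-byte data.
            char hexChar = (char)Convert.ToInt32(match.Substring(1), 16);
            input = input.Replace(match, hexChar.ToString());
        }
        return input;
    }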
Roadblock answered 10/5, 2011 at 20:10 Comment(2)
Needed to change the uniqueMatches line to handle MatchCollection properly but otherwise this worked. Thanks!Outspoken
QuotedPrintables also have 76 character limit and uses line continuation markers (=). This method doesn't take this into account.Jitter
H
7

I was looking for a dynamic solution and spent 2 days trying different solutions. This solution will support Japanese characters and other standard character sets

private static string Decode(string input, string bodycharset) {
        var i = 0;
        var output = new List<byte>();
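        while (i < input.Length) {
            if (input[i] == '=' && i + 2 < input.Length && input[i + 1] == '\r' && input[i + 2] == '\n') {
                // Soft line break: skip "=\r\n"
                i += 3;
            } else if (input[i] == '=' && i + 2 < input.Length
                       && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2])) {
                // "=XX": convert the two hex digits to one byte
                output.Add(Convert.ToByte(input.Substring(i + 1, 2), 16));
                i += 3;
            } else {
                // Literal character (also covers a stray "=" that is not followed by two hex digits)
                output.Add((byte)input[i]);
                i++;
            }
        }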


        if (String.IsNullOrEmpty(bodycharset))
            return Encoding.UTF8.GetString(output.ToArray());
        else {
            if (String.Compare(bodycharset, "ISO-2022-JP", true) == 0)
                return Encoding.GetEncoding("Shift_JIS").GetString(output.ToArray());
            else
                return Encoding.GetEncoding(bodycharset).GetString(output.ToArray());
        }

    }

Then you can call the function with

Decode("=E3=82=AB=E3=82=B9=E3", "utf-8")

This was originally found here

Hematite answered 27/7, 2016 at 19:47 Comment(2)
LOVE THIS! Correctly works on UTF-8 & much more efficient than constant Replace() operations on a string. A Stream might be slightly more efficient than a List<byte>, though not as readable, & not a huge deal if the strings aren't too long. (For those interested, see #4016102 ). I've worked w/Unicode translations a lot now, & I still endorse the intermediate byte list/array/stream before decoding; any other way of dealing with the variable number of bytes of UTF-8 encoding is insanely complicated and not worth it.Rung
System.FormatException: 'Could not find any recognizable digits.'. on int hex = Convert.ToInt32(sHex, 16);Jitter
D
6

If you are decoding quoted-printable text that uses UTF-8, be aware that you cannot decode each quoted-printable sequence one at a time, as the other answers do, when several encoded sequences appear in a run.

For example, if you have the sequence =E2=80=99 and decode it with UTF-8 one sequence at a time, you get three "weird" characters. If you instead build an array of the three bytes and convert that array with the UTF-8 encoding, you get a single apostrophe.

Obviously, if you are using ASCII then decoding one at a time is no problem; however, decoding runs means your code will work regardless of the text encoding used.
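
As a rough illustration of the bytes-first idea, here is a minimal sketch. The class and method names (QuotedPrintableSketch, DecodeToString) are made up for this example and do not come from the framework or from any answer here. It assumes the input is the raw quoted-printable body, and it accepts a bare "=\n" soft break as a convenience.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuotedPrintableSketch
{
    // Hypothetical helper: collect all decoded bytes first, then convert the
    // whole byte array to a string with a single encoding call.
    public static string DecodeToString(string input, Encoding encoding)
    {
        var bytes = new List<byte>(input.Length);
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            if (c == '=' && i + 2 < input.Length
                && input[i + 1] == '\r' && input[i + 2] == '\n')
            {
                i += 3; // soft line break "=\r\n": emits nothing
            }
            else if (c == '=' && i + 1 < input.Length && input[i + 1] == '\n')
            {
                i += 2; // bare "=\n" soft break (tolerated here as an assumption)
            }
            else if (c == '=' && i + 2 < input.Length
                     && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
            {
                // "=XX": two hex digits become one byte
                bytes.Add(Convert.ToByte(input.Substring(i + 1, 2), 16));
                i += 3;
            }
            else
            {
                bytes.Add((byte)c); // literal character, assumed to be ASCII
                i++;
            }
        }
        return encoding.GetString(bytes.ToArray());
    }
}
```

With this sketch, QuotedPrintableSketch.DecodeToString("=E2=80=99", Encoding.UTF8) returns the single character U+2019, because the three bytes reach the UTF-8 decoder together.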

Oh and don't forget =3D is a special case that means you need to decode whatever you have one more time... That is a crazy gotcha!

Hope that helps

Deice answered 22/12, 2011 at 9:39 Comment(1)
Yep, this is closer to Skeet's answer over here. We should think of Quoted Printable not as characters, but as a means of serializing bytes. First decode to bytes, then decode that byte array to a string for your encoding -- System.Text.Encoding.UTF8.GetString(byteArray) or what have you. Trying to "salvage" characters from bytes less than 128 is a head-threading mistake.Kowalczyk
I
2
    private string quotedprintable(string data, string encoding)
    {
        data = data.Replace("=\r\n", "");
        // Resume each search after the text decoded so far, so a decoded "=" is not treated as a new escape.
        int start = 0;
        for (int position = -1; (position = data.IndexOf("=", start)) != -1;)
        {
            string leftpart = data.Substring(0, position);
            System.Collections.ArrayList hex = new System.Collections.ArrayList();
            hex.Add(data.Substring(1 + position, 2));
            // Collect directly adjacent "=XX" escapes so multi-byte characters are decoded together.
            while (position + 3 < data.Length && data.Substring(position + 3, 1) == "=")
            {
                position = position + 3;
                hex.Add(data.Substring(1 + position, 2));
            }
            byte[] bytes = new byte[hex.Count];
            for (int i = 0; i < hex.Count; i++)
            {
                bytes[i] = System.Convert.ToByte((string)hex[i], 16);
            }
            string equivalent = System.Text.Encoding.GetEncoding(encoding).GetString(bytes);
            string rightpart = data.Substring(position + 3);
            data = leftpart + equivalent + rightpart;
            start = leftpart.Length + equivalent.Length;
        }
        return data;
    }
Illdisposed answered 22/4, 2016 at 21:55 Comment(3)
this.Text = quotedprintable("=3D", "utf-8");Illdisposed
this supports sequential hex combined as bytes then converted to the encodingIlldisposed
Works for multiple consecutive multi-byte characters. Excellent!Monolingual
D
1

Better solution

    private static string DecodeQuotedPrintables(string input, string charSet)
    {
        Encoding enc;
        try
        {
            enc = Encoding.GetEncoding(charSet);
        }
        catch
        {
            // Unknown or missing charset name: fall back to UTF-8
            enc = new UTF8Encoding();
        }

        // Match runs of consecutive "=XX" escapes so multi-byte characters are decoded together
        var occurences = new Regex(@"(=[0-9A-F]{2}){1,}", RegexOptions.Multiline);
        var matches = occurences.Matches(input);

        foreach (Match match in matches)
        {
            try
            {
                byte[] b = new byte[match.Groups[0].Value.Length / 3];
                for (int i = 0; i < match.Groups[0].Value.Length / 3; i++)
                {
                    b[i] = byte.Parse(match.Groups[0].Value.Substring(i * 3 + 1, 2), System.Globalization.NumberStyles.AllowHexSpecifier);
                }
                char[] hexChar = enc.GetChars(b);
                // A run can decode to several characters, so use all of them
                input = input.Replace(match.Groups[0].Value, new string(hexChar));
            }
            catch
            { ;}
        }
        input = input.Replace("=\r\n", "").Replace("=\n", "").Replace("?=", "");

        return input;
    }
Dahle answered 1/2, 2012 at 10:18 Comment(0)
D
1

This Quoted Printable Decoder works great!

public static byte[] FromHex(byte[] hexData)
    {
        if (hexData == null)
        {
            throw new ArgumentNullException("hexData");
        }

        if (hexData.Length < 2 || (hexData.Length / (double)2 != Math.Floor(hexData.Length / (double)2)))
        {
            throw new Exception("Illegal hex data, hex data must be in two bytes pairs, for example: 0F,FF,A3,... .");
        }

        MemoryStream retVal = new MemoryStream(hexData.Length / 2);
        // Loop hex value pairs
        for (int i = 0; i < hexData.Length; i += 2)
        {
            byte[] hexPairInDecimal = new byte[2];
            // We need to convert hex char to decimal number, for example F = 15
            for (int h = 0; h < 2; h++)
            {
                if (((char)hexData[i + h]) == '0')
                {
                    hexPairInDecimal[h] = 0;
                }
                else if (((char)hexData[i + h]) == '1')
                {
                    hexPairInDecimal[h] = 1;
                }
                else if (((char)hexData[i + h]) == '2')
                {
                    hexPairInDecimal[h] = 2;
                }
                else if (((char)hexData[i + h]) == '3')
                {
                    hexPairInDecimal[h] = 3;
                }
                else if (((char)hexData[i + h]) == '4')
                {
                    hexPairInDecimal[h] = 4;
                }
                else if (((char)hexData[i + h]) == '5')
                {
                    hexPairInDecimal[h] = 5;
                }
                else if (((char)hexData[i + h]) == '6')
                {
                    hexPairInDecimal[h] = 6;
                }
                else if (((char)hexData[i + h]) == '7')
                {
                    hexPairInDecimal[h] = 7;
                }
                else if (((char)hexData[i + h]) == '8')
                {
                    hexPairInDecimal[h] = 8;
                }
                else if (((char)hexData[i + h]) == '9')
                {
                    hexPairInDecimal[h] = 9;
                }
                else if (((char)hexData[i + h]) == 'A' || ((char)hexData[i + h]) == 'a')
                {
                    hexPairInDecimal[h] = 10;
                }
                else if (((char)hexData[i + h]) == 'B' || ((char)hexData[i + h]) == 'b')
                {
                    hexPairInDecimal[h] = 11;
                }
                else if (((char)hexData[i + h]) == 'C' || ((char)hexData[i + h]) == 'c')
                {
                    hexPairInDecimal[h] = 12;
                }
                else if (((char)hexData[i + h]) == 'D' || ((char)hexData[i + h]) == 'd')
                {
                    hexPairInDecimal[h] = 13;
                }
                else if (((char)hexData[i + h]) == 'E' || ((char)hexData[i + h]) == 'e')
                {
                    hexPairInDecimal[h] = 14;
                }
                else if (((char)hexData[i + h]) == 'F' || ((char)hexData[i + h]) == 'f')
                {
                    hexPairInDecimal[h] = 15;
                }
            }

            // Join the high 4 bits (left hex char) and the low 4 bits (right hex char) into one 8-bit byte
            retVal.WriteByte((byte)((hexPairInDecimal[0] << 4) | hexPairInDecimal[1]));
        }

        return retVal.ToArray();
    }
    public static byte[] QuotedPrintableDecode(byte[] data)
    {
        if (data == null)
        {
            throw new ArgumentNullException("data");
        }

        MemoryStream msRetVal = new MemoryStream();
        MemoryStream msSourceStream = new MemoryStream(data);

        int b = msSourceStream.ReadByte();
        while (b > -1)
        {
            // Encoded 8-bit byte(=XX) or soft line break(=CRLF)
            if (b == '=')
            {
                byte[] buffer = new byte[2];
                int nCount = msSourceStream.Read(buffer, 0, 2);
                if (nCount == 2)
                {
                    // Soft line break: the line was split, so just skip the CRLF
                    if (buffer[0] == '\r' && buffer[1] == '\n')
                    {
                    }
                    // This must be encoded 8-bit byte
                    else
                    {
                        try
                        {
                            msRetVal.Write(FromHex(buffer), 0, 1);
                        }
                        catch
                        {
                            // Illegal value after =, just leave it as is
                            msRetVal.WriteByte((byte)'=');
                            msRetVal.Write(buffer, 0, 2);
                        }
                    }
                }
                // Illegal "=" (fewer than two bytes follow it), just leave it as is
                else
                {
                    msRetVal.WriteByte((byte)'=');
                    msRetVal.Write(buffer, 0, nCount);
                }
            }
            // Just write back all other bytes
            else
            {
                msRetVal.WriteByte((byte)b);
            }

            // Read next byte
            b = msSourceStream.ReadByte();
        }

        return msRetVal.ToArray();
    }
Doublereed answered 15/2, 2012 at 14:21 Comment(0)
S
1

The only one that worked for me.

http://sourceforge.net/apps/trac/syncmldotnet/wiki/Quoted%20Printable

If you just need to decode quoted-printable strings, copy these three functions from the link above into your code:

    HexDecoderEvaluator(Match m)
    HexDecoder(string line)
    Decode(string encodedText)

And then just:

var humanReadable = Decode(myQPString);

Enjoy

Surrogate answered 17/9, 2013 at 6:32 Comment(0)
P
0

Sometimes a string in an EML file is composed of several encoded parts. This function applies Dave's method to such cases:

public string DecodeQP(string codedstring)
{
    Regex codified;

    codified=new Regex(@"=\?((?!\?=).)*\?=", RegexOptions.IgnoreCase);
    MatchCollection setMatches = codified.Matches(codedstring);
    if(setMatches.Count > 0)
    {
        Attachment attdecode;
        codedstring= "";
        foreach (Match match in setMatches)
        {
            attdecode = Attachment.CreateAttachmentFromString("", match.Value);
            codedstring+= attdecode.Name;

        }                
    }
    return codedstring;
}
Professor answered 21/12, 2016 at 14:54 Comment(0)
B
0

Please note: solutions based on "input.Replace" are all over the Internet, and they are still not correct.

Once ONE symbol has been decoded, "Replace" replaces ALL occurrences of that sequence in "input", including any that earlier replacements created, and every decoding step that follows is then broken.
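
One way to avoid repeated global replacements, shown here only as a sketch, is a single pass with Regex.Replace and a match evaluator, so each escape run is decoded exactly once at the position where it occurs. The class and method names (SinglePassQuotedPrintable, Decode) are hypothetical, and the sketch assumes soft line breaks can be stripped up front.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class SinglePassQuotedPrintable
{
    // Runs of consecutive "=XX" escapes; hex digits are accepted in either case.
    private static readonly Regex EscapeRun =
        new Regex(@"(?:=[0-9A-Fa-f]{2})+", RegexOptions.Compiled);

    // Hypothetical helper: the output of one replacement is never scanned again,
    // so a decoded "=" cannot combine with the text that follows it.
    public static string Decode(string input, Encoding encoding)
    {
        // Drop soft line breaks before looking for escapes.
        string joined = input.Replace("=\r\n", "").Replace("=\n", "");

        return EscapeRun.Replace(joined, match =>
        {
            // Each escape is three characters long ("=XX") and yields one byte.
            var bytes = new byte[match.Length / 3];
            for (int i = 0; i < bytes.Length; i++)
            {
                bytes[i] = Convert.ToByte(match.Value.Substring(i * 3 + 1, 2), 16);
            }
            return encoding.GetString(bytes);
        });
    }
}
```

For the input "=3D41=41" this returns "=41A": the "=" produced by "=3D" stays a literal, whereas a loop of global replacements can go on to decode the newly formed "=41" as well.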

More correct solution:

public static string DecodeQuotedPrintable(string input, string charSet)
    {

        Encoding enc;

        try
        {
            enc = Encoding.GetEncoding(charSet);
        }
        catch
        {
            enc = new UTF8Encoding();
        }

        input = input.Replace("=\r\n=", "=");
        input = input.Replace("=\r\n ", "\r\n ");
        input = input.Replace("= \r\n", " \r\n");
        var occurences = new Regex(@"(=[0-9A-Z]{2})", RegexOptions.Multiline); //{1,}
        var matches = occurences.Matches(input);

        foreach (Match match in matches)
        {
            try
            {
                byte[] b = new byte[match.Groups[0].Value.Length / 3];
                for (int i = 0; i < match.Groups[0].Value.Length / 3; i++)
                {
                    b[i] = byte.Parse(match.Groups[0].Value.Substring(i * 3 + 1, 2), System.Globalization.NumberStyles.AllowHexSpecifier);
                }
                char[] hexChar = enc.GetChars(b);
                input = input.Replace(match.Groups[0].Value, new String(hexChar));

            }
            catch
            { Console.WriteLine("QP dec err"); }
        }
        input = input.Replace("?=", ""); //.Replace("\r\n", "");

        return input;
    }
Baritone answered 2/2, 2018 at 13:40 Comment(1)
And one more fun fact: with Unicode, a single hex sequence can be one symbol, or a set of hex sequences can map to one symbol: =E2=80=99 and =E2 =80 =99, for example. In short, this solution isn't working at all.Baritone
H
0

I know it's an old question, but this should help

    private static string GetPrintableCharacter(char character)
    {
        switch (character)
        {
            case '\a':
            {
                return "\\a";
            }
            case '\b':
            {
                return "\\b";
            }
            case '\t':
            {
                return "\\t";
            }
            case '\n':
            {
                return "\\n";
            }
            case '\v':
            {
                return "\\v";
            }
            case '\f':
            {
                return "\\f";
            }
            case '\r':
            {
                return "\\r";
            }
            default:
            {
                if (character == ' ')
                {
                    break;
                }
                else
                {
                    throw new InvalidArgumentException(Resources.NOTSUPPORTCHAR, new object[] { character });
                }
            }
        }
        return "\\x20";
    }

    public static string GetPrintableText(string text)
    {
        StringBuilder stringBuilder = new StringBuilder(1024);
        if (text == null)
        {
            return "[~NULL~]";
        }
        if (text.Length == 0)
        {
            return "[~EMPTY~]";
        }
        stringBuilder.Remove(0, stringBuilder.Length);
        int num = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '\a' || text[i] == '\b' || text[i] == '\f' || text[i] == '\v' || text[i] == '\t' || text[i] == '\n' || text[i] == '\r' || text[i] == ' ')
            {
                num += 3;
            }
        }
        int length = text.Length + num;
        if (stringBuilder.Capacity < length)
        {
            stringBuilder = new StringBuilder(length);
        }
        string str = text;
        for (int j = 0; j < str.Length; j++)
        {
            char chr = str[j];
            if (chr > ' ')
            {
                stringBuilder.Append(chr);
            }
            else
            {
                stringBuilder.Append(StringHelper.GetPrintableCharacter(chr));
            }
        }
        return stringBuilder.ToString();
    }
Hokkaido answered 23/3, 2020 at 16:58 Comment(0)
L
0

A slightly improved version of the (non-working) code from Martin Murphy:

static Regex reQuotHex = new Regex(@"=[0-9A-F]{2}", RegexOptions.Multiline|RegexOptions.Compiled);

public static string DecodeQuotedPrintable(string input)
{
    var dic = new Dictionary<string, string>();
    foreach (var qp in new HashSet<string>(reQuotHex.Matches(input).Cast<Match>().Select(m => m.Value)))
        dic[qp] = ((char)Convert.ToInt32(qp.Substring(1), 16)).ToString();
        
    foreach (string qp in dic.Keys) {
        input = input.Replace(qp, dic[qp]);
    }
    return input.Replace("=\r\n", "");
}
Lindemann answered 1/5, 2021 at 1:2 Comment(0)
P
0

Starting from @Dave's solution, this decodes strings that contain more than one encoded part, for example "=?utf-8?Q?Firststring?=\t=?utf-8?Q?_-_1.250=2C50_=E2=82=AC=5F1000=5F2646.pdf?="

public static string DecodeQuotedPrintable(string text)
{
    Regex quotedPrintableEncodingRegex = new Regex(@"=\?((?!\?=).)*\?=", RegexOptions.IgnoreCase);

    MatchCollection quotedPrintableEncodingMatches = quotedPrintableEncodingRegex.Matches(text);

    if (quotedPrintableEncodingMatches.Count <= 0) 
        return text;

    var decodedText = "";
    foreach (Match match in quotedPrintableEncodingMatches)
    {
        Attachment decodedTextPart = Attachment.CreateAttachmentFromString("", match.Value);
        decodedText += decodedTextPart.Name;
    }
    return decodedText;
}
Pyelography answered 8/12, 2022 at 13:32 Comment(0)
T
-1
public static string DecodeQuotedPrintables(string input, Encoding encoding)
    {
        var regex = new Regex(@"\=(?<Symbol>[0-9A-Z]{2})", RegexOptions.Multiline);
        var matches = regex.Matches(input);
        var bytes = new byte[matches.Count];

        for (var i = 0; i < matches.Count; i++)
        {
            bytes[i] = Convert.ToByte(matches[i].Groups["Symbol"].Value, 16);
        }

        return encoding.GetString(bytes);
    }
Tissue answered 9/4, 2014 at 6:44 Comment(1)
Try to explain your solution a bit. You can do it so by editing your answer.Censer
