What is "=C2=A0" in MIME encoded, quoted-printable text?
C

3

71

This is an example raw email I am trying to parse:

MIME-version: 1.0
Content-type: text/html; charset=UTF-8
Content-transfer-encoding: quoted-printable
X-Mailer: Verizon Webmail
X-Originating-IP: [x.x.x.x]

=C2=A0test testing testing 123

What is =C2=A0? I have tried a half dozen quoted-printable parsers, but none handle this correctly. How would one properly parse this in C#?

Honestly, for now, I'm coding:

//TODO WTF
encoded = encoded.Replace("=C2=A0", "");

Because I can't figure out why that text is there randomly within the MIME content, and isn't supposed to be rendered into anything. By just removing it, I'm getting the desired effect - but WHY?!

To be clear, I know that (=[0-9A-F]{2}) is an encoded character. But in this case, it seemingly represents NOTHING.

Chesterfield asked 5/5, 2010 at 15:15 Comment(0)
D
132

=C2=A0 represents the two bytes C2 A0. Since the charset here is UTF-8, those two bytes together decode to a single character, U+00A0, which is the Unicode code point for a non-breaking space.

See UTF-8 (Wikipedia).
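As a rough illustration of why per-character parsers get this wrong, here is a minimal sketch that decodes the quoted-printable escapes into raw bytes first and only then interprets the bytes with the declared charset. The class and method names (`QuotedPrintableSketch`, `Decode`) are made up for this example, it assumes the body is plain ASCII outside the escapes, and it is not a complete MIME parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not a full MIME implementation.
public static class QuotedPrintableSketch
{
    // Step 1: turn the quoted-printable text into bytes.
    // Step 2: decode those bytes with the charset from the Content-type header.
    public static string Decode(string input, Encoding charset)
    {
        var bytes = new List<byte>(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c != '=')
            {
                // Assumption: everything outside an escape is 7-bit ASCII.
                bytes.Add((byte)c);
                continue;
            }
            // Soft line break: "=" followed by CRLF (or a bare LF) is dropped.
            if (i + 2 < input.Length && input[i + 1] == '\r' && input[i + 2] == '\n')
            {
                i += 2;
                continue;
            }
            if (i + 1 < input.Length && input[i + 1] == '\n')
            {
                i += 1;
                continue;
            }
            // Escaped byte: "=" followed by two hex digits.
            if (i + 2 < input.Length
                && Uri.IsHexDigit(input[i + 1])
                && Uri.IsHexDigit(input[i + 2]))
            {
                bytes.Add((byte)(Uri.FromHex(input[i + 1]) * 16 + Uri.FromHex(input[i + 2])));
                i += 2;
                continue;
            }
            // Malformed escape: keep the "=" literally rather than guess.
            bytes.Add((byte)'=');
        }
        // Only now are the bytes turned into characters, so C2 A0 becomes one char.
        return charset.GetString(bytes.ToArray());
    }
}
```

With this sketch, `QuotedPrintableSketch.Decode("=C2=A0test testing testing 123", Encoding.UTF8)` returns a string whose first character is U+00A0, followed by the visible text. Casting each escaped byte to `char` on its own yields two separate characters (U+00C2 and U+00A0) instead.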

Dewey answered 5/5, 2010 at 15:20 Comment(12)
What is the way to parse this in C#? All of the parsers I've tried operate on each char independently, and do this: int iHex = Convert.ToInt32(hex, 16); char c = (char)iHex;Chesterfield
Does UTF-8 always encode in 2 bytes like this? Can I assume a match of (=[0-9A-F]{2}=[0-9A-F]{2}) instead of the single byte? Why the hell isn't there a parser for this?!?!?!?!Chesterfield
If you read up on UTF-8, you'll see that any single-byte value that exceeds 7F has to be coded into two characters, and the first one will always have its high bit set. So, yes, A0 is always coded as C2 A0, which means you can't go byte-by-byte. The right way to handle UTF-8 with quoted-encoding is to first decode the quoted part and then decode the UTF-8, resulting in a string composed of 2-byte characters (technically UCS-16 or UTF-16).Dewey
Let me also add that I've used Chilkat's S/MIME control to parse email messages for me, and it does a really good job. It's also quite cheap.Dewey
Thanks Steven. I'll go ahead and purchase that because I'm sick of hacking this crap together. :)Chesterfield
Actually, I love writing MIME parsers and such, but I simply can't justify spending days to produce something with a fraction of the functionality of a cheap, reliable third-party control. Even if I were paid minimum wage, it would not be cost-effective.Dewey
&nbsp; is the HTML entity for this, btw. Just in case anyone cares.Revell
This answer shows how to decode quoted-printable in C# #2227054Millian
@Propend It worked four years ago. You can find the same sort of information now at en.wikipedia.org/wiki/UTF-8Dewey
@StevenSudit just thinking it would be good to update the url in the answerPropend
There are a couple of technical errors in @StevenSudit's comment above. UTF-8 is a variable width encoding: codepoints above U+7F are encoded with at least 2 bytes, but may need 3 or 4 bytes, depending on the value being encoded. UCS-2 (not UCS-16) is a fixed-width 16-bit encoding, but cannot encode all of Unicode, and is rarely used any more. UTF-16, which is more commonly used, is another variable width encoding, with characters taking either 2 or 4 bytes. To represent all Unicode codepoints in a fixed-width encoding you need the 4-byte UCS-4.Cardinal
@Cardinal You're correct about it being UCS-2, not UCS-16. As for the rest, I was silent on the matter of values that cannot be encoded in two bytes, not mistaken. Nothing about this question required delving into that topic.Dewey
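To make the variable-width point from the comments concrete, a small sketch like the following reports how many bytes a string takes in UTF-8, UTF-16 and UTF-32. The class name `EncodingWidths` and its methods are hypothetical; the calls underneath are the standard `System.Text.Encoding` ones.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class EncodingWidths
{
    // Number of bytes the string occupies in UTF-8 (no BOM counted).
    public static int Utf8Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    // Number of bytes in UTF-16 (Encoding.Unicode is little-endian UTF-16).
    public static int Utf16Bytes(string s)
    {
        return Encoding.Unicode.GetByteCount(s);
    }

    // Number of bytes in UTF-32, which is fixed at 4 bytes per code point.
    public static int Utf32Bytes(string s)
    {
        return Encoding.UTF32.GetByteCount(s);
    }
}
```

For example, `"A"` takes 1 byte in UTF-8, `"\u00A0"` takes 2, `"\u20AC"` (the euro sign) takes 3, and `"\U0001F600"` takes 4. In UTF-16 the first three take 2 bytes each and the last takes 4, because it needs a surrogate pair.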
G
3

%C2%A0 is the percent-encoded (URL-encoded) form of the same two UTF-8 bytes, C2 A0, so it is also a non-breaking space.
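As a quick check of the percent-encoded form, `Uri.UnescapeDataString` from the .NET base class library decodes percent-escapes as UTF-8. The wrapper class `PercentDecodeSketch` below is only a hypothetical name for this example.

```csharp
using System;

// Hypothetical wrapper for illustration only.
public static class PercentDecodeSketch
{
    // Converts percent-escapes such as %C2%A0 back into characters,
    // treating the escaped bytes as UTF-8.
    public static string Decode(string s)
    {
        return Uri.UnescapeDataString(s);
    }
}
```

Calling `PercentDecodeSketch.Decode("%C2%A0test")` gives a five-character string that starts with U+00A0.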

Gaseous answered 10/2, 2021 at 4:18 Comment(0)
I
-15

%C2%A0 This is the code of a hidden folder, create a hidden folder and save in it, for example, a text file, then open this file through a browser and you will see these characters in the search bar. As I understand it, these characters are optional and do not translate to other code.

Indebtedness answered 25/11, 2019 at 12:55 Comment(0)
