Why is “-” (hyphen) the unique ASCII limitation for E-mail compatibility?

Asked 4/6, 2018 at 6:36 Answered 20/6, 2018 at 4:20

I was reading this proposal for Base91, (with bold formatting added by me):

All the SMTP-based E-mail can provide compatibility with the E-mail. So-called compatibility with the E-mail is to transform arbitrary 8-bit data byte-strings or arbitrary bit stream data transferred by the E-mail into a character-strings of a limited ASCII. The main limitation on the latter is that:
(a) the characters have to be printable;
(b) the characters are not control character or “-” (hyphen).
There are totally 94 of such ASCII characters, their corresponding digital coding being all integers ranging from 32 through 126 with the exception of 45. E-mail written in these ASCII characters is compatible with the Internet standard SMTP, and can be transferred in nearly all the E-mail systems.

^{Note: 45 is the ASCII value for hyphen.}
^{Note: I just figured out that this proposal originates from patents in China (ZL00112884.1) and US (US6859151B2).}

But I also read RFC 5321 regarding SMTP, and I couldn't find anything that singles out the hyphen as the only printable ASCII character to exclude.

^{Note: The printable ASCII range is:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~}

Why is the Base91 proposal/patent claiming that “-” (hyphen) is the only limitation for E-mail compatibility?

Paramecium answered 4/6, 2018 at 6:36 Comment(0)

Looks like the hyphen is used as a control/marker character in multiline SMTP messages.

RFC 5321, Section 4.2.1, Reply Code Severities and Theory:

The format for multiline replies requires that every line, except the last, begin with the reply code, followed immediately by a hyphen, "-" (also known as minus), followed by text. The last line will begin with the reply code, followed immediately by <SP>, optionally some text, and <CRLF>. As noted above, servers SHOULD send the <SP> if subsequent text is not sent, but clients MUST be prepared for it to be omitted.

The Base91 proposal uses SMTP as an example of both its application and its limitations. As you state, it originally wanted to use 94 characters, but due to various standards (e.g. SMTP), it excludes commonly used pseudo-control characters ("-", ".", "="). It uses SMTP because it demonstrates the practicality of a Base91 encoding (e.g. encoding 13 bits of data per two characters, rather than the 12 bits Base64 fits into two characters, reduces the number of characters required to encode any given message), in addition to acknowledging that SMTP's use of the hyphen as a control character won't cause ambiguity in Base91 text.

Any data can be encoded with Base91: the paper states that it maps 13 bits of data into two printable ASCII characters. Any number or character (including newline characters) can be encoded with Base91, just as it can with Base64. Likewise, the mapping can be reversed to recover the original input from a Base91 encoding.
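To make the 13-bits-into-two-characters idea concrete, here is a minimal sketch in Python. It is not the patent's actual mapping: the alphabet (printable ASCII 33 through 126 minus "-", "." and "="), the big-endian bit packing, the zero padding, and the `encode`/`decode` names are all assumptions made for illustration. It only relies on the fact that 91 × 91 = 8281 ≥ 2^13 = 8192.

```python
# Hypothetical 91-character alphabet: printable ASCII 33..126
# without the three pseudo-control characters "-", "." and "=".
ALPHABET = "".join(chr(c) for c in range(33, 127) if chr(c) not in "-.=")
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(data: bytes) -> str:
    """Pack the input into 13-bit blocks and emit two characters per block."""
    nbits = len(data) * 8
    pad = (-nbits) % 13                      # zero bits appended to fill the last block
    bits = int.from_bytes(data, "big") << pad
    nblocks = (nbits + pad) // 13
    out = []
    for i in range(nblocks):
        shift = 13 * (nblocks - 1 - i)
        block = (bits >> shift) & 0x1FFF     # one 13-bit value, 0..8191
        hi, lo = divmod(block, 91)           # hi <= 90 because 8191 < 91 * 91
        out.append(ALPHABET[hi] + ALPHABET[lo])
    return "".join(out)


def decode(text: str, nbytes: int) -> bytes:
    """Reverse encode(); the caller supplies the original byte length."""
    bits = 0
    for i in range(0, len(text), 2):
        block = INDEX[text[i]] * 91 + INDEX[text[i + 1]]
        bits = (bits << 13) | block
    pad = (len(text) // 2) * 13 - nbytes * 8
    return (bits >> pad).to_bytes(nbytes, "big")


message = b"250-First line\r\n250 The last line\r\n"
encoded = encode(message)
restored = decode(encoded, len(message))
```

Under this sketch the input may contain hyphens and newlines, while the encoded text contains neither. Each 13-bit block occupies two 8-bit characters, so 13 payload bits travel in 16 transmitted bits (81.25%), compared with 6 in 8 (75%) for Base64.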

Here's a multiline SMTP reply code example:

  250-First line
  250-Second line
  250-234 Text beginning with numbers
  250 The last line

In this example, it converts a large multiline SMTP message that contains hyphens, newlines, and numbers into some Base91-encoded form. If this encoded form contained pseudo-control characters, such as the hyphen, SMTP clients might interpret the Base91-encoded data as malformed SMTP data. Characters such as the hyphen are removed from the Base91 character set not because of flaws in SMTP or its specification, but because of the clients that use and parse SMTP data: the goal is to ensure that clients can still accept Base91 data without any risk of misparsing it as SMTP data.
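For reference, the multiline reply format quoted from RFC 5321 above can be sketched as a small parser. This is an illustrative sketch, not code from any real SMTP client; the `parse_reply` name and the regular expression are assumptions. It checks only the three-digit code and the character immediately after it, which is where the hyphen carries meaning.

```python
import re

# Three-digit reply code, then optionally a separator ("-" or space) and text.
# The separator may be missing entirely on a final line with no text.
REPLY_LINE = re.compile(r"^(\d{3})(?:([ -])(.*))?$")


def parse_reply(lines):
    """Return (code, texts) for one reply; a line is final unless "-" follows the code."""
    code = None
    texts = []
    for line in lines:
        match = REPLY_LINE.match(line)
        if match is None:
            raise ValueError(f"malformed reply line: {line!r}")
        line_code, sep, text = match.groups()
        if code is None:
            code = line_code
        elif line_code != code:
            raise ValueError("reply code changed within one reply")
        texts.append(text or "")
        if sep != "-":                # space or nothing: this is the last line
            return int(code), texts
    raise ValueError("reply ended without a final line")


reply = parse_reply([
    "250-First line",
    "250-Second line",
    "250-234 Text beginning with numbers",
    "250 The last line",
])
```

In this sketch, a hyphen anywhere after the fourth character is treated as ordinary text, so only the position right after the reply code is sensitive.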

Straightedge answered 19/6, 2018 at 15:30 Comment(7)

Thank you, but it's not clear how this relates to Base91 encoding: (1) The reply code is digits only, so it can't be Base91-encoded. (2) Base91 output doesn't contain newline characters, so it could be used for the text without interfering, even if it contained hyphens itself. – Atheism 19/6, 2018 at 15:52

"e.g. encoding 13 bits of data per character rather than 6 bits with Base64 can effectively half the amount of bits required to encode any given message" Does it encode 13 bits of data per one 8-bit character? – Barros 19/6, 2018 at 22:54

It's 13 bits per two ASCII characters, so each 13-bit block of data to encode is represented by 16 bits of encoded data. – Straightedge 20/6, 2018 at 1:37

@EdwardShen So, it is not half the amount, but rather 13 bits in Base91 vs. 12 bits in Base64 per two characters. – Barros 20/6, 2018 at 3:48

13 bits in Base91 into 16 bits versus 6 bits in Base64 into 8 bits. – Straightedge 20/6, 2018 at 3:50

How would base91 ever leak into the status codes of basic SMTP? This speculation doesn't make sense. – Danyel 20/6, 2018 at 4:20

@Danyel I agree it's not the best demonstration. But I'll still grant the bounty as it's the closest explanation found. (the bounty was about "finding credible sources") – Atheism 20/6, 2018 at 12:16

My suspicion is that it is simply to make Base91 robust against things people sometimes do with text, e.g. copying and pasting across documents. It doesn't make much sense to expect this to happen a lot, but some word processors will use dashes as hyphenation points.

Danyel answered 20/6, 2018 at 4:20 Comment(0)
