What is the most efficient binary to text encoding?

The closest contenders that I could find so far are yEnc (2% overhead) and ASCII85 (25% overhead). There seem to be some issues around yEnc, mainly around the fact that it uses an 8-bit character set. Which leads to another thought: is there a binary to text encoding based on the UTF-8 character set?

Mendicant answered 9/6, 2009 at 16:43 Comment(1)
Note that yEnc does not convert binary to text; it converts binary to something that is compatible with the news protocol (NNTP), which does not necessarily meet any character set requirements, let alone that it would be all printable text.Voigt

This really depends on the nature of the binary data, and the constraints that "text" places on your output.

First off, if your binary data is not compressed, try compressing before encoding. We can then assume that the distribution of bits and of individual bytes is more or less random.

Now: why do you need text? Typically, it's because the communication channel does not pass through all characters equally. e.g. you may require pure ASCII text, whose printable characters range from 0x20-0x7E. You have 95 characters to play with. Each character can theoretically encode log2(95) ~= 6.57 bits per character. It's easy to define a transform that comes pretty close.
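To turn that figure into an overhead number comparable to the ones in the question, here is a quick back-of-the-envelope check (Python, for illustration only):

import math

bits_per_char = math.log2(95)       # ~6.57 usable bits per printable ASCII character
overhead = 8 / bits_per_char - 1    # ~21.8%: the theoretical floor for an encoding
                                    # restricted to the 95 printable ASCII characters
print(f"{bits_per_char:.2f} bits/char, {overhead:.1%} overhead")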

But: what if you need a separator character? Now you only have 94 characters, etc. So the choice of an encoding really depends on your requirements.

To take an extremely stupid example: if your channel passes all 256 characters without issues, and you don't need any separators, then you can write a trivial transform that achieves 100% efficiency. :-) How to do so is left as an exercise for the reader.

UTF-8 is not a good transport for arbitrary binary data. It can carry the values 0x01-0x7F with only about 14% overhead (each output byte carries 7 bits of payload). 0x00 is technically a valid UTF-8 sequence, but most text channels will not pass a NUL, so treat it as unavailable. Anything from 0x80 upward expands to multiple bytes in UTF-8. I'd treat UTF-8 as a constrained channel that passes 0x01-0x7F, i.e. 127 unique characters. If you don't need delimiters then you can transmit log2(127) ≈ 6.99 bits per character.

A general solution to this problem: assume an alphabet of N characters whose binary encodings are 0 to N-1. (If the encodings are not as assumed, then use a lookup table to translate between our intermediate 0..N-1 representation and what you actually send and receive.)

Assume 95 characters in the alphabet. Now: some of these symbols will represent 6 bits, and some will represent 7 bits. If we have A 6-bit symbols and B 7-bit symbols, then:

A + B = 95    (total number of symbols)
2A + B = 128  (total number of 7-bit prefixes: each 6-bit symbol accounts for two prefixes, each 7-bit symbol for one)

Solving the system, you get: A=33, B=62. You now build a table of symbols:

Raw     Encoded
000000  0000000
000001  0000001
...
100000  0100000
1000010 0100001
1000011 0100010
...
1111110 1011101
1111111 1011110

To encode, first shift off 6 bits of input. If those six bits are greater than or equal to 100001, shift off another bit. Then look up the corresponding 7-bit output code, translate it to fit the output space, and send it. You will be shifting 6 or 7 bits of input each iteration.

To decode, accept a byte and translate to raw output code. If the raw code is less than 0100001 then shift the corresponding 6 bits onto your output. Otherwise shift the corresponding 7 bits onto your output. You will be generating 6-7 bits of output each iteration.
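Here is a minimal Python sketch of that encode/decode loop, assuming the 95 symbol values 0..94 are simply mapped onto the printable ASCII range 0x20-0x7E. End-of-stream padding is only hinted at (the decoder needs the original bit length), so treat it as an illustration rather than a finished codec:

ALPHABET = [chr(0x20 + i) for i in range(95)]  # symbol values 0..94 -> printable ASCII
SIX_BIT_SYMBOLS = 33                           # A = 33: raw values 0..32 use 6 bits

def encode(data: bytes) -> str:
    bits = ''.join(f'{b:08b}' for b in data)
    n = len(bits)
    bits += '0' * 7                            # zero-pad so the final read never runs short
    out, i = [], 0
    while i < n:
        v = int(bits[i:i + 6], 2)
        if v < SIX_BIT_SYMBOLS:                # 6-bit symbol
            out.append(ALPHABET[v])
            i += 6
        else:                                  # 7-bit symbol: take one more bit (raw 66..127)
            v = int(bits[i:i + 7], 2)
            out.append(ALPHABET[v - SIX_BIT_SYMBOLS])
            i += 7
    return ''.join(out)

def decode(text: str) -> str:
    bits = []
    for ch in text:
        s = ord(ch) - 0x20                     # back to symbol value 0..94
        if s < SIX_BIT_SYMBOLS:
            bits.append(f'{s:06b}')
        else:
            bits.append(f'{s + SIX_BIT_SYMBOLS:07b}')
    return ''.join(bits)                       # bit string; caller trims to the original length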

For uniformly distributed data I think this is optimal. If you know that you have more zeros than ones in your source, then you might want to map the 7-bit codes to the start of the space so that it is more likely that you can use a 7-bit code.

Evolutionist answered 9/6, 2009 at 17:42 Comment(1)
The general solution you describe, encoding each of N equally probable symbols into B or B-1 bits, is also called "phase-in codes", "economy codes", or "truncated binary encoding".Tancred

The short answer would be: No, there still isn't.

I ran into this problem when trying to pack as much information as possible into a JSON string, meaning UTF-8 without control characters, backslash and quotes.

I went out and researched how many bits you can squeeze into valid UTF-8 bytes. I disagree with the answers stating that UTF-8 brings too much overhead. It's not true.

If you take into account only one-byte sequences, it's as powerful as standard ASCII. Meaning 7 bits per byte. But if you cut out all special characters you'll be left with something like Ascii85.

But there are fewer control characters in higher planes. So if you use 6-byte chunks you'll be able to encode 5 bytes per chunk. In the output you'll get any combination of UTF-8 characters of any length (from 1 to 6 bytes).

This will give you a better result than Ascii85: 5/6 instead of 4/5, 83% efficiency instead of 80%. In theory it'll get even better with higher chunk length: about 84% at 19-byte chunks.

In my opinion the encoding process becomes too complicated while it provides very little profit. So Ascii85 or some modified version of it (I'm looking at Z85 now) would be better.

Orgel answered 5/8, 2013 at 10:18 Comment(0)

I searched for the most efficient binary to text encoding last year. I realized for myself that compactness is not the only criterion. The most important is where you are able to use the encoded string. For example, yEnc has 2% overhead, but it is an 8-bit encoding, so its usage is very limited.

My choice is Z85. It has an acceptable 25% overhead, and the encoded string can be used almost everywhere: XML, JSON, source code, etc. See the Z85 specification for details.
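For a sense of how simple the encoder is, here is a minimal Python sketch of the encoding direction (alphabet taken from the Z85 spec; the spec requires the input length to be a multiple of 4 bytes, and decoding is omitted):

Z85_ALPHABET = ("0123456789abcdefghijklmnopqrstuvwxyz"
                "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#")

def z85_encode(data: bytes) -> str:
    assert len(data) % 4 == 0, "Z85 input must be a multiple of 4 bytes"
    out = []
    for i in range(0, len(data), 4):
        value = int.from_bytes(data[i:i + 4], "big")  # 4 bytes as one 32-bit number
        chunk = []
        for _ in range(5):                            # 5 base-85 digits, least significant first
            value, digit = divmod(value, 85)
            chunk.append(Z85_ALPHABET[digit])
        out.extend(reversed(chunk))                   # emit most significant digit first
    return "".join(out)

print(z85_encode(bytes(range(12))))                   # 12 bytes in -> 15 characters out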

Finally, I've written a Z85 library in C/C++ and use it in production.

Babbie answered 12/4, 2014 at 21:19 Comment(0)

According to Wikipedia

basE91 produces the shortest plain ASCII output for compressed 8-bit binary input.

Hexone answered 14/12, 2010 at 16:37 Comment(3)
basE91 is more efficient than base64 and Z85. But be careful when displaying its output in HTML: it uses characters like <, >, and &, which should be escaped (Z85 also has this issue).Slavey
Can we do better with UTF-8?Oruro
Yes @Oruro - Base122 Encoding is a space efficient UTF-8 binary-to-text encoding ~14% smaller than equivalent base-64 encoded data: blog.kevinalbs.com/base122#text_encodings_and_utf8Clangor

Currently base91 is the best encoding if you're limited to ASCII characters only and don't want to use non-printable characters. It also has the advantage of lightning-fast encoding/decoding speed, because a lookup table can be used, unlike base85, which has to be decoded using slow divisions and is much more difficult to vectorize.

Going above that, base122 will help increase efficiency a little bit, but it's not 8-bit clean. However, because it's based on UTF-8 encoding, it should be fine to use for many purposes, and 8-bit cleanliness is largely a non-issue nowadays.

Note that base122 is in fact base-128, because the 6 invalid values (128 - 122) are encoded specially so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128, where 7 bits are encoded in 1 byte. And in reality base122 can be optimized to be more efficient than base-128.

Base-122 Encoding

Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.

[Figure: Base-122 encoding scheme]

http://blog.kevinalbs.com/base122
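To make the two-byte case concrete, here is a small illustrative Python helper (hypothetical name; it only demonstrates the 110sss1x 10xxxxxx layout described above, assuming the x bit in the first byte is the most significant of the seven payload bits, and is not a complete codec):

def two_byte_char(illegal_index: int, next7: int) -> bytes:
    # 110 sss 1 x | 10 xxxxxx -- sss indexes one of the six illegal code points,
    # the forced 1 keeps the code point >= 0x80, the x bits carry the next 7 input bits
    assert 0 <= illegal_index < 6 and 0 <= next7 < 128
    b1 = 0b11000010 | (illegal_index << 2) | (next7 >> 6)
    b2 = 0b10000000 | (next7 & 0b00111111)
    return bytes([b1, b2])

print(two_byte_char(3, 0b1010101).hex())  # 'cf95', a valid two-byte UTF-8 sequence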

See also How viable is base128 encoding for scenarios like JavaScript strings?

Antioch answered 16/4, 2018 at 16:13 Comment(2)
Just one note: it seems that it could generate control characters which can't be copied by the user (e.g. via copy-paste).Scooter
Only programs can be "not 8 bit clean": software that has issues with text containing bytes in the 128-255 range either fails to preserve the data or fails entirely. Data that uses all 8 bits could be said to be "not 7 bit clean", or "using codes beyond ASCII", or "requiring 8 bit clean software", etc.Policeman

Besides the ones listed on Wikipedia, there is Bommanews:

B-News (or bommanews) was developed to lift the weight of the overhead inherent to UUEncode and Base64 encoding: it uses a new encoding method to stuff binary data in text messages. This method eats more CPU resources, but it manages to lower the loss from approximately 40% for UUEncode to 3.5% (the decimal point between those digits is not dirt on your monitor), while still avoiding the use of ANSI control codes in the message body.

It's comparable to yEnc: source

yEnc is less CPU-intensive than B-News and reaches about the same low level of overhead, but it doesn't avoid the use of all control codes, it just leaves out those that were (experimentally) observed to have undesired effects on some servers, which means that it's somewhat less RFC compliant than B-News.

Incurrence answered 29/3, 2012 at 23:7 Comment(4)
The FAQ of Bommanews does not go into which character encodings are supported. I presume most 8-bit code pages, although 7F may be present, and that's a control code in e.g. the IBM OEM character set. Even in the Windows code pages, 81, 8D, 8F, 90, and 9D are control characters. Beware when printing this stuff, because data will be lost.Voigt
@Maarten: B-News used characters 0x20 - 0xFF. Each character was a single digit of a base-224 number, offset by 0x20. Each line of "text" was a huge number that was converted from and to binary in the decoding and encoding process. Yenc uses almost the full 0x00 to 0xFF range, each byte in the binary input simply copied to the text output, escaping only 0x00, 0x0A and 0x0D (and the escape character itself, which I don't remember what that was exactly).Knossos
In the end I have revisited this and voted it down. yEnc and B-news are for handling the news protocol (NNTP if I'm not mistaken) and these encodings do not specifically target a character set such as UTF-8, ASCII or Windows-1252 because of that. Note that this mistake is also kind of present in the question, so I'm just a bit unfair here.Voigt
b-news and yEnc do not play well in web browsers for display purposes. base64 and base91 can be copy-pasted easily, while b-news/yEnc cannot.Slavey

If you are looking for an efficient encoding for large alphabets, you might want to try escapeless. Both escapeless252 and yEnc have 1.6% overhead, but with the former it's fixed and known in advance, while with the latter it actually ranges from 0 to 100% depending on the distribution of bytes.

Customhouse answered 3/6, 2019 at 17:51 Comment(0)
