What is the most efficient way to encode an arbitrary GUID into readable ASCII (33-127)?
Asked Answered
R

8

48

The standard string representation of GUID takes about 36 characters. Which is very nice, but also really wasteful. I am wondering, how to encode it in the shortest possible way using all the ASCII characters in the range 33-127. The naive implementation produces 22 characters, simply because 128 bits / 6 bits yields 22.

Huffman encoding is my second best, the only question is how to choose the codes....

The encoding must be lossless, of course.

Raleighraley answered 13/5, 2010 at 14:35 Comment(4)
I'd love to know the point of this. Do you really need to store billions of GUIDs? Because anything less than billions and the significance of even cutting the string length in half is almost not worth the algorithmic hassle.Psychosis
Huffman coding certainly won't work - all symbols in a random GUID are equally likely.Defant
@NickJohnson not sure, because GUID has weird generation rules, one of which comprises the date of generation, which for say, given a 5 years span, the huffman coding during these 5 years may provide a nice reduction. of course I say "may" because I dont know how the date is "hashed". If its badly hashed the huffman could create compression.Chophouse
@Chophouse You can read about GUID schemes on Wikipedia; in a type 4 (random) UUID, all but a few bits are randomly selected.Defant
E
23

Use Base 85. See section 4.1. Why 85? of A Compact Representation of IPv6 Addresses

An IPv6 address, like a GUID is made up of eight 16-bit pieces.

Espousal answered 13/5, 2010 at 14:55 Comment(6)
Looks good, but the characters '%' and '?' are not suitable if the encoding is to be used in database queries. We replace them with ':' and '.', because we do not care about IPv6 formatting issues.Raleighraley
@Mark I could see how this might be a problem, but in many cases if you were storing this in a column and doing TableName.Guid85Encoded like '%blah%' then it won't be a problem because % and ? are on the left hand side of the like operator.Shammer
@CMCDragonkai As the linked RFC states: "any value between 85 and 94 inclusive", but "Selecting 85 allows the use of the smallest possible subset of the ASCII characters". 91 adds complexity without improving efficiency.Espousal
Check this out: iiis.org/CDs2010/CD2010SCI/CCCT_2010/… Base91 has a higher encoding efficiency than Base85/64, and higher encoding rate than Base85. Besides, Base91 provides compatibility with any bit-length input sequence without additional filling declaration except for his codeword self. One can use Base91 as a substitute for Base85 and Base64 to get some benefits in restricted situations.Bulwerlytton
Haha. Good one!Harewood
In case this is not evident: the linked RFC is a 1st April joke. Nobody has seriously proposed to represent IPv6 adresses in Base85. Practically speaking, the drawbacks of having a non-standard representation should outweight the advantage of few saved characters (as compared with say Base64) in 99% of practical cases. If one is extremely interested in saving space at all costs, then just store in binary (16 bytes) and convert to/from text when reading/writing.Fourthclass
F
39

This is an old question, but I had to solve it in order for a system I was working on to be backward compatible.

The exact requirement was for a client-generated identifier that would be written to the database and stored in a 20-character unique column. It never got shown to the user and was not indexed in any way.

Since I couldn't eliminate the requirement, I really wanted to use a Guid (which is statistically unique) and if I could encode it losslessly into 20 characters, then it would be a good solution given the constraints.

Ascii-85 allows you to encode 4 bytes of binary data into 5 bytes of Ascii data. So a 16 byte guid will just fit into 20 Ascii characters using this encoding scheme. A Guid can have 3.1962657931507848761677563491821e+38 discrete values whereas 20 characters of Ascii-85 can have 3.8759531084514355873123178482056e+38 discrete values.

When writing to the database I had some concerns about truncation so no whitespace characters are included in the encoding. I also had issues with collation, which I addressed by excluding lowercase characters from the encoding. Also, it would only ever be passed through a paramaterized command, so any special SQL characters would be escaped automatically.

I've included the C# code to perform Ascii-85 encoding and decoding in case it helps anyone out there. Obviously, depending on your usage you might need to choose a different character set as my constraints made me choose some unusual characters like 'ß' and 'Ø' - but that's the easy part:

/// <summary>
/// This code implements an encoding scheme that uses 85 printable ascii characters 
/// to encode the same volume of information as contained in a Guid.
/// 
/// Ascii-85 can represent 4 binary bytes as 5 Ascii bytes. So a 16 byte Guid can be 
/// represented in 20 Ascii bytes. A Guid can have 
/// 3.1962657931507848761677563491821e+38 discrete values whereas 20 characters of 
/// Ascii-85 can have 3.8759531084514355873123178482056e+38 discrete values.
/// 
/// Lower-case characters are not included in this encoding to avoid collation 
/// issues. 
/// This is a departure from standard Ascii-85 which does include lower case 
/// characters.
/// In addition, no whitespace characters are included as these may be truncated in 
/// the database depending on the storage mechanism - ie VARCHAR vs CHAR.
/// </summary>
internal static class Ascii85
{
    /// <summary>
    /// 85 printable ascii characters with no lower case ones, so database 
    /// collation can't bite us. No ' ' character either so database can't 
    /// truncate it!
    /// Unfortunately, these limitation mean resorting to some strange 
    /// characters like 'Æ' but we won't ever have to type these, so it's ok.
    /// </summary>
    private static readonly char[] kEncodeMap = new[]
    { 
        '0','1','2','3','4','5','6','7','8','9',  // 10
        'A','B','C','D','E','F','G','H','I','J',  // 20
        'K','L','M','N','O','P','Q','R','S','T',  // 30
        'U','V','W','X','Y','Z','|','}','~','{',  // 40
        '!','"','#','$','%','&','\'','(',')','`', // 50
        '*','+',',','-','.','/','[','\\',']','^', // 60
        ':',';','<','=','>','?','@','_','¼','½',  // 70
        '¾','ß','Ç','Ð','€','«','»','¿','•','Ø',  // 80
        '£','†','‡','§','¥'                       // 85
    };

    /// <summary>
    /// A reverse mapping of the <see cref="kEncodeMap"/> array for decoding 
    /// purposes.
    /// </summary>
    private static readonly IDictionary<char, byte> kDecodeMap;

    /// <summary>
    /// Initialises the <see cref="kDecodeMap"/>.
    /// </summary>
    static Ascii85()
    {
        kDecodeMap = new Dictionary<char, byte>();

        for (byte i = 0; i < kEncodeMap.Length; i++)
        {
            kDecodeMap.Add(kEncodeMap[i], i);
        }
    }

    /// <summary>
    /// Decodes an Ascii-85 encoded Guid.
    /// </summary>
    /// <param name="ascii85Encoding">The Guid encoded using Ascii-85.</param>
    /// <returns>A Guid decoded from the parameter.</returns>
    public static Guid Decode(string ascii85Encoding)
    { 
        // Ascii-85 can encode 4 bytes of binary data into 5 bytes of Ascii.
        // Since a Guid is 16 bytes long, the Ascii-85 encoding should be 20
        // characters long.
        if(ascii85Encoding.Length != 20)
        {
            throw new ArgumentException(
                "An encoded Guid should be 20 characters long.", 
                "ascii85Encoding");
        }

        // We only support upper case characters.
        ascii85Encoding = ascii85Encoding.ToUpper();

        // Split the string in half and decode each substring separately.
        var higher = ascii85Encoding.Substring(0, 10).AsciiDecode();
        var lower = ascii85Encoding.Substring(10, 10).AsciiDecode();

        // Convert the decoded substrings into an array of 16-bytes.
        var byteArray = new[]
        {
            (byte)((higher & 0xFF00000000000000) >> 56),        
            (byte)((higher & 0x00FF000000000000) >> 48),        
            (byte)((higher & 0x0000FF0000000000) >> 40),        
            (byte)((higher & 0x000000FF00000000) >> 32),        
            (byte)((higher & 0x00000000FF000000) >> 24),        
            (byte)((higher & 0x0000000000FF0000) >> 16),        
            (byte)((higher & 0x000000000000FF00) >> 8),         
            (byte)((higher & 0x00000000000000FF)),  
            (byte)((lower  & 0xFF00000000000000) >> 56),        
            (byte)((lower  & 0x00FF000000000000) >> 48),        
            (byte)((lower  & 0x0000FF0000000000) >> 40),        
            (byte)((lower  & 0x000000FF00000000) >> 32),        
            (byte)((lower  & 0x00000000FF000000) >> 24),        
            (byte)((lower  & 0x0000000000FF0000) >> 16),        
            (byte)((lower  & 0x000000000000FF00) >> 8),         
            (byte)((lower  & 0x00000000000000FF)),  
        };

        return new Guid(byteArray);
    }

    /// <summary>
    /// Encodes binary data into a plaintext Ascii-85 format string.
    /// </summary>
    /// <param name="guid">The Guid to encode.</param>
    /// <returns>Ascii-85 encoded string</returns>
    public static string Encode(Guid guid)
    {
        // Convert the 128-bit Guid into two 64-bit parts.
        var byteArray = guid.ToByteArray();
        var higher = 
            ((UInt64)byteArray[0] << 56) | ((UInt64)byteArray[1] << 48) | 
            ((UInt64)byteArray[2] << 40) | ((UInt64)byteArray[3] << 32) |
            ((UInt64)byteArray[4] << 24) | ((UInt64)byteArray[5] << 16) | 
            ((UInt64)byteArray[6] << 8)  | byteArray[7];

        var lower = 
            ((UInt64)byteArray[ 8] << 56) | ((UInt64)byteArray[ 9] << 48) | 
            ((UInt64)byteArray[10] << 40) | ((UInt64)byteArray[11] << 32) |
            ((UInt64)byteArray[12] << 24) | ((UInt64)byteArray[13] << 16) | 
            ((UInt64)byteArray[14] << 8)  | byteArray[15];

        var encodedStringBuilder = new StringBuilder();

        // Encode each part into an ascii-85 encoded string.
        encodedStringBuilder.AsciiEncode(higher);
        encodedStringBuilder.AsciiEncode(lower);

        return encodedStringBuilder.ToString();
    }

    /// <summary>
    /// Encodes the given integer using Ascii-85.
    /// </summary>
    /// <param name="encodedStringBuilder">The <see cref="StringBuilder"/> to 
    /// append the results to.</param>
    /// <param name="part">The integer to encode.</param>
    private static void AsciiEncode(
        this StringBuilder encodedStringBuilder, UInt64 part)
    {
        // Nb, the most significant digits in our encoded character will 
        // be the right-most characters.
        var charCount = (UInt32)kEncodeMap.Length;

        // Ascii-85 can encode 4 bytes of binary data into 5 bytes of Ascii.
        // Since a UInt64 is 8 bytes long, the Ascii-85 encoding should be 
        // 10 characters long.
        for (var i = 0; i < 10; i++)
        {
            // Get the remainder when dividing by the base.
            var remainder = part % charCount;

            // Divide by the base.
            part /= charCount;

            // Add the appropriate character for the current value (0-84).
            encodedStringBuilder.Append(kEncodeMap[remainder]);
        }
    }

    /// <summary>
    /// Decodes the given string from Ascii-85 to an integer.
    /// </summary>
    /// <param name="ascii85EncodedString">Decodes a 10 character Ascii-85 
    /// encoded string.</param>
    /// <returns>The integer representation of the parameter.</returns>
    private static UInt64 AsciiDecode(this string ascii85EncodedString)
    {
        if (ascii85EncodedString.Length != 10)
        {
            throw new ArgumentException(
                "An Ascii-85 encoded Uint64 should be 10 characters long.", 
                "ascii85EncodedString");
        }

        // Nb, the most significant digits in our encoded character 
        // will be the right-most characters.
        var charCount = (UInt32)kEncodeMap.Length;
        UInt64 result = 0;

        // Starting with the right-most (most-significant) character, 
        // iterate through the encoded string and decode.
        for (var i = ascii85EncodedString.Length - 1; i >= 0; i--)
        {
            // Multiply the current decoded value by the base.
            result *= charCount;

            // Add the integer value for that encoded character.
            result += kDecodeMap[ascii85EncodedString[i]];
        }

        return result;
    }
}

Also, here are the unit tests. They aren't as thorough as I'd like, and I don't like the non-determinism of where Guid.NewGuid() is used, but they should get you started:

/// <summary>
/// Tests to verify that the Ascii-85 encoding is functioning as expected.
/// </summary>
[TestClass]
[UsedImplicitly]
public class Ascii85Tests
{
    [TestMethod]
    [Description("Ensure that the Ascii-85 encoding is correct.")]
    [UsedImplicitly]
    public void CanEncodeAndDecodeAGuidUsingAscii85()
    {
        var guidStrings = new[]
        {
            "00000000-0000-0000-0000-000000000000",
            "00000000-0000-0000-0000-0000000000FF",
            "00000000-0000-0000-0000-00000000FF00",
            "00000000-0000-0000-0000-000000FF0000",
            "00000000-0000-0000-0000-0000FF000000",
            "00000000-0000-0000-0000-00FF00000000",
            "00000000-0000-0000-0000-FF0000000000",
            "00000000-0000-0000-00FF-000000000000",
            "00000000-0000-0000-FF00-000000000000",
            "00000000-0000-00FF-0000-000000000000",
            "00000000-0000-FF00-0000-000000000000",
            "00000000-00FF-0000-0000-000000000000",
            "00000000-FF00-0000-0000-000000000000",
            "000000FF-0000-0000-0000-000000000000",
            "0000FF00-0000-0000-0000-000000000000",
            "00FF0000-0000-0000-0000-000000000000",
            "FF000000-0000-0000-0000-000000000000",
            "FF000000-0000-0000-0000-00000000FFFF",
            "00000000-0000-0000-0000-0000FFFF0000",
            "00000000-0000-0000-0000-FFFF00000000",
            "00000000-0000-0000-FFFF-000000000000",
            "00000000-0000-FFFF-0000-000000000000",
            "00000000-FFFF-0000-0000-000000000000",
            "0000FFFF-0000-0000-0000-000000000000",
            "FFFF0000-0000-0000-0000-000000000000",
            "00000000-0000-0000-0000-0000FFFFFFFF",
            "00000000-0000-0000-FFFF-FFFF00000000",
            "00000000-FFFF-FFFF-0000-000000000000",
            "FFFFFFFF-0000-0000-0000-000000000000",
            "00000000-0000-0000-FFFF-FFFFFFFFFFFF",
            "FFFFFFFF-FFFF-FFFF-0000-000000000000",
            "FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF",
            "1000000F-100F-100F-100F-10000000000F"
        };

        foreach (var guidString in guidStrings)
        {
            var guid = new Guid(guidString);
            var encoded = Ascii85.Encode(guid);

            Assert.AreEqual(
                20, 
                encoded.Length, 
                "A guid encoding should not exceed 20 characters.");

            var decoded = Ascii85.Decode(encoded);

            Assert.AreEqual(
                guid, 
                decoded, 
                "The guids are different after being encoded and decoded.");
        }
    }

    [TestMethod]
    [Description(
        "The Ascii-85 encoding is not susceptible to changes in character case.")]
    [UsedImplicitly]
    public void Ascii85IsCaseInsensitive()
    {
        const int kCount = 50;

        for (var i = 0; i < kCount; i++)
        {
            var guid = Guid.NewGuid();

            // The encoding should be all upper case. A reliance 
            // on mixed case will make the generated string 
            // vulnerable to sql collation.
            var encoded = Ascii85.Encode(guid);

            Assert.AreEqual(
                encoded, 
                encoded.ToUpper(), 
                "The Ascii-85 encoding should produce only uppercase characters.");
        }
    }
}

I hope this saves somebody some trouble.

Also, if you find any bugs then let me know ;-)

Fantoccini answered 18/11, 2010 at 2:21 Comment(2)
Beware, the code returns characters such as ¾ which are above 127, and not scrictly ASCII. Depending on your encoding (e.g. UTF-8), you might get escape sequences. In an UTF-8 encoded string, every character code in the 128-255 range uses 2 bytes.Benbow
Honestly, why Ascii85? Just because it seems 'standardized'? Might as well use base95. Neither are URL-safe or HTML-safe.Liaoning
E
23

Use Base 85. See section 4.1. Why 85? of A Compact Representation of IPv6 Addresses

An IPv6 address, like a GUID is made up of eight 16-bit pieces.

Espousal answered 13/5, 2010 at 14:55 Comment(6)
Looks good, but the characters '%' and '?' are not suitable if the encoding is to be used in database queries. We replace them with ':' and '.', because we do not care about IPv6 formatting issues.Raleighraley
@Mark I could see how this might be a problem, but in many cases if you were storing this in a column and doing TableName.Guid85Encoded like '%blah%' then it won't be a problem because % and ? are on the left hand side of the like operator.Shammer
@CMCDragonkai As the linked RFC states: "any value between 85 and 94 inclusive", but "Selecting 85 allows the use of the smallest possible subset of the ASCII characters". 91 adds complexity without improving efficiency.Espousal
Check this out: iiis.org/CDs2010/CD2010SCI/CCCT_2010/… Base91 has a higher encoding efficiency than Base85/64, and higher encoding rate than Base85. Besides, Base91 provides compatibility with any bit-length input sequence without additional filling declaration except for his codeword self. One can use Base91 as a substitute for Base85 and Base64 to get some benefits in restricted situations.Bulwerlytton
Haha. Good one!Harewood
In case this is not evident: the linked RFC is a 1st April joke. Nobody has seriously proposed to represent IPv6 adresses in Base85. Practically speaking, the drawbacks of having a non-standard representation should outweight the advantage of few saved characters (as compared with say Base64) in 99% of practical cases. If one is extremely interested in saving space at all costs, then just store in binary (16 bytes) and convert to/from text when reading/writing.Fourthclass
C
14

You have 95 characters available -- so, more than 6 bits, but not quite as many as 7 (about 6.57 actually). You could use 128/log2(95) = about 19.48 characters, to encode into 20 characters. If saving 2 characters in the encoded form is worth the loss of readability to you, something like (pseudocode):

char encoded[21];
long long guid;    // 128 bits number

for(int i=0; i<20; ++i) {
  encoded[i] = chr(guid % 95 + 33);
  guid /= 95;
}
encoded[20] = chr(0);

which is basically the generic "encode a number in some base" code, except that there's no need to reverse the "digits" since the order's arbitrary anyway (and little-endian is more direct and natural). To get back the guid from the encoded string is, in a very similar way, the polynomial computation in base 95 (after subtracting 33 from each digit of course):

guid = 0;

for(int i=0; i<20; ++i) {
  guid *= 95;
  guid += ord(encoded[i]) - 33;
}

essentially using Horner's approach to polynomial evaluation.

Cadaver answered 13/5, 2010 at 14:46 Comment(1)
You've a bug in your decoding snippet, it actually does need to iterate over the string in reverse. for(int i = 20; i > 0; --i) { guid *= 95; guid += ord(encoded[i-1]) - 33; }Tifanytiff
F
5

Simply go Base64.

Fourthclass answered 13/5, 2010 at 14:39 Comment(3)
I prefer this because it is much easier to find implementations of base64 encoding than base85, and base85 only nets you 2 more characters saved. If your system that stores these guids in base85 ever becomes a legacy system and someone has to port/export these things into another system with another programming language and decode the GUIDs, then base85 encoding might be a hurdle given that implementations are not as widely available.Shammer
There are actually 66 unreserved URL characters (see tools.ietf.org/html/rfc3986#section-2.3) so you can write a custom base-66 encoder that gets slightly better than naive Base64.Sexism
Base-66 encoder available here: https://mcmap.net/q/100621/-how-to-convert-an-integer-to-the-shortest-url-safe-string-in-pythonEnface
R
3

Using the full range from 33 (what's wrong wirh space, incidentally?) to 127 gives you 95 possible characters. Expressing the 2^128 possible values of guid in base 95 will use 20 characters. This (modulo things like dropping nybbles that will be constant) is the best you can do. Save yourself the trouble - use base 64.

Ramification answered 13/5, 2010 at 14:44 Comment(2)
128 bits in base95 is 19.5 digits to be more accurate. If you consider the length free, you'll get that on average if omit most significant zeros.Claymore
The exclusion of space is likely because it is has no visible character, and is considered whitespace, which in some cases (often leading or trailing spaces), are stripped by both human and computer. Forgive the almost 12 year bump, but no one answered you thus far and it is worth a comment. Also, base95 is not URL or HTML safe, so some characters require escaping, which can expand string length under that restriction.Liaoning
B
0

Assuming that all of your GUIDs are being generated by the same algorithm, you can save 4 bits by not encoding the algorithm nibble, before applying any other encoding :-|

Bradfordbradlee answered 13/5, 2010 at 14:35 Comment(0)
J
0

An arbitrary GUID? The "naive" algorithm will produce optimal results. The only way to compress a GUID further is to make use of patterns in the data excluded by your "arbitrary" constraint.

Junina answered 13/5, 2010 at 14:45 Comment(2)
Did you read the question? He's asking how to encode it more efficiently, not how to compress it.Defant
Do you understand that encoding 128 bits more efficiently than 22 6-bit characters requires some form of compression and some pattern in the input?Junina
M
0

I agree with the Base64 approach. It will cut back a 32-letter UUID to 22-letter Base64.

Here are simple Hex <-> Base64 converting functions for PHP:

function hex_to_base64($hex){
  $return = '';
  foreach(str_split($hex, 2) as $pair){
    $return .= chr(hexdec($pair));
  }
  return preg_replace("/=+$/", "", base64_encode($return)); // remove the trailing = sign, not needed for decoding in PHP.
}

function base64_to_hex($base64) {
  $return = '';
  foreach (str_split(base64_decode($base64), 1) as $char) {
      $return .= str_pad(dechex(ord($char)), 2, "0", STR_PAD_LEFT);
  }
  return $return;
}
Mallarme answered 27/12, 2019 at 21:3 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.