Python: Compactly and reversibly encode large integer as base64 or base16 having variable or fixed length
Asked Answered
B

3

1

I want to compactly encode a large unsigned or signed integer having an arbitrary number of bits into a base64, base32, or base16 (hexadecimal) representation. The output will ultimately be used as a string which will be used as a filename, but this should be beside the point. I am using the latest Python 3.

This works but is far from compact:

>>> import base64, sys
>>> i: int = 2**62 - 3  # Can be signed or unsigned.
>>> b64: bytes =  base64.b64encode(str(i).encode()) # Not a compact encoding.
>>> len(b64), sys.getsizeof(b64)
(28, 61)

There is a prior question, now closed, the answers for which strictly concern with inefficient representations. Note again that we don't want to use any strings or needlessly long sequences of bytes in this exercise. As such, this question is not a duplicate of that question.

Bobker answered 11/1, 2019 at 19:19 Comment(6)
What is the goal of the base64 encoding here? What kind of data is being stored, are you encoding to make transfer easy, what kind of clients need to read this data again?Gel
I'm asking for context because it is possible, almost likely, that there are better options available. base64 is not a common encoding for individual integer values.Gel
@A-B-B: if the goal is to embed the value, compactly, into a string, then there are more options available. Is base85 acceptable? What restrictions do you place on the value, characterwise?Gel
Next question: what filesystem and OS are you targeting? Some filesystems are case-insensitive (even though they may preserve case), but base64 is case sensitive. You can trivially create pairs of base64 strings that will map to the same file on such a filesystem.Gel
@A-B-B; base85 is not safe for filenames, so no, that'd not be a better answer at this point.Gel
the goal is to encode an int as a string efficiently ans using as many builtins as possible and have it not be limited by the size of the string. the best way to do this is to a) convert the int.to_bytes, then convert the bytes to a string. selecting the base64 "safe" versions will allow embedding in filenames and urls.Baugher
B
4

This answer is motivated in part by disparate comments by Erik A., such as for this answer. The integer is first compactly converted to bytes, following which the bytes are encoded to a variable base.

from typing import Callable, Optional
import base64

class IntBaseEncoder:
    """Reversibly encode an unsigned or signed integer into a customizable encoding of a variable or fixed length."""
    # Ref: https://stackoverflow.com/a/54152763/
    def __init__(self, encoding: str, *, bits: Optional[int] = None, signed: bool = False):
        """
        :param encoder: Name of encoding from base64 module, e.g. b64, urlsafe_b64, b32, b16, etc.
        :param bits: Max bit length of int which is to be encoded. If specified, the encoding is of a fixed length,
        otherwise of a variable length.
        :param signed: If True, integers are considered signed, otherwise unsigned.
        """
        self._decoder: Callable[[bytes], bytes] = getattr(base64, f'{encoding}decode')
        self._encoder: Callable[[bytes], bytes] = getattr(base64, f'{encoding}encode')
        self.signed: bool = signed
        self.bytes_length: Optional[int] = bits and self._bytes_length(2 ** bits - 1)

    def _bytes_length(self, i: int) -> int:
        return (i.bit_length() + 7 + self.signed) // 8

    def encode(self, i: int) -> bytes:
        length = self.bytes_length or self._bytes_length(i)
        i_bytes = i.to_bytes(length, byteorder='big', signed=self.signed)
        return self._encoder(i_bytes)

    def decode(self, b64: bytes) -> int:
        i_bytes = self._decoder(b64)
        return int.from_bytes(i_bytes, byteorder='big', signed=self.signed)

# Tests:
import unittest

class TestIntBaseEncoder(unittest.TestCase):

    ENCODINGS = ('b85', 'b64', 'urlsafe_b64', 'b32', 'b16')

    def test_unsigned_with_variable_length(self):
        for encoding in self.ENCODINGS:
            encoder = IntBaseEncoder(encoding)
            previous_length = 0
            for i in range(1234):
                encoded = encoder.encode(i)
                self.assertGreaterEqual(len(encoded), previous_length)
                self.assertEqual(i, encoder.decode(encoded))

    def test_signed_with_variable_length(self):
        for encoding in self.ENCODINGS:
            encoder = IntBaseEncoder(encoding, signed=True)
            previous_length = 0
            for i in range(-1234, 1234):
                encoded = encoder.encode(i)
                self.assertGreaterEqual(len(encoded), previous_length)
                self.assertEqual(i, encoder.decode(encoded))

    def test_unsigned_with_fixed_length(self):
        for encoding in self.ENCODINGS:
            for maxint in range(257):
                encoder = IntBaseEncoder(encoding, bits=maxint.bit_length())
                maxlen = len(encoder.encode(maxint))
                for i in range(maxint + 1):
                    encoded = encoder.encode(i)
                    self.assertEqual(len(encoded), maxlen)
                    self.assertEqual(i, encoder.decode(encoded))

    def test_signed_with_fixed_length(self):
        for encoding in self.ENCODINGS:
            for maxint in range(257):
                encoder = IntBaseEncoder(encoding, bits=maxint.bit_length(), signed=True)
                maxlen = len(encoder.encode(maxint))
                for i in range(-maxint, maxint + 1):
                    encoded = encoder.encode(i)
                    self.assertEqual(len(encoded), maxlen)
                    self.assertEqual(i, encoder.decode(encoded))

if __name__ == '__main__':
    unittest.main()

If using the output as a filename, initializing the encoder with the encoding 'urlsafe_b64' or even 'b16' are safer choices.

Usage examples:

# Variable length encoding
>>> encoder = IntBaseEncoder('urlsafe_b64')
>>> encoder.encode(12345)
b'MDk='
>>> encoder.decode(_)
12345

# Fixed length encoding
>>> encoder = IntBaseEncoder('b16', bits=32)
>>> encoder.encode(12345)
b'00003039'
>>> encoder.encode(123456789)
b'075BCD15'
>>> encoder.decode(_)
123456789

# Signed
encoder = IntBaseEncoder('b32', signed=True)
encoder.encode(-12345)
b'Z7DQ===='
encoder.decode(_)
-12345
Bobker answered 11/1, 2019 at 19:19 Comment(0)
S
3

The following snippet from this answer should suit your needs, with the advantage of having no dependencies:

def v2r(n, base): # value to representation
    """
    Convert a positive integer to its string representation in a custom base.
    
    :param n: the numeric value to be represented by the custom base
    :param base: the custom base defined as a string of characters, used as symbols of the base
    :returns: the string representation of natural number n in the custom base
    """
    if n == 0: return base[0]
    b = len(base)
    digits = ''
    while n > 0:
        digits = base[n % b] + digits
        n  = n // b
    return digits

It does not performs directly a typical base64 conversion (although it can be used to obtain it) but the outcome is similar, since it returns the representation of a big integer (only positive, but you can easily overcome such limitation) in a custom-length numeric base made of custom symbols.

Some examples show better than any word its simple and versatile usage:

# base64 filename-safe characters
# perform a base64 conversion if applied to multiples of 3-bytes chunks
>>> v2r(4276803,'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_')
'QUJD'

# hexadecimal base
>>> v2r(123456789,'0123456789ABCDEF')
'75BCD15'
>>> v2r(255,'0123456789ABCDEF')
'FF'

# custom base of 62 filename-safe characters
>>> v2r(123456789,'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
'8m0Kx'

# custom base of 36 filename-safe lowercase characters for case insensitive file systems
>>> v2r(123456789,'0123456789abcdefghijklmnopqrstuvwxyz')
'21i3v9'

# binary conversion
>>> v2r(123456789,'01')
'111010110111100110100010101'
>>> v2r(255,'01')
'11111111'
Subtitle answered 7/3, 2021 at 2:14 Comment(0)
S
-1
  1. Determine the byte count you want/expect:

    If your data is 0 to 255 use 1

    If your is 0 to 65535 use 2

    int(log base 256(x)) + 1

  2. Convert your integer to a bytearrray using to_bytes(num_bytes)

  3. Decode your bytearray as a Base64 string

  4. Convert to an ascii string if desired

Example:

import datetime
import base64

my_integer = int(datetime.datetime.now().timestamp())
container_size = 4 # The data size of the timestamp data 
# If you don't know the output container size, you will have to compute it 


# As a one liner:
base64.b64encode(my_integer).to_bytes(container_size)).decode('ascii')

# More clearly

my_bytearray = my_integer.to_bytes(container_size)
my_64_str = base64.b64encode(my_bytearray)

print(bin(my_integer)))
print my_integer
print(my_64_str)  # b64 string
print(my_64_str.decode('ascii')  # ascii string


Output:

0b1100101011101111001100111010111
1702336983
b'ZXeZ1w=='
ZXeZ1w==

I know that my data

  1. will fit into a 4 byte packet
  2. will always be positive

If negative integers are expected, they could be represented using 2's complement or you could add a sign bit to a known position in the array.

Stamp answered 11/12, 2023 at 23:51 Comment(3)
Please edit to provide reusable encode and decode typed functions if at all possible, with everything considered, and with the necessary assumptions as assertions. Also, certainly keep the example usage totally separate from the implementation.Bobker
@asclepius Perhaps I should clarify: base64 can already encode arbitrary bits to a string compactly. Just don't convert to a string first. There is no need to wrap base64 in a class to accomplish one line tasks.Stamp
No one asked you to write a class. You were asked to provide functions, only so your implementation can be assessed without any ambiguity. You're mixing implementation and testing code, and your one-liner implementation of base64.b64encode(my_integer).to_bytes(container_size)).decode('ascii') doesn't even run. Furthermore, you've basically attempted to implement just a portion of the functionality that is already in the accepted answer, with nothing unique of relevance to add.Bobker

© 2022 - 2024 — McMap. All rights reserved.