How to convert an integer to the shortest url-safe string in Python?
Asked Answered
N

15

69

I want the shortest possible way of representing an integer in a URL. For example, 11234 can be shortened to '2be2' using hexadecimal. Since base64 uses is a 64 character encoding, it should be possible to represent an integer in base64 using even less characters than hexadecimal. The problem is I can't figure out the cleanest way to convert an integer to base64 (and back again) using Python.

The base64 module has methods for dealing with bytestrings - so maybe one solution would be to convert an integer to its binary representation as a Python string... but I'm not sure how to do that either.

Newfangled answered 18/2, 2009 at 15:25 Comment(4)
Simon: please look at Øystein krog's answer. You want to use a "base 64" representation of your integer data, and NOT the base64 module, which is meant to encode arbitrary binary data and doesn't compress the text representation of numbers. See en.wikipedia.org/wiki/Base_64 )Moderato
I was hoping it was possible to reuse the existing base64 module for part of the work, but sadly it looks like that's not the case. Thanks everyone for all of the excellent responses.Newfangled
For anyone who's interested, I ended up rolling my own code for doing this: djangosnippets.org/snippets/1431Newfangled
After reading Ricardo's comment about Øystein Krog's answers (which didnt have any code), i wrote some very basic Python right at the bottom with 0 votes :PRight
I
64

This answer is similar in spirit to Douglas Leeder's, with the following changes:

  • It doesn't use actual Base64, so there's no padding characters
  • Instead of converting the number first to a byte-string (base 256), it converts it directly to base 64, which has the advantage of letting you represent negative numbers using a sign character.

    import string
    ALPHABET = string.ascii_uppercase + string.ascii_lowercase + \
               string.digits + '-_'
    ALPHABET_REVERSE = dict((c, i) for (i, c) in enumerate(ALPHABET))
    BASE = len(ALPHABET)
    SIGN_CHARACTER = '$'
    
    def num_encode(n):
        if n < 0:
            return SIGN_CHARACTER + num_encode(-n)
        s = []
        while True:
            n, r = divmod(n, BASE)
            s.append(ALPHABET[r])
            if n == 0: break
        return ''.join(reversed(s))
    
    def num_decode(s):
        if s[0] == SIGN_CHARACTER:
            return -num_decode(s[1:])
        n = 0
        for c in s:
            n = n * BASE + ALPHABET_REVERSE[c]
        return n
    

    >>> num_encode(0)
    'A'
    >>> num_encode(64)
    'BA'
    >>> num_encode(-(64**5-1))
    '$_____'

A few side notes:

  • You could (marginally) increase the human-readibility of the base-64 numbers by putting string.digits first in the alphabet (and making the sign character '-'); I chose the order that I did based on Python's urlsafe_b64encode.
  • If you're encoding a lot of negative numbers, you could increase the efficiency by using a sign bit or one's/two's complement instead of a sign character.
  • You should be able to easily adapt this code to different bases by changing the alphabet, either to restrict it to only alphanumeric characters or to add additional "URL-safe" characters.
  • I would recommend against using a representation other than base 10 in URIs in most cases—it adds complexity and makes debugging harder without significant savings compared to the overhead of HTTP—unless you're going for something TinyURL-esque.
Imposture answered 18/2, 2009 at 16:0 Comment(7)
Voted up to have thought about negative numbers. But isn’t one byte for the sign a bit expensive?Christianchristiana
Yes, it is, which I addressed somewhat in my second note; but if that's not a concern, the implementation using a sign character was the simplest ;)Imposture
The initial place I want to use this is "recover your account" style URLs which include a user ID, a timestamp and an sha1 hash - and should ideally be less than 80 characters to ensure they can be safely e-mailed without text wrapping screwing them up.Newfangled
That's really good code but, according to Alex Martelli (https://mcmap.net/q/45142/-how-do-i-reverse-a-string-in-python/…), s[::-1] would be a faster way to reverse a stringDonia
@hwiechers: s isn't actually a string, it's a list, so I still have to join it; I could do ''.join(s[::-1]) or ''.join(s)[::-1], but those are only somewhat faster—far less than the order of magnitude seen in telliott99's microbenchmark for reversing a string.Imposture
Remove ":" at line 11 of code, and it's work without Syntax error :DSesqui
num_encode(64) should map to 'AA'. After all, in decimal, 9+1 != 20Dextrosinistral
B
20

All the answers given regarding Base64 are very reasonable solutions. But they're technically incorrect. To convert an integer to the shortest URL safe string possible, what you want is base 66 (there are 66 URL safe characters).

That code looks something like this:

from io import StringIO
import urllib

BASE66_ALPHABET = u"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_.~"
BASE = len(BASE66_ALPHABET)

def hexahexacontadecimal_encode_int(n):
    if n == 0:
        return BASE66_ALPHABET[0].encode('ascii')

    r = StringIO()
    while n:
        n, t = divmod(n, BASE)
        r.write(BASE66_ALPHABET[t])
    return r.getvalue().encode('ascii')[::-1]

Here's a complete implementation of a scheme like this, ready to go as a pip installable package:

https://github.com/aljungberg/hhc

Belia answered 1/8, 2013 at 18:8 Comment(4)
~ is considered as unsafe in RFC 1738: Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`". — found on tantek.pbworks.com/w/page/24308279/NewBase64Aerialist
That's interesting. RFC 3986 on URIs is newer though and seems to partially obsolete RFC 1738. On a more practical note, ~ is used in URLs all the time. E.g. consider example.com/~user/, a classic URL going back to very early web days.Belia
jkorpela.fi/tilde.html states a couple of reasons not to use tilde in URLs mostly centered on readability. But base64 is not really supposed to be human-readable. Personally I think artificial limits for "compatibility" reasons is nonsense. For example, when searching Google, Firefox doesn't escape !\"'()*-.<>[\\]^_`{|}~+, while Chrome allows just "*-.<>_~, and then Non-ASCII/UTF-8 characters: ¡¢£¤¥¦§¨©ª«¬ are all sent in the clear, no percent-encoding needed.Unsavory
Yes, I think with or without tilde, encoded long numbers are not particularly "readable" anyhow. Good point about "*-.<>_~. Would require more research to ensure that all browsers are OK with these.Belia
M
15

You probably do not want real base64 encoding for this - it will add padding etc, potentially even resulting in larger strings than hex would for small numbers. If there's no need to interoperate with anything else, just use your own encoding. Eg. here's a function that will encode to any base (note the digits are actually stored least-significant first to avoid extra reverse() calls:

def make_encoder(baseString):
    size = len(baseString)
    d = dict((ch, i) for (i, ch) in enumerate(baseString)) # Map from char -> value
    if len(d) != size:
        raise Exception("Duplicate characters in encoding string")

    def encode(x):
        if x==0: return baseString[0]  # Only needed if don't want '' for 0
        l=[]
        while x>0:
            l.append(baseString[x % size])
            x //= size
        return ''.join(l)

    def decode(s):
        return sum(d[ch] * size**i for (i,ch) in enumerate(s))

    return encode, decode

# Base 64 version:
encode,decode = make_encoder("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")

assert decode(encode(435346456456)) == 435346456456

This has the advantage that you can use whatever base you want, just by adding appropriate characters to the encoder's base string.

Note that the gains for larger bases are not going to be that big however. base 64 will only reduce the size to 2/3rds of base 16 (6 bits/char instead of 4). Each doubling only adds one more bit per character. Unless you've a real need to compact things, just using hex will probably be the simplest and fastest option.

Mefford answered 18/2, 2009 at 16:40 Comment(0)
C
9

To encode n:

data = ''
while n > 0:
    data = chr(n & 255) + data
    n = n >> 8
encoded = base64.urlsafe_b64encode(data).rstrip('=')

To decode s:

data = base64.urlsafe_b64decode(s + '===')
decoded = 0
while len(data) > 0:
    decoded = (decoded << 8) | ord(data[0])
    data = data[1:]

In the same spirit as other for some “optimal” encoding, you can use 73 characters according to RFC 1738 (actually 74 if you count “+” as usable):

alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_`\"!$'()*,-."
encoded = ''
while n > 0:
    n, r = divmod(n, len(alphabet))
    encoded = alphabet[r] + encoded

and the decoding:

decoded = 0
while len(s) > 0:
    decoded = decoded * len(alphabet) + alphabet.find(s[0])
    s = s[1:]
Christianchristiana answered 18/2, 2009 at 16:22 Comment(1)
I adapted this answer for my answer to the question How to make unique short URL with Python?.Kittrell
D
8

The easy bit is converting the byte string to web-safe base64:

import base64
output = base64.urlsafe_b64encode(s)

The tricky bit is the first step - convert the integer to a byte string.

If your integers are small you're better off hex encoding them - see saua

Otherwise (hacky recursive version):

def convertIntToByteString(i):
    if i == 0:
        return ""
    else:
        return convertIntToByteString(i >> 8) + chr(i & 255)
Defraud answered 18/2, 2009 at 15:48 Comment(0)
L
8

You don't want base64 encoding, you want to represent a base 10 numeral in numeral base X.

If you want your base 10 numeral represented in the 26 letters available you could use: http://en.wikipedia.org/wiki/Hexavigesimal. (You can extend that example for a much larger base by using all the legal url characters)

You should atleast be able to get base 38 (26 letters, 10 numbers, +, _)

Linehan answered 18/2, 2009 at 15:48 Comment(1)
You are correct, but he can still use base 64 by using digits, lowercase, uppercase, and -_.Downstairs
O
4

Base64 takes 4 bytes/characters to encode 3 bytes and can only encode multiples of 3 bytes (and adds padding otherwise).

So representing 4 bytes (your average int) in Base64 would take 8 bytes. Encoding the same 4 bytes in hex would also take 8 bytes. So you wouldn't gain anything for a single int.

Olfactory answered 18/2, 2009 at 15:33 Comment(3)
@saua: You forget that each digit only encodes ~3.3 bits while each character of base64 encodes 6, ergo representing an integer in base64 (instead of base 10) will result in a string approx half as long.Downstairs
@Mike I discussed the length of hex (base-16) encoding vs. base64, and due to the padding the length is the same for 4 bytes of data. Of course this changes for longer strings, but the question is explicitely about encoding an int.Olfactory
@saua: But you don't nessesarily have an int that requires 4 whole bytes. Decimal 1 can still be B64 1, and then decimal 64 can be B64 10.Downstairs
V
3

a little hacky, but it works:

def b64num(num_to_encode):
  h = hex(num_to_encode)[2:]     # hex(n) returns 0xhh, strip off the 0x
  h = len(h) & 1 and '0'+h or h  # if odd number of digits, prepend '0' which hex codec requires
  return h.decode('hex').encode('base64') 

you could replace the call to .encode('base64') with something in the base64 module, such as urlsafe_b64encode()

Valera answered 18/2, 2009 at 16:19 Comment(2)
I tried that with 12345. It gave me: 'MDk=\n' That seems to have converted a 5-digit integer into a length 5 string. I can think of easier ways to achieve that :-)Micra
the = and the \n are padding that you can strip offChromosome
G
2

If you are looking for a way to shorten the integer representation using base64, I think you need to look elsewhere. When you encode something with base64 it doesn't get shorter, in fact it gets longer.

E.g. 11234 encoded with base64 would yield MTEyMzQ=

When using base64 you have overlooked the fact that you are not converting just the digits (0-9) to a 64 character encoding. You are converting 3 bytes into 4 bytes so you are guaranteed your base64 encoded string would be 33.33% longer.

Gastrocnemius answered 18/2, 2009 at 15:35 Comment(3)
The first step is to convert the integer to a byte string.Defraud
You are correct if you are encoding a string representation of a decimal number into base 64, but not if you want to encode the number itself into base 64. Each decimal digit encodes ~3.3 bits of info, while each char of base 64 encodes 6 bits of info. Ergo the base64 number will be shorter.Downstairs
"base 64" could mean two different things: "Base64 encoding" and numbers represented in base 64. "\x01".encode("base64") => 'AQ==', whereas 1 represented in base 64 is just 1.Wernerwernerite
D
2

I maintain a little library named zbase62: http://pypi.python.org/pypi/zbase62

With it you can convert from a Python 2 str object to a base-62 encoded string and vice versa:

Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) 
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> d = os.urandom(32)
>>> d
'C$\x8f\xf9\x92NV\x97\x13H\xc7F\x0c\x0f\x8d9}\xf5.u\xeeOr\xc2V\x92f\x1b=:\xc3\xbc'
>>> from zbase62 import zbase62
>>> encoded = zbase62.b2a(d)
>>> encoded
'Fv8kTvGhIrJvqQ2oTojUGlaVIxFE1b6BCLpH8JfYNRs'
>>> zbase62.a2b(encoded)
'C$\x8f\xf9\x92NV\x97\x13H\xc7F\x0c\x0f\x8d9}\xf5.u\xeeOr\xc2V\x92f\x1b=:\xc3\xbc'

However, you still need to convert from integer to str. This comes built-in to Python 3:

Python 3.2 (r32:88445, Mar 25 2011, 19:56:22)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> d = os.urandom(32)
>>> d
b'\xe4\x0b\x94|\xb6o\x08\xe9oR\x1f\xaa\xa8\xe8qS3\x86\x82\t\x15\xf2"\x1dL%?\xda\xcc3\xe3\xba'
>>> int.from_bytes(d, 'big')
103147789615402524662804907510279354159900773934860106838120923694590497907642
>>> x= _ 
>>> x.to_bytes(32, 'big')
b'\xe4\x0b\x94|\xb6o\x08\xe9oR\x1f\xaa\xa8\xe8qS3\x86\x82\t\x15\xf2"\x1dL%?\xda\xcc3\xe3\xba'

To convert from int to bytes and vice versa in Python 2, there is not a convenient, standard way as far as I know. I guess maybe I should copy some implementation, such as this one: https://github.com/warner/foolscap/blob/46e3a041167950fa93e48f65dcf106a576ed110e/foolscap/banana.py#L41 into zbase62 for your convenience.

Descriptive answered 6/7, 2011 at 20:8 Comment(0)
F
2

I needed a signed integer, so I ended up going with:

import struct, base64

def b64encode_integer(i):
   return base64.urlsafe_b64encode(struct.pack('i', i)).rstrip('=\n')

Example:

>>> b64encode_integer(1)
'AQAAAA'
>>> b64encode_integer(-1)
'_____w'
>>> b64encode_integer(256)
'AAEAAA'
Florid answered 10/8, 2011 at 13:23 Comment(0)
U
2

I'm working on making a pip package for this.

I recommend you use my bases.py https://github.com/kamijoutouma/bases.py which was inspired by bases.js

from bases import Bases
bases = Bases()

bases.toBase16(200)                // => 'c8'
bases.toBase(200, 16)              // => 'c8'
bases.toBase62(99999)              // => 'q0T'
bases.toBase(200, 62)              // => 'q0T'
bases.toAlphabet(300, 'aAbBcC')    // => 'Abba'

bases.fromBase16('c8')               // => 200
bases.fromBase('c8', 16)             // => 200
bases.fromBase62('q0T')              // => 99999
bases.fromBase('q0T', 62)            // => 99999
bases.fromAlphabet('Abba', 'aAbBcC') // => 300

refer to https://github.com/kamijoutouma/bases.py#known-basesalphabets for what bases are usable

For your case

I recommend you use either base 32, 58 or 64

Base-64 warning: besides there being several different standards, padding isn't currently added and line lengths aren't tracked. Not recommended for use with APIs that expect formal base-64 strings!

Same goes for base 66 which is currently not supported by both bases.js and bases.py but it might in the future

Undermine answered 27/5, 2015 at 10:37 Comment(0)
M
1

I'd go the 'encode integer as binary string, then base64 encode that' method you suggest, and I'd do it using struct:

>>> import struct, base64
>>> base64.b64encode(struct.pack('l', 47))
'LwAAAA=='
>>> struct.unpack('l', base64.b64decode(_))
(47,)

Edit again: To strip out the extra 0s on numbers that are too small to need full 32-bit precision, try this:

def pad(str, l=4):
    while len(str) < l:
        str = '\x00' + str
    return str

>>> base64.b64encode(struct.pack('!l', 47).replace('\x00', ''))
'Lw=='
>>> struct.unpack('!l', pad(base64.b64decode('Lw==')))
(47,)
Microsurgery answered 18/2, 2009 at 15:27 Comment(1)
@Jorenko: This is far from the most efficient. 47 in base 64 can be represented by a single character (as 47 is less than 64.)Downstairs
R
1

Pure python, no dependancies, no encoding of byte strings etc. , just turning a base 10 int into base 64 int with the correct RFC 4648 characters:

def tetrasexagesimal(number):
    out=""
    while number>=0:
        if number == 0:
            out = 'A' + out
            break
        digit = number % 64
        out = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"[digit] + out
        number /= 64 # //= 64 for py3 (thank spanishgum!)
        if number == 0:
            break
    return out

tetrasexagesimal(1)
Right answered 6/8, 2015 at 2:14 Comment(1)
python3: change number /= 64 to number //= 64Vespine
P
0

As it was mentioned here in comments you can encode a data using 73 characters that are not escaped in URL. I found two places were this Base73 URL encoding is used:

But in fact you may use more characters like /, [, ], :, ; and some others. Those characters are escaped only when you doing encodeURIComponent i.e. you need to pass data via get parameter.

So in fact you can use up to 82 characters. The full alphabet is !$&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz~. I sorted all the symbols by their code so when Base82URL numbers are sorted as plain strings they are keep the same order.

I tested in Chrome and Firefox and they are works fine but may be confusing for regular users. But I used such ids for an internal API calls where nobody sees them.

Unsigned integer 32 bit may have a maximum value of 2^32=4294967296 And after encoding to the Base82 it will take 6 chars: $0~]mx.

I don't have a code in Python but here is a JS code that generates a random id (int32 unsigned) and encodes it into the Base82URL:

        /**
         * Convert uint32 number to Base82 url safe
         * @param {int} number
         * @returns {string}
         */
        function toBase82Url(number) {
            // all chars that are not escaped in url
            let keys = "!$&'()*+,-./0123456789:;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz~"
            let radix = keys.length
            let encoded = []
            do {
                let index = number% radix
                encoded.unshift(keys.charAt(index))
                number = Math.trunc(number / radix)
            } while (number !== 0)
            return encoded .join("")
        }

        function generateToken() {
            let buf = new Uint32Array(1);
            window.crypto.getRandomValues(buf)
            var randomInt = buf[0]
            return toBase82Url(randomInt)
        }
Patrol answered 20/8, 2021 at 22:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.