safe enough 8-character short unique random string
Asked Answered
O

8

75

I am trying to compute 8-character short unique random filenames for, let's say, thousands of files without probable name collision. Is this method safe enough?

base64.urlsafe_b64encode(hashlib.md5(os.urandom(128)).digest())[:8]

Edit

To be clearer, I am trying to achieve simplest possible obfuscation of filenames being uploaded to a storage.

I figured out that 8-character string, random enough, would be very efficient and simple way to store tens of thousands of files without probable collision, when implemented right. I don't need guaranteed uniqueness, only high-enough improbability of name collision (talking about only thousands of names).

Files are being stored in concurrent environment, so incrementing shared counter is achievable, but complicated. Storing counter in database would be inefficient.

I am also facing the fact that random() under some circumstances returns same pseudorandom sequences in different processes.

Osmose answered 21/11, 2012 at 0:52 Comment(9)
By "safe enough", do you mean "I don't have to handle collisions at all", or "collisions will be uncommon enough that it doesn't matter if I handle them inefficiently"?Threadfin
You probably want temp files or generate the names by incrementing a counterMisanthrope
Stop using md5! Its a broken hash function and makes the output LESS RANDOM because it has problems with its PRNG output.Stellastellar
@Rook: +1. People generally use it because "it's so much faster than everything else in the library, and I don't really need secure hashes…" but of course if you don't really need secure hashes, you don't need MD5 either, so why not just do urlsafe_b64encode(os.urandom(6)) in the first place?Threadfin
@Rook Using MD5 here, while pointless, is not going to increase the collision rate.I
@NickJohnson thanks for pointing that out, I don't know where I got that idea.Osmose
@Threadfin by "safe enough" I mean "I don't have to handle collisions at all", because it's near to impossible.Osmose
filename = String.Format("zhry{0:0000}", counter++);Bucephalus
@PhilGan: What language are you writing in there?Threadfin
T
29

Is there a reason you can't use tempfile to generate the names?

Functions like mkstemp and NamedTemporaryFile are absolutely guaranteed to give you unique names; nothing based on random bytes is going to give you that.

If for some reason you don't actually want the file created yet (e.g., you're generating filenames to be used on some remote server or something), you can't be perfectly safe, but mktemp is still safer than random names.

Or just keep a 48-bit counter stored in some "global enough" location, so you guarantee going through the full cycle of names before a collision, and you also guarantee knowing when a collision is going to happen.

They're all safer, and simpler, and much more efficient than reading urandom and doing an md5.

If you really do want to generate random names, ''.join(random.choice(my_charset) for _ in range(8)) is also going to be simpler than what you're doing, and more efficient. Even urlsafe_b64encode(os.urandom(6)) is just as random as the MD5 hash, and simpler and more efficient.

The only benefit of the cryptographic randomness and/or cryptographic hash function is in avoiding predictability. If that's not an issue for you, why pay for it? And if you do need to avoid predictability, you almost certainly need to avoid races and other much simpler attacks, so avoiding mkstemp or NamedTemporaryFile is a very bad idea.

Not to mention that, as Root points out in a comment, if you need security, MD5 doesn't actually provide it.

Threadfin answered 21/11, 2012 at 1:14 Comment(11)
I am renaming files being uploaded to the same place, so there is no use of tempfile. Uuid4 is unique enough for anything, but too long. I'm pretty sure 8 chars should be enough to store thousands of names without collision, but I am not sure how to do that. By hashing urandom, I was trying to achieve better randomness.Osmose
@Osmose A UUID is 'too long' by what standard?I
@NickJohnson it's too long to be pretty:) Long filenames look ugly in listings. As I wrote before, 8 chars should by enough to store tens of hundreds random unique names without probable collision, I just don't know what's the best way to increase the improbability.Osmose
@Osmose If you want short, you should use a counter. By the birthday paradox, a randomly generated ID has to be at least twice as long as a sequential ID, given the same number of identifiers.I
according to my testing, ''.join([random.choice(string.ascii_letters+string.digits+'-_') for ch in range(8)]) gives no collision in millions of strings and is also far more efficient than encoding urandom(). I decided to go this way.Osmose
@zahory: tempfile.mktemp can still generate guaranteed-unique temporary filenames without creating the files. Sure, it isn't as secure as mkstemp, but the same is true for a random generator, and of the two, mktemp is absolutely safe instead of just "pretty safe, hopefully safe enough" from non-malicious collisions. It's also simpler and probably more efficient. So I'd still look into whether you can use it, or something similar. I'd also consider the counter before going to random as a last-ditch fallback.Threadfin
@zahory: You don't need the brackets you added to my answer unless you're using Python 2.3 or earlier. (random.choice(my_charset) for ch in range(8)) is a generator expression, which is a perfectly valid sequence to pass to join. Turning it into a list comprehension forces Python to build a list out of the expression (which isn't going to hurt too badly when you've only got 8 elements, but still, why add even a trivial bit of extra complexity for both yourself and the interpreter if you don't need it?).Threadfin
@Threadfin interestigly enough, you are right with the redundant brackets! I understand the syntax of generator expressions, but havent seen it inside a function call, where the parantheses are for... calling the function. You are actually omitting the syntax needed to construct a generator, and it works.Osmose
@Threadfin thank you for the idea of mktemp. I will have to investigate how efficient it is and what is their method of generating names (does it touch the filesystem, I wonder?).Osmose
@zahory: Actually, the parentheses are not part of the syntax of generator expressions. As with tuples, they're only needed when it would otherwise be ambiguous. The paradigm example in the docs (docs.python.org/3/glossary.html#term-generator-expression) is sum(i*i for i in range(10)). I believe the exact rule for genexps is that if it's already inside parens and doesn't have a comma on either side, it's not ambiguous. (Of course sometimes it may look ambiguous to humans, even if the parser has no problem, in which case it's always safe to add them for clarify.)Threadfin
mktemp only guarantees that the file is unique on one file system. It won't guarantee that the filename on the upload server will be unique. Furthermore, once the file is removed from the local system, the filename is free to be reused.Unearned
N
88

Which method has less collisions, is faster and easier to read?

TLDR

The random_choice is the fastest, has fewer collisions but is IMO slightly harder to read.

The most readable is shortuuid_random but is an external dependency and is slightly slower and has 6x the collisions.

The methods


alphabet = string.ascii_lowercase + string.digits
su = shortuuid.ShortUUID(alphabet=alphabet)

def random_choice():
    return ''.join(random.choices(alphabet, k=8))

def truncated_uuid4():
    return str(uuid.uuid4())[:8]

def shortuuid_random():
    return su.random(length=8)

def secrets_random_choice():
    return ''.join(secrets.choice(alphabet) for _ in range(8))

Results

All methods generate 8-character UUIDs from the abcdefghijklmnopqrstuvwxyz0123456789 alphabet. Collisions are calculated from a single run with 10 million draws. Time is reported in seconds as average function execution ± standard deviation, both calculated over 100 runs of 1,000 draws. Total time is the total execution time of the collision testing.

random_choice: collisions 22 - time (s) 0.00229 ± 0.00016 - total (s) 29.70518
truncated_uuid4: collisions 11711 - time (s) 0.00439 ± 0.00021 - total (s) 54.03649
shortuuid_random: collisions 124 - time (s) 0.00482 ± 0.00029 - total (s) 51.19624
secrets_random_choice: collisions 15 - time (s) 0.02113 ± 0.00072 - total (s) 228.23106

Notes

  • the default shortuuid alphabet has uppercase characters, hence creating fewer collision. To make it a fair comparison we need to select the same alphabet as the other methods.
  • the secrets methods token_hex and token_urlsafe while possibly faster, have different alphabets, hence not eligible for the comparison.
  • the alphabet and class-based shortuuid methods are factored out as module variables, hence speeding up the method execution. This should not affect the TLDR.

Full testing details

import random
import secrets
from statistics import mean
from statistics import stdev
import string
import time
import timeit
import uuid

import shortuuid


alphabet = string.ascii_lowercase + string.digits
su = shortuuid.ShortUUID(alphabet=alphabet)


def random_choice():
    return ''.join(random.choices(alphabet, k=8))


def truncated_uuid4():
    return str(uuid.uuid4())[:8]


def shortuuid_random():
    return su.random(length=8)


def secrets_random_choice():
    return ''.join(secrets.choice(alphabet) for _ in range(8))


def test_collisions(fun):
    out = set()
    count = 0
    for _ in range(10_000_000):
        new = fun()
        if new in out:
            count += 1
        else:
            out.add(new)
    return count


def run_and_print_results(fun):
    round_digits = 5
    now = time.time()
    collisions = test_collisions(fun)
    total_time = round(time.time() - now, round_digits)

    trials = 1_000
    runs = 100
    func_time = timeit.repeat(fun, repeat=runs, number=trials)
    avg = round(mean(func_time), round_digits)
    std = round(stdev(func_time), round_digits)

    print(f'{fun.__name__}: collisions {collisions} - '
          f'time (s) {avg} ± {std} - '
          f'total (s) {total_time}')


if __name__ == '__main__':
    run_and_print_results(random_choice)
    run_and_print_results(truncated_uuid4)
    run_and_print_results(shortuuid_random)
    run_and_print_results(secrets_random_choice)
Newmodel answered 31/5, 2019 at 16:34 Comment(3)
I prefer shortuuid ^^ https://mcmap.net/q/268159/-safe-enough-8-character-short-unique-random-stringMarla
@NamGVU shortuuid is a bit slower and more collisions than the random.choices bt definitely more readable.Newmodel
I don't really get the readability argument. def short_uuid(): ... is easy enough to writeNailbiting
B
68

Your current method should be safe enough, but you could also take a look into the uuid module. e.g.

import uuid

print str(uuid.uuid4())[:8]

Output:

ef21b9ad
Boothman answered 21/11, 2012 at 0:55 Comment(9)
Note that the "4" part of uuid4 is very important. "uuid1" for example generates strings with a fixed prefix based on the host ID.Hassett
This gave me 3 collisions in 100k names. That's probably a bad approach because you're using only 16 charsMisanthrope
I read elsewhere that truncating uuids is not a good way to generate short random strings.Osmose
@zahory: Part of the reason for that is the point BoppreH raised. You have to understand how each type of UUID works before you can know how safe it is to truncate them in different ways.Threadfin
43981 collisions when tried on a list of a million.Snakebite
basically whenever you select a fraction of what UUID4 returned (say, N chars) - you are prone to hitting more and more collisions. You should be using the full generated string, or use a different solution.Levee
FYI. 11526 collisions in 10 million on my M1Agraphia
I think this approach is dangerous unless you are coding with the knowledge that collisions aren't rare, but will occur frequently. In fact this approach gives an order of magnitude more collisions than other solutions. See @Oleg's excellent answer for details.Idem
I don't understand why there are so many collisions -- the UUID4 implementation is -- literally -- os.urandom(16). github.com/python/cpython/blob/3.12/Lib/uuid.py#L723Kraft
H
44

You can try the shortuuid library.

Install with : pip install shortuuid

Then it is as simple as :

> import shortuuid
> shortuuid.uuid()
'vytxeTZskVKR7C7WgdSP3d'
Hardfeatured answered 28/4, 2019 at 11:59 Comment(5)
tested shortuuid.uuid()[:8] with 100 million draws and got 0 collisions. but there is a speed sacrifice it seemsIlldisposed
and just for fun. 1 billion draws resulted in 55 collisionsIlldisposed
This should be the selected answer in my view!Marla
I'm guessing this is a Base64 encoded UUID with the padding = stripped off?Soporific
I see they're actually using Base57 without any of the non alphanumerics in Base64. Neat!Soporific
T
29

Is there a reason you can't use tempfile to generate the names?

Functions like mkstemp and NamedTemporaryFile are absolutely guaranteed to give you unique names; nothing based on random bytes is going to give you that.

If for some reason you don't actually want the file created yet (e.g., you're generating filenames to be used on some remote server or something), you can't be perfectly safe, but mktemp is still safer than random names.

Or just keep a 48-bit counter stored in some "global enough" location, so you guarantee going through the full cycle of names before a collision, and you also guarantee knowing when a collision is going to happen.

They're all safer, and simpler, and much more efficient than reading urandom and doing an md5.

If you really do want to generate random names, ''.join(random.choice(my_charset) for _ in range(8)) is also going to be simpler than what you're doing, and more efficient. Even urlsafe_b64encode(os.urandom(6)) is just as random as the MD5 hash, and simpler and more efficient.

The only benefit of the cryptographic randomness and/or cryptographic hash function is in avoiding predictability. If that's not an issue for you, why pay for it? And if you do need to avoid predictability, you almost certainly need to avoid races and other much simpler attacks, so avoiding mkstemp or NamedTemporaryFile is a very bad idea.

Not to mention that, as Root points out in a comment, if you need security, MD5 doesn't actually provide it.

Threadfin answered 21/11, 2012 at 1:14 Comment(11)
I am renaming files being uploaded to the same place, so there is no use of tempfile. Uuid4 is unique enough for anything, but too long. I'm pretty sure 8 chars should be enough to store thousands of names without collision, but I am not sure how to do that. By hashing urandom, I was trying to achieve better randomness.Osmose
@Osmose A UUID is 'too long' by what standard?I
@NickJohnson it's too long to be pretty:) Long filenames look ugly in listings. As I wrote before, 8 chars should by enough to store tens of hundreds random unique names without probable collision, I just don't know what's the best way to increase the improbability.Osmose
@Osmose If you want short, you should use a counter. By the birthday paradox, a randomly generated ID has to be at least twice as long as a sequential ID, given the same number of identifiers.I
according to my testing, ''.join([random.choice(string.ascii_letters+string.digits+'-_') for ch in range(8)]) gives no collision in millions of strings and is also far more efficient than encoding urandom(). I decided to go this way.Osmose
@zahory: tempfile.mktemp can still generate guaranteed-unique temporary filenames without creating the files. Sure, it isn't as secure as mkstemp, but the same is true for a random generator, and of the two, mktemp is absolutely safe instead of just "pretty safe, hopefully safe enough" from non-malicious collisions. It's also simpler and probably more efficient. So I'd still look into whether you can use it, or something similar. I'd also consider the counter before going to random as a last-ditch fallback.Threadfin
@zahory: You don't need the brackets you added to my answer unless you're using Python 2.3 or earlier. (random.choice(my_charset) for ch in range(8)) is a generator expression, which is a perfectly valid sequence to pass to join. Turning it into a list comprehension forces Python to build a list out of the expression (which isn't going to hurt too badly when you've only got 8 elements, but still, why add even a trivial bit of extra complexity for both yourself and the interpreter if you don't need it?).Threadfin
@Threadfin interestigly enough, you are right with the redundant brackets! I understand the syntax of generator expressions, but havent seen it inside a function call, where the parantheses are for... calling the function. You are actually omitting the syntax needed to construct a generator, and it works.Osmose
@Threadfin thank you for the idea of mktemp. I will have to investigate how efficient it is and what is their method of generating names (does it touch the filesystem, I wonder?).Osmose
@zahory: Actually, the parentheses are not part of the syntax of generator expressions. As with tuples, they're only needed when it would otherwise be ambiguous. The paradigm example in the docs (docs.python.org/3/glossary.html#term-generator-expression) is sum(i*i for i in range(10)). I believe the exact rule for genexps is that if it's already inside parens and doesn't have a comma on either side, it's not ambiguous. (Of course sometimes it may look ambiguous to humans, even if the parser has no problem, in which case it's always safe to add them for clarify.)Threadfin
mktemp only guarantees that the file is unique on one file system. It won't guarantee that the filename on the upload server will be unique. Furthermore, once the file is removed from the local system, the filename is free to be reused.Unearned
B
8

From Python 3.6 you should probably use the secrets module. secrets.token_urlsafe() seems to work for your case just fine, and it is guaranteed to use cryptographically safe random sources.

Bedstead answered 28/9, 2020 at 12:10 Comment(0)
A
7

Fastest Deterministic Method

import random
import binascii
e = random.Random(seed)
binascii.b2a_base64(e.getrandbits(48).to_bytes(6, 'little'), newline=False)

Fastest System Random Method

import os
import binascii
binascii.b2a_base64(os.urandom(6), newline=False)

Url Safe Methods

Use os.urandom

import os
import base64
base64.urlsafe_b64encode(os.urandom(6)).decode()

Use random.Random.choices (slow, but flexible)

import random
import string
alphabet = string.ascii_letters + string.digits + '-_'
''.join(random.choices(alphabet, k=8))

Use random.Random.getrandbits (faster than random.Random.randbytes)

import random
import base64
base64.urlsafe_b64encode(random.getrandbits(48).to_bytes(6, 'little')).decode()

Use random.Random.randbytes (python >= 3.9)

import random
import base64
base64.urlsafe_b64encode(random.randbytes(6)).decode()

Use random.SystemRandom.randbytes (python >= 3.9)

import random
import base64
e = random.SystemRandom()
base64.urlsafe_b64encode(e.randbytes(6)).decode()

random.SystemRandom.getrandbits is not recommended if python >= 3.9, since it takes 2.5x time comparing to random.SystemRandom.randbytes and is more complicated.

Use secrets.token_bytes (python >= 3.6)

import secrets
import base64
base64.urlsafe_b64encode(secrets.token_bytes(6)).decode()

Use secrets.token_urlsafe (python >= 3.6)

import secrets
secrets.token_urlsafe(6) # 6 byte base64 has 8 char

Further Discussion

secrets.token_urlsafe implementation in python3.9

tok = token_bytes(nbytes)
base64.urlsafe_b64encode(tok).rstrip(b'=').decode('ascii')

Since ASCII bytes .decode() is faster than .decode('ascii'), and .rstrip(b'=') is useless when nbytes % 6 == 0.

base64.urlsafe_b64encode(secrets.token_bytes(nbytes)).decode() is faster (~20%).

On Windows10, bytes based method is 2x faster when nbytes=6(8 char), and 5x faster when nbytes=24(32 char).

On Windows 10(my laptop), secrets.token_bytes take similar time like random.Random.randbytes, and base64.urlsafe_b64encode take more time than random bytes generation.

On Ubuntu 20.04(my cloud server, may lack entropy), secrets.token_bytes take 15x more time than random.Random.randbytes, but take similar time like random.SystemRandom.randbytes

Since secrets.token_bytes use random.SystemRandom.randbytes use os.urandom (thus they are exactly same), you may replace secrets.token_bytes by os.urandom if performance is crucial.

In Python3.9, base64.urlsafe_b64encode is a combination of base64.b64encode and bytes.translate, thus take ~30% more time.

random.Random.randbytes(n) is implemented by random.Random.getrandbits(n * 8).to_bytes(n, 'little'), thus 3x slower. (However, random.SystemRandom.getrandbits is implemented with random.SystemRandom.randbytes)

base64.b32encode is dramatically slower(5x for 6 bytes, 17x for 480 bytes) than base64.b64encode because there are a lots of python code in base64.b32encode, but base64.b64encode just call binascii.b2a_base64 (C implemented).

However, there is a python branch statement if altchars is not None: in base64.b64encode, which will introduce not negligible overhead when process small data, binascii.b2a_base64(data, newline=False) may be better.

Algia answered 20/11, 2021 at 12:29 Comment(0)
M
3

I am using hashids to convert a timestamp into a unique id. (You can even convert it back to a timestamp if you want).

The drawback with this is if you create ids too fast, you will get a duplicate. But, if you are generating them with time in-between, then this is an option.

Here is an example:

from hashids import Hashids
from datetime import datetime
hashids = Hashids(salt = "lorem ipsum dolor sit amet", alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890")
print(hashids.encode(int(datetime.today().timestamp()))) #'QJW60PJ1' when I ran it
Michelsen answered 24/10, 2018 at 19:58 Comment(0)
L
1

You can try this

import random
uid_chars = ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u',
             'v', 'w', 'x', 'y', 'z','1','2','3','4','5','6','7','8','9','0')
uid_length=8
def short_uid():
    count=len(uid_chars)-1
    c=''
    for i in range(0,uid_length):
        c+=uid_chars[random.randint(0,count)]
    return c

eg:

print short_uid()
nogbomcv
Lyophilize answered 17/6, 2017 at 12:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.