Idiomatic "guaranteed-unique" identifiers in C++
Asked Answered
S

6

5

Is there an idiomatic C++ way to reserve and recycle identifiers that are guaranteed to be unique? My requirements are:

  1. Assuming there exists an ID not currently reserved, reserve_id(void) returns me that ID.
  2. In an unbroken sequence of reserve_id() calls, no single identifier will be returned twice
  3. There exists a function recycle(id_type) that returns an identifier to the available pool.

I have, for instance, seen Boost::Uuid, but a) I see no documentation which asserts the guaranteed uniqueness of two UUIDs and b) I'm constrained to an earlier version of Boost (1.40), for the time being. I could push to upgrade, if this were particularly perfect for the task.

Slipnoose answered 14/4, 2011 at 8:49 Comment(11)
Your requirements aren't clear: as specified, you could sprintf("id%d", ++id) with a static id counter and it'd be good for a couple billion values, or return a std::string after += 'x' ;-). If in practice the values aren't exhausted, recycling could be a no-op, or you could create a deque<> and use it preferentially. What other constraints / expectations do you really have... on length, format, performance, cross-host uniqueness, cross-process-run uniqueness etc.? Or are we supposed to infer requirements akin to Boost::Uuid's purpose?Hindgut
To be clear, I am worried about integer wraparound, as I'm writing a long-running server. My chief concern is #2 above.Slipnoose
You could easily use unsigned 64 bit int, or even chain 2 with a simple if (!++low_order) ++high_order;...? That ought to be ok until the sun burns out....Hindgut
Its worth considering what "extremely unlikely" to duplicate means. wikipedia describes the chance for a 122bit random UUID "In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs." en.wikipedia.org/wiki/Universally_unique_identifier. IMO its more likely for hardware failure of some type to give you a duplicate in whatever carefully crafted code you use.Drive
@Michael Anderson: I seriously hope you are not in charge of designing airplanes, cars or nuclear things :-)Intensify
@Giovanni - I think its more likely that there are bugs in the CPU or compiler or something like that, than a collision happening with 122bits. But hey if you're really scared of such possibilities you can always double or tripple your number of bits... Then you're getting to the "generating billions of ids per nanosecond for the life of the universefor a 50% chance of a collision" kind of scale . - Its all a question of what is acceptable as "close enough to zero possibility". And if you say only true zero, them you're being academic and not practical.Drive
@Michael, you're assuming that you have a good random number generator. I don't know if that is a fair assumption unless you use a special library.Checkrow
@edA-qa mort-ora-y: you can use boost::random_device as the source for boost::uuids::random_generator.Tupper
Okay okay okay! Y'all have convinced me that UUIDs are fine. :) I'll probably do a Patricia Tree kind of thing to shorten them in log files.Slipnoose
It is kind of implicit, but perhaps you can state the requirements to the following: The identifiers have to be unique only for the running process. If you restart the process, it can repeat the identifiers? If two server executables run at the same time, they can generate the same identifiers? The ids do not have to be recycled, they can also be discarded? How many are typically generated? 1 per second? 1 billion per second? Is there a size limit? Is there a multithreading performance requirement (no blocking)?Geraint
Use a large number (256 bits), some bits for thread number; just increment the previous number for each thread with each call. The number is so large that it never overflows. No need for UUID, hashing and similar shenanigans.Geraint
S
3

I think you've already solved this problem for most practical purposes by finding Boost::Uuid, with the exception of your requirement to recycle identifiers already generated.

From the documentation you linked to in the question:

When UUIDs are generated by one of the defined mechanisms, they are either guaranteed to be unique, different from all other generated UUIDs (that is, it has never been generated before and it will never be generated again), or it is extremely likely to be unique (depending on the mechanism).

If you're hell-bent on recycling and re-using existing identifiers, I suppose you could maintain build up a pool of UUIDs over time, generating new ones only when you need one and find that the pool is empty. But I can't imagine a scenario where that would be preferable to generating a new UUID.

EDIT: You've commented that you need a guarantee as to uniqueness. Realistically, you're never going to get one when programatically generating a unique identifier. In practice, you're going to store the generated ID in a data type which has finite size, and so the possible set of IDs you can generate is finite too. IMHO, the best you can achieve then is to simulate uniqueness within a tolerance threshold.

You could do this by

  • Using a technique that makes the chances of getting a duplicate UUID very remote (this is what Boost::UUID will do);

  • Wrapping the generation of the highly-probably-to-be-unique UUID in some other logic that looks up the newly-generated UUID in a list of already-generated UUIDs to eliminate that tiny chance that the new one is a duplicate. Obviously, the practicality of doing this becomes decreases as you approach very large quantities of UUIDs in your list. How many do you anticipate generating?

  • If you want truly huge quantities of unique IDs, bigger than would fit in a native type, you could implement a type that manages the memory and does the necessary maths, and just produce sequential Ids, or you could perhaps use something like the GNU Bignum Library to do it for you.

Setting answered 14/4, 2011 at 8:57 Comment(3)
So "extremely likely" doesn't work for me, in this case. I do want a guarantee.Slipnoose
@Andres Jaan Tack: what kind of guarantee? Do you need a stronger guarantee from your UUID generator that the numbers are unique, than you get from your RAM that it will store them correctly, as opposed to flipping a few bits to change a new unique value to a value that you've seen before? Is this code running on a real machine, or a Platonic ideal of a Turing machine?Tupper
@Setting I like your treatment of this issue. In summary, the right answer is, "Everybody is happy to believe the UUID generator, and everything has been fine that way."Slipnoose
K
3

What sort of uniqueness do you require?
Just unique for the lifetime of the program or unique across multiple runs/cross-process?

If it is the former then you could just new a byte of memory then use the address of that memory as your identifier. This would be guaranteed to be unique until you delete the memory, at which point it may be recycled.

This could easily be wrapped in a class like this:

#include <stdint.h>

class UID
{
public:
        typedef uint64_t id_type;

        static const id_type reserve_id()
        {
                uint8_t* idBlock = new uint8_t;
                *idBlock = validId;
                return (id_type)idBlock;
        }

        static void recycle(id_type id)
        {
                uint8_t* idBlock = (uint8_t*)id;
                if (*idBlock == validId)
                {
                        *idBlock = 0;
                        delete idBlock;
                }
        }
private:
        static const uint8_t validId = 0x1D;
};

Possibly a bit unusual, but it meets your requirements if you only need per-process uniqueness :)

Knockknee answered 14/4, 2011 at 9:36 Comment(0)
A
3

How long do the IDs live? Do you REALLY need to recycle them, or can you live with them being unique forever? How many do you need to generate all at once? How many bits can you devote to the id?

Here's a simple recipe: Take your ethernet card's mac address (which is globally unique baring hardware issues), mix in the time/date (to millisecond resolution) and an incrementing integer counter (increments once per id generated) and you'd have an id that's unique within the span of your time/date range, as long as you don't generate MAXINT of them in a single millisecond on this machine. Now it's NOT random looking, and it's EASY for an attacker to predict, so don't use it for security, and it's sure not the most efficient use of bits out there, but it IS globally unique.

Amalgamation answered 14/4, 2011 at 10:44 Comment(2)
What I've described is not proof against ANY type of deliberate attack. The integer counter will protect against normal clock change issues, unless you roll it over many times per day.Amalgamation
yes, what I said wasn't intended as a criticism. Just something to bear in mind when using the time.Tupper
M
2

Yes, this is simple.

  1. The reserve_id function is operator new(0).
  2. This allocates zero bytes, but with a unique address.
  3. The recycle function is of course operator delete
Miler answered 14/4, 2011 at 10:57 Comment(3)
@Salters True - up to the point where the software is restarted, or you run out of addresses.Setting
isn't that pretty much what I suggested 2 hours before you? (Except I went for a one-byte allocation so I could add some limited validation). Where's my upvote dammit? :)Knockknee
Well, and I didn't bother casting to an integral type. If UUIDs are acceptable, void* probably are too.Miler
I
0

The problem does not seem connected to C++, it is more of a fundamental issue. How many IDs are expected to be valid at any given time? If you expect to have few valid IDs at any given time, just put them in a container such as linked list, vector or set depending on your performance requirements and relative recycle/reserve frequency. A sorted linked list is probably the best option as you will have both recycle and reserve operations in O(n). A vector has O(n), O(n log n) and a set has O(n log n), O(n) respectively (might be wrong, I did the thinking very quicky).

void recycle(ID) {
    container.remove(ID);
    // abort if unsuccessiful (= invalid ID)
}

ID reserve() {
    static ID last = 0;
    while(container.find(last)) {
        last++;
    }
    return last;
}
Intensify answered 14/4, 2011 at 10:11 Comment(0)
O
0

Here's a simplified implementation I use in .NET projects, quickly translated to C++ (may want to add some bounds validation which was implicit in C#):

  • Can rent (T acquire()) and return (void release(T)) successive integral numbers, with an optional start (offset). You can use these as an index or seed to more complex identifiers/structures if needed.
  • Does not require iterations, instead using offset logic in a second index array (duplicating memory requirements).
  • Released numbers are returned on next acquire.
  • Some extra convenience methods like getting the total numbers rented (T size()) or validating them (bool validate(T)).
#include <stdexcept>
#include <vector>

template <class T>
class NumberPool
{
public:
    NumberPool(const T capacity, const T offset = 0)
        : m_capacity{ capacity }
        , m_offset{ offset }
    {
        m_indices.resize(m_capacity);
        m_numbers.resize(m_capacity);
        clear();
    }

    T acquire()
    {
        if (m_size == m_capacity)
            throw std::out_of_range("No more numbers available.");

        T i = m_size++;
        T n = m_numbers[i];
        m_indices[n] = i;

        return n + m_offset;
    }

    T capacity() const
    {
        return m_capacity;
    }

    void clear()
    {
        m_size = 0;
        for (T i = 0; i < m_capacity; ++i)
            m_numbers[i] = i;
    }

    void release(T n)
    {
        n -= m_offset;

        T last = m_numbers[--m_size];
        T i = m_indices[n];
        m_indices[last] = i;
        m_numbers[m_size] = n;
        m_numbers[i] = last;
    }

    T size() const
    {
        return m_size;
    }

    bool validate(T n) const
    {
        n -= m_offset;

        T i = m_indices[n];
        return i < m_size && m_numbers[i] == n;
    }

private:
    std::vector<T> m_indices;
    std::vector<T> m_numbers;
    T m_capacity;
    T m_size;
    T m_offset;
};

Can be used as follows (feel free to make proper asserts in a test framework):

NumberPool<uint16_t> pool{ 5, 1000 };
uint16_t n1000 = pool.acquire();
uint16_t n1001 = pool.acquire();
uint16_t n1002 = pool.acquire();
uint16_t size3 = pool.size();

bool n1002true = pool.validate(n1002);
pool.release(n1002);
bool n1002false = pool.validate(n1002);
uint16_t size2 = pool.size();

uint16_t n1002b = pool.acquire();
uint16_t n1003 = pool.acquire();
uint16_t n1004 = pool.acquire();
uint16_t size5 = pool.size();
bool n1004true = pool.validate(n1004);

pool.clear();
uint16_t size0 = pool.size();

bool n1000false = pool.validate(n1000);
bool n1004false = pool.validate(n1004);

uint16_t n1000b = pool.acquire();
bool n1000btrue = pool.validate(n1000b);
Oren answered 17/5 at 18:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.