Shortening/Rehashing UUIDs

First of all, I'm aware that rehashing is a sensitive topic. Still, I'd like to hear which approach you would take here.

I'm building a distributed application, where nodes remotely create entities identified by a UUID. Eventually, all entities should be gathered at a dedicated drain node, which stores all entities by using these UUIDs.

Now I want to create additional identifiers that are handier for human users. Base64-encoding the UUIDs would still yield 22-character IDs, which is too long for human use. So I need something like what URL-shortening services do. Bijective functions will not help, because they cannot reduce the amount of information. I'm aware that I need to lose information in order to shorten the ID, and that any reduction in the number of bits increases the probability of collisions. What I'm stuck on is the most appropriate way to reduce the information in order to create shorter IDs for humans.

Here are some prerequisites: I will provide the ability to map {UUID, shortened ID} via my data storage. I'd still prefer a non-centralized solution. I will probably never need more than about a million IDs (~2^20) in total.

Here are the thoughts I came up with so far:

  • Auto-incremented IDs: If I used some kind of auto-incremented ID, I could transform it into an obfuscated string and pass that around. This would be the easiest approach, and as long as there are few keys around, the keys would not be very long. However, I'd have to introduce a centralized entity, which I don't really want.
  • Shorten the UUID: I could just take some of the bits of the original 128-bit UUID. In that case I'd at least have to take the UUID version into account. Or is there anything else wrong with this?
  • Rehashing the UUID: I could apply a second hashing algorithm on my initial UUID and store the mapping.

Are there any other approaches? What is favorable?

Thanks in advance!

Bur answered 12/2, 2010 at 17:19 Comment(0)
A
33

1) To shorten the UUID, you can simply XOR the top half with the bottom half (and repeat until it's short enough for you). This preserves the distribution characteristics. Like any solution that shortens the output, it increases the likelihood of collisions due to the birthday paradox.

2) XOR amounts to a trivial hash, but since no additional mixing is needed, it's fine. You could use a CRC or noncryptographic hash on your UUID, but I don't believe it's any improvement.

3) If you're willing to accept some central management, it doesn't have to be painful. A central authority can dole out medium-sized blocks of address space to each client; the client then iterates through that subrange when assigning IDs. This guarantees that there are no collisions, and also avoids a round-trip for each ID. One way to do it would be to use a 32-bit integer for the ID, doling out a 16-bit block at a time. In other words, the first client gets handed 0001, which allows 00010000 to 0001FFFF.

4) You could insert into the database with a UUID, but also have an identity field. This would provide an alternate, more compact unique ID, which can be limited to a 32-bit int.
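
For point 3, a minimal sketch of the block hand-out, assuming a 32-bit ID made of a 16-bit block prefix and a 16-bit local counter. The class names `BlockAuthority` and `BlockIdGenerator` are made up for illustration, and in a real deployment `leaseBlock()` would be a remote call rather than a method on a local object.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Central authority: hands out 16-bit block prefixes, one per request. */
final class BlockAuthority {
    private final AtomicInteger nextPrefix = new AtomicInteger(1);

    /** Returns the next unused prefix (0x0001..0xFFFF). Only this call needs a round-trip. */
    int leaseBlock() {
        int prefix = nextPrefix.getAndIncrement();
        if (prefix > 0xFFFF) {
            throw new IllegalStateException("32-bit ID space exhausted");
        }
        return prefix;
    }
}

/** Client side: assigns IDs locally from its leased block and leases a new block when full. */
final class BlockIdGenerator {
    private final BlockAuthority authority;
    private int prefix;
    private int offset; // next free offset inside the current block

    BlockIdGenerator(BlockAuthority authority) {
        this.authority = authority;
        this.prefix = authority.leaseBlock();
    }

    synchronized int nextId() {
        if (offset > 0xFFFF) {          // current block used up
            prefix = authority.leaseBlock();
            offset = 0;
        }
        return (prefix << 16) | offset++;
    }
}
```

The first generator created against a fresh authority yields 0x00010000, 0x00010001, and so on; the second starts at 0x00020000. Prefixes from 0x8000 upward produce negative `int` values, so print them with `Integer.toHexString` or `Integer.toUnsignedString` if you want them to look like unsigned IDs.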

Ashford answered 12/2, 2010 at 18:5 Comment(6)
@3: I'm bound to UUIDs by the system used on the distributed nodes. And I don't want to add my own IDs again, so I'll stick to UUIDs for my data storage. I just want to provide some 'alias' IDs. - Bur
I'll add a (4), but I'm not sure I endorse it. - Ashford
@4: I'm planning to use CouchDB, which does not have any auto-incrementing identity features and also uses UUIDs by default. So the additional hash I'm searching for will only be an additional attribute per entry, and will be resolved using a view. - Bur
Given this, I don't think (4) works for you. Is (1) good enough? Keep in mind that, per the birthday paradox, 32 bits only stay collision-free up to somewhat less than 64k IDs. - Ashford
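
To put rough numbers on the birthday-paradox remark, here is a small sketch using the standard approximation p ≈ 1 - exp(-n(n-1)/2^(b+1)) for n uniformly random b-bit IDs. The class name `BirthdayBound` is made up for illustration, and the figures only hold if the shortened IDs really are uniformly distributed.

```java
final class BirthdayBound {
    /** Expected number of colliding pairs among n random IDs of the given bit length. */
    static double expectedCollidingPairs(long n, int bits) {
        double pairs = n * (n - 1.0) / 2.0;
        return pairs / Math.pow(2.0, bits);
    }

    /** Approximate probability of at least one collision: 1 - exp(-expected pairs). */
    static double collisionProbability(long n, int bits) {
        return -Math.expm1(-expectedCollidingPairs(n, bits));
    }
}
```

With 32 bits, 65,536 IDs already give roughly a 39% chance of at least one collision, and at the ~2^20 IDs mentioned in the question you would expect about 128 colliding pairs. With 64 bits the probability at 2^20 IDs drops to about 3 in 100 million.
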
@PartlyCloud - can you please provide some sample code showing how to do this, mainly for #1? - Rosannrosanna
@Pure: There's not much to this. The main thing is to use Guid.ToByteArray() to get a 16-byte array. Then you can use the ^ operator to XOR bytes together. If you want a 32-bit output, you'll need to combine each group of four input bytes into one output byte. I'd recommend interleaving it so that the first output byte comes from a combo of offsets 0, 4, 8, and 12. And so on. - Ashford
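
A minimal Java sketch of the interleaved XOR fold described in the last comment, assuming `java.util.UUID` as the input type. The comment refers to .NET's Guid.ToByteArray(), which lays out the first three fields little-endian, so the values produced here (big-endian layout) will not match a .NET implementation byte for byte. The names `UuidFolder`, `foldTo32` and `shortId` are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

final class UuidFolder {
    /** Folds a UUID to 32 bits: output byte k is the XOR of input bytes k, k+4, k+8 and k+12. */
    static int foldTo32(UUID uuid) {
        byte[] in = ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
        byte[] out = new byte[4];
        for (int i = 0; i < in.length; i++) {
            out[i % 4] ^= in[i];
        }
        return ByteBuffer.wrap(out).getInt();
    }

    /** Renders the folded value as 8 lowercase hex digits. */
    static String shortId(UUID uuid) {
        return String.format("%08x", foldTo32(uuid));
    }
}
```

For example, `UuidFolder.shortId(UUID.fromString("de305d54-75b4-431b-adb2-eb6b9e546013"))` returns `98629537`. Since the result is not collision-free, the {UUID, short ID} mapping still has to be stored and checked on insert.
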
T
12

Have you considered using an external aliasing approach, where you pick a dictionary of human friendly terms and use them to make (parts of) the UUID more readable (compare with Geocoding systems such as What3Words):

de305d54-75b4-431b-adb2-eb6b9e546013

Using a dictionary of 65536 words, this could become:

de305d54-zebra-stackoverflow-extraneous-eb6b9e546013

Users are unlikely to experience a mental hash collision (e.g. zebra occurring twice) with these human-readable names, and your database does not grow in size. The translation is bijective and purely a UI concern.

There is even an RFC for this: https://datatracker.ietf.org/doc/html/rfc1751
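
A rough sketch of the word substitution shown above, assuming you supply your own list of exactly 65536 distinct, hyphen-free words (one per 16-bit value). The class `WordAlias` and its methods are hypothetical, and this is not the RFC 1751 encoding itself, just the same idea applied to the three middle groups of the UUID string.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

final class WordAlias {
    private final List<String> words;                              // index -> word
    private final Map<String, Integer> indexOf = new HashMap<>();  // word -> index

    WordAlias(List<String> words) {
        if (words.size() != 1 << 16) {
            throw new IllegalArgumentException("need exactly 65536 words");
        }
        for (int i = 0; i < words.size(); i++) {
            String w = words.get(i);
            if (w.contains("-") || indexOf.put(w, i) != null) {
                throw new IllegalArgumentException("hyphenated or duplicate word: " + w);
            }
        }
        this.words = new ArrayList<>(words);
    }

    /** Replaces the three 4-hex-digit groups of the canonical UUID string with words. */
    String encode(UUID uuid) {
        String[] g = uuid.toString().split("-");
        return g[0] + "-" + toWord(g[1]) + "-" + toWord(g[2]) + "-" + toWord(g[3]) + "-" + g[4];
    }

    /** Inverse of encode: turns the three words back into hex groups. */
    UUID decode(String alias) {
        String[] g = alias.split("-");
        if (g.length != 5) {
            throw new IllegalArgumentException("expected 5 groups: " + alias);
        }
        return UUID.fromString(g[0] + "-" + toHex(g[1]) + "-" + toHex(g[2]) + "-" + toHex(g[3]) + "-" + g[4]);
    }

    private String toWord(String hexGroup) {
        return words.get(Integer.parseInt(hexGroup, 16));
    }

    private String toHex(String word) {
        Integer i = indexOf.get(word);
        if (i == null) {
            throw new IllegalArgumentException("unknown word: " + word);
        }
        return String.format("%04x", i);
    }
}
```

Because the mapping is a plain table lookup in both directions, nothing extra needs to be stored per entity; only the word list has to be identical wherever aliases are rendered or parsed.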

Tarttan answered 28/1, 2015 at 19:17 Comment(0)
C
4

Just a couple of things that pop into mind:

What is your use case? If your concern is that you will be generating IDs in a distributed manner, one solution is to assign each machine its own unique int ID and use that as a prefix or suffix on its IDs.

This doesn't really help if, by not having a central entity, you mean nothing that keeps track of IDs even locally. You could borrow a page from UUID itself and use the system time in conjunction with the machine ID assigned as above. This would get you down to 64 bits plus whatever size your machine ID is. Basically, this is the UUID v1 scheme, except you're using something shorter than a MAC address for the machine ID. Given that you know you can start at dates >= Feb 12, 2010, you may be able to shorten it even further.

Check out the Wikipedia UUID entry if you haven't already; you may get an idea or two from there on how to construct your own.
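
A sketch of the time-plus-machine-ID idea, packed into a single 64-bit value. The bit split (10 bits of machine ID, 12 bits of per-millisecond sequence) and the class name `CompactId` are assumptions chosen for illustration, not something the scheme above prescribes; the custom epoch of 2010-02-12 follows the suggestion to start counting from a recent date.

```java
import java.time.Instant;

final class CompactId {
    // Custom epoch: no IDs exist before this date, so the time field starts near zero.
    static final long EPOCH_MILLIS = Instant.parse("2010-02-12T00:00:00Z").toEpochMilli();
    static final int MACHINE_BITS = 10;   // assumption: at most 1024 nodes
    static final int SEQUENCE_BITS = 12;  // assumption: at most 4096 IDs per node per millisecond

    private final int machineId;
    private long lastMillis = -1;
    private int sequence;

    CompactId(int machineId) {
        if (machineId < 0 || machineId >= (1 << MACHINE_BITS)) {
            throw new IllegalArgumentException("machineId out of range: " + machineId);
        }
        this.machineId = machineId;
    }

    /** Packs [millis since epoch | machine id | sequence] into one long. */
    static long compose(long unixMillis, int machineId, int sequence) {
        long elapsed = unixMillis - EPOCH_MILLIS;
        return (elapsed << (MACHINE_BITS + SEQUENCE_BITS))
                | ((long) machineId << SEQUENCE_BITS)
                | sequence;
    }

    /** Returns a new ID that is unique and increasing for this machine ID. */
    synchronized long next() {
        // Never go backwards, even if the system clock does.
        long now = Math.max(System.currentTimeMillis(), lastMillis);
        if (now == lastMillis) {
            sequence++;
            if (sequence == 1 << SEQUENCE_BITS) { // sequence exhausted: borrow the next millisecond
                now++;
                sequence = 0;
            }
        } else {
            sequence = 0;
        }
        lastMillis = now;
        return compose(now, machineId, sequence);
    }
}
```

Rendering the result with `Long.toString(id, 36)` gives a shorter string than the decimal form. Uniqueness across nodes rests entirely on each node having a distinct machine ID, which is the one piece of coordination this approach still needs.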

Compulsive answered 12/2, 2010 at 18:17 Comment(2)
Please see my first comment to Steven's answer to see that I'm bound to UUID by the system. - Bur
The other thing is that UUIDs are usually hashed versions of the values generated by that algorithm. - Ashford
A
1

Here is a simple hashing algorithm I wrote. You could use this; you can easily change the input and output mappings, as well as the length of the hash, to trade off readability against collision likelihood.

This algorithm is not designed to be secure or particularly efficient, but it should do the trick.

public class HashTools {

  final static String inputMapping = "0123456789ABCDEF";

  final static String[] outputMapping = new String[] {
      "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "A", "B", "C", "D", "E", "F", "G", "H",
      "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"
  };

  /*
   * Folds the hex digits of str into hashLength buckets by summing them.
   * Characters outside 0-9/A-F (such as the hyphens in a UUID) are skipped.
   * Output: a string of hashLength characters drawn from 0-9, A-Z.
   */
  public static String simpleHash(String str, int hashLength) {
    StringBuilder hashStr = new StringBuilder(hashLength);
    String strUpper = str.toUpperCase();
    int[] hash = new int[hashLength];

    int i, j, num;
    for (i = 0; i < strUpper.length(); i++) {
      char strChar = strUpper.charAt(i);
      num = mapCharToInt(strChar);
      if (num < 0) {
        continue; // unmapped character: indexOf returned -1
      }

      j = i % hashLength;
      hash[j] += num;
    }

    for (i = 0; i < hashLength; i++) {
      hashStr.append(mapIntToHashChar(hash[i]));
    }

    return hashStr.toString();
  }

  private static int mapCharToInt(char hexChar) {
    return inputMapping.indexOf(hexChar);
  }

  private static String mapIntToHashChar(int num) {
    return outputMapping[num % outputMapping.length];
  }
}
Athome answered 1/9, 2012 at 14:16 Comment(0)
