C++ efficient and compact map with integer keys

I have 200 sets of about 50,000 unique integers in the range 0 to 500,000, each of which I need to map to another small value (a pair of ints; the values are unrelated, so there is no on-demand calculation).

I tried using std::unordered_map, and it used around 50MB (measured with the VS2015 heap diagnostics tool). While performance was fine, I'd like to get this memory usage down (this is intended to run as a background service on some small 500MB cloud servers).

Effectively my initial version was 200 separate std::unordered_map<int, std::pair<int, int>>.
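Roughly like this (simplified; holding them in a vector is just for illustration):

#include <unordered_map>
#include <utility>
#include <vector>

// 200 separate maps, each holding about 50,000 entries of int -> (int, int).
std::vector<std::unordered_map<int, std::pair<int, int>>> maps(200);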

One option seems to be a sorted array with binary search, but is there anything else?

Ordinarily answered 30/7, 2016 at 16:33 Comment(10)
Is each of the 200 "sets" its own unique map?Fence
Did you try std::map?Maliamalice
@Maliamalice It's neither as space-efficient nor, especially, as performant as std::unordered_map for this case. I'm more curious about whether there was any tuning of the bucket size.Fence
200 * 50000 * 4 bytes for an integer comes to 40 megabytes. So at 50 megabytes including the mapped-value, I would say you're doing pretty well.Joaquinajoash
Assuming the sets are dynamic (as they must be for the sorted array to work well), I'd consider the technique I outlined in an older answer: https://mcmap.net/q/1915294/-a-space-efficient-data-structure-to-store-and-look-up-through-a-large-set-of-uniformly-distributed-integersMapp
@Fence That's interesting; most references suggest std::map should consume less memory than std::unordered_map, but I've not done any tests myself. I would think complexity would be similar to a sorted array, but without the need to implement it manually. Though the array should be faster.Maliamalice
I didn't quite understand what serves as the key in your mapping - an integer or the set of integers? I am asking because the reported memory usage of 50MB seems too low to include an int -> std::pair<int, int> mapping (for a total of 10M = 200*50,000 elements, especially when using a std::unordered_map), while the title suggests that the key is an integer.Gonsalez
What would tuning of the bucket size entail? I assumed the standard implementation was fairly efficient for an unknown set of values?Ordinarily
It was 200 separate maps with a single int key; I added the definition to the question.Ordinarily
Van Emde Boas tree maybe?Selfrising

I think a sorted vector should work, provided you don't change the vector once it's sorted. It's really space-efficient, i.e. there is no pointer overhead.
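A minimal sketch of that approach, with illustrative names: build the vector, std::sort it once, then binary-search with std::lower_bound:

#include <algorithm>
#include <optional>
#include <utility>
#include <vector>

using Entry = std::pair<int, std::pair<int, int>>;  // key -> (value1, value2)

std::optional<std::pair<int, int>> lookup(const std::vector<Entry>& v, int key) {
  auto it = std::lower_bound(v.begin(), v.end(), key,
                             [](const Entry& e, int k) { return e.first < k; });
  if (it != v.end() && it->first == key) return it->second;
  return std::nullopt;
}

int main() {
  std::vector<Entry> v = {{42, {1, 2}}, {7, {3, 4}}, {100, {5, 6}}};
  std::sort(v.begin(), v.end());   // sort once after build-up
  auto r = lookup(v, 42);          // O(log n) lookup, no per-node overhead
  return (r && r->first == 1) ? 0 : 1;
}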

If you need even better performance and don't mind a third-party library, you can try sparse_hash_map, which implements a hash map with very little space overhead.
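If I remember the interface right, sparse_hash_map is nearly a drop-in replacement for std::unordered_map. A sketch, assuming the Google sparsehash headers are installed:

#include <sparsehash/sparse_hash_map>  // Google sparsehash, header-only
#include <utility>

int main() {
  google::sparse_hash_map<int, std::pair<int, int>> m;
  m[42] = std::make_pair(1, 2);   // inserts like std::unordered_map
  auto it = m.find(42);           // lookup
  return (it != m.end() && it->second.first == 1) ? 0 : 1;
}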

Immerge answered 30/7, 2016 at 16:52 Comment(0)

I guess that the most memory-efficient option will be a std::vector<std::pair<int, std::set<Something>>>, like you already suggested.

In this case, you will only have memory overhead as a result of:

  • The fixed overhead from std::vector (very limited)
  • Temporarily higher memory usage while growing, as the old data and the new data both have to be alive at that moment
  • The unused space in std::vector

You indicate that after the build-up you no longer have to extend the vector, so you can either reserve up front or shrink_to_fit afterwards to get rid of the unused space. (Note that reserve also avoids the memory spikes during growth.)
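For example (only the capacity handling matters here; the element type uses the asker's int -> pair of ints entries for concreteness):

#include <utility>
#include <vector>

int main() {
  std::vector<std::pair<int, std::pair<int, int>>> entries;
  entries.reserve(50000);   // pre-allocate: avoids reallocation spikes during build-up
  // ... push_back the ~50,000 entries here ...
  entries.shrink_to_fit();  // or, if the final size was not known up front, drop the spare capacity afterwards
}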

If you had denser usage, you could consider changing the storage to std::vector<std::set<Something>> or std::vector<std::unique_ptr<std::set<Something>>>. In this structure the index is implicit, though the memory gain will only show if you have a value for every index.
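A sketch of that implicit-index layout, using a sentinel value such as INT_MAX to mark absent keys instead of a pointer or an empty set:

#include <climits>
#include <utility>
#include <vector>

int main() {
  // The key is the index itself; this only pays off when most indices are populated.
  std::vector<std::pair<int, int>> by_key(500001, {INT_MAX, INT_MAX});
  by_key[123456] = {7, 8};                          // key 123456 -> value (7, 8)
  bool present = by_key[123456].first != INT_MAX;   // membership test via the sentinel
  return present ? 0 : 1;
}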

The disadvantage of using a vector is that you have to write some custom code. In that case, std::unordered_map and std::map aren't that bad if you don't mind the extra cache misses on the processor caches (L1 ...). For less standard implementations, one could check out Google's sparsehash, Google's cpp-btree, or Facebook's AtomicHashMap from Folly, though I don't have any experience with them.

Finally, one could wonder why you have this data all in memory, though I don't see a way to prevent this if you need optimal performance.

Zimbabwe answered 30/7, 2016 at 17:38 Comment(5)
I don't understand how the std::set thing would work. What does Something look like? As for custom code with a sorted array, I was planning to just use std::sort (after creation) and std::lower_bound (lookup).Ordinarily
Unless you meant Something is the value and the array index is the key? Well, like I said, the data is 50,000 numbers from 0 to 500,000, so using an array like that is only about 10% efficient. Also, sizeof(unique_ptr) will be just as large as 2 ints on 64-bit platforms, although I think I could use an "invalid value" for them instead (INT_MAX perhaps).Ordinarily
Indeed, it is representing some storage, as I wasn't sure about your representation. (Or the next reader of this thread)Zimbabwe
Indeed, 64bit makes the trick with the pointers or the empty set even more useless (if you don't plan on using int64_t). Though if your original situation would require this, it might be handy.Zimbabwe
Please also note that the set can also be replaced by a vector.Zimbabwe

For efficient storage, depending on the precise value range, you may want to use bit operations to store the key/value pairs in a single value: for example, if the values are really small, you could even use 24 bits for the keys and 8 bits for the values, resulting in a single 32-bit entry. I believe most compilers nowadays use 32- or 64-bit alignment, so storing, for example, 32-bit keys and 16-bit values may still require 64 bits per entry. Using simple compression can also be beneficial for performance if the bottleneck is the memory bus and cache misses, rather than the CPU itself.
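A sketch of the packing idea, assuming the value genuinely fits in 8 bits (which, for the asker's pair of ints, it may well not):

#include <cstdint>

// Pack a 24-bit key and an 8-bit value into a single 32-bit entry.
// Assumes 0 <= key < 2^24 and 0 <= value < 2^8.
constexpr std::uint32_t pack(std::uint32_t key, std::uint32_t value) {
  return (key << 8) | (value & 0xFF);
}

constexpr std::uint32_t key_of(std::uint32_t entry)   { return entry >> 8; }
constexpr std::uint32_t value_of(std::uint32_t entry) { return entry & 0xFF; }

static_assert(key_of(pack(500000, 7)) == 500000, "key round-trips");
static_assert(value_of(pack(500000, 7)) == 7, "value round-trips");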

Then it depends on the kind of operations you would like to perform. The simplest way to store the keys would be a sorted array of structs, or of the combined key/value entries that I proposed above. This is fast and very space-efficient, but requires O(log n) lookup.

If you want to be a bit more fancy, you could use perfect hashing: the idea is to find a hash function that produces a unique hash value for each key. This allows the hash map to be a simple array which needs to be only marginally larger than the sorted array that I proposed above. Finding a good hash function should be relatively quick, and you can make it even easier by making the array a little bit larger and allowing for some unused fields in it. Here is an implementation of perfect hashing, but I haven't used it myself.
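To illustrate the idea only, here is a brute-force search for a collision-free multiplicative hash over a slightly oversized table. This is a toy sketch, not the linked implementation: it only terminates quickly for small key sets, and a real perfect-hash library uses far more clever multi-level constructions.

#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

// Try random odd multipliers until one maps every key to a distinct slot.
std::uint64_t find_perfect_multiplier(const std::vector<int>& keys, std::size_t table_size) {
  std::mt19937_64 rng{12345};
  for (;;) {
    std::uint64_t a = rng() | 1;  // random odd multiplier
    std::unordered_set<std::size_t> used;
    bool collision_free = true;
    for (int k : keys) {
      std::size_t slot = (static_cast<std::uint64_t>(k) * a >> 32) % table_size;
      if (!used.insert(slot).second) { collision_free = false; break; }
    }
    if (collision_free) return a;  // slot(k) = (k * a >> 32) % table_size is now unique per key
  }
}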

In both cases the memory consumption would be (number of pairs) * (bits per entry) bits, plus the cost of storing the hash function if you use the second approach.

EDIT

Updated after comment from @FireLancer. Also, added some words about performance of compressed arrays.

Shinbone answered 31/7, 2016 at 14:45 Comment(2)
I am not seeing how a bit operation in your first example would help here. I'd expect a struct Value { int x; int y; } to get stored as 8 contiguous bytes anyway. Maybe key+value_1+value_2 could be, say, 8 bytes rather than 12, though; I will need to see if I can restrict the range of values enough. The possibility of constructing a better hash function at run-time does look interesting, however; I will experiment to see how dense it is with my data sets.Ordinarily
@FireLancer You are right; in C/C++ the bit ops would only help if you want to use a non-standard bit width per key/value (I was thinking in Java). I'll update the answer.Shinbone

I'll assume you build up your map once and then access it repeatedly, since you mention using a sorted array which is effective in that scenario.

Since your key density is 0.1 (50k values in 500k domain) you could represent the key set with a 500k-entry bitmap which would add only 1.25 bytes per entry or about a 10% overhead (500k bits versus 50k * 12 bytes of payload).

The problem is that accessing a given entry by key is then a linear operation, since you'd have to count set bits up to the key. So you have another array, key_to_slot, which lets you map from a key i to a slot in the values array. This array is sparse, however, so it only has entries for, say, every 64th possible key in the domain.

This adds another 0.625 bytes per entry, so total overhead is still < 2 bytes per entry.

So something like this:

#include <array>
#include <bit>
#include <cstdint>

template <typename Value>
struct bitmap_map {
  using Key = int;

  // Bit i of key_bits is set iff key i is in the map (500k bits, about 61KB).
  std::array<std::uint64_t, (500000 + 63) / 64> key_bits{};
  // Dense value storage, one entry per present key, in key order.
  std::array<Value, 50000> values;
  // key_to_slot[b] is the index into `values` of the first present key in 64-key block b.
  std::array<int, (500000 + 63) / 64> key_to_slot{};

  bool contains(Key i) const {
    return (key_bits[i / 64] >> (i % 64)) & 1;
  }

  // Number of set bits in [i / 64 * 64, i): one memory access plus one popcnt.
  int aligned_popcnt(Key i) const {
    std::uint64_t below = (std::uint64_t{1} << (i % 64)) - 1;
    return std::popcount(key_bits[i / 64] & below);
  }

  Value at(Key i) const {
    int base_idx = key_to_slot[i / 64];
    int adjusted_index = base_idx + aligned_popcnt(i);
    return values[adjusted_index];
  }
};

aligned_popcnt(i) is the number of set bits between index i / 64 * 64 (i.e., i aligned down to a 64 boundary) and i, and is basically one memory access and one popcnt instruction.

If you want to dynamically mutate the map in between lookups, things get more complicated, since additions or removals would require linear work (shifting up to all the values and updating the key_to_slot array). One way around that is to give each 64-key slice of the domain its own separately allocated array, pointed to by key_to_slot, so the amount of shifting required is small and fixed and key_to_slot never needs linear fixups. This bears some resemblance to std::deque.
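A rough sketch of that per-block variant (it assumes the key is not already present when inserted):

#include <array>
#include <bit>
#include <cstdint>
#include <vector>

template <typename Value>
struct chunked_bitmap_map {
  using Key = int;

  // Bit i is set iff key i is in the map.
  std::array<std::uint64_t, (500000 + 63) / 64> key_bits{};
  // One small, separately allocated value array per 64-key slice of the domain.
  std::array<std::vector<Value>, (500000 + 63) / 64> blocks;

  // Number of present keys below i within i's 64-key block.
  int offset_in_block(Key i) const {
    std::uint64_t below = (std::uint64_t{1} << (i % 64)) - 1;
    return std::popcount(key_bits[i / 64] & below);
  }

  void insert(Key i, const Value& v) {  // assumes key i is not yet present
    auto& block = blocks[i / 64];
    block.insert(block.begin() + offset_in_block(i), v);  // shifts at most 63 values
    key_bits[i / 64] |= std::uint64_t{1} << (i % 64);
  }

  Value at(Key i) const {
    return blocks[i / 64][offset_in_block(i)];
  }
};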

Eutherian answered 14/12, 2023 at 20:1 Comment(0)
