Memory efficient multivaluemap
Asked Answered
I

5

6

Hi I have the following problem: I'm storing strings and a corresponding list of integer values in an MultiValueMap<String, Integer> I'm storing about 13 000 000 million strings and one string can have up to 500 or more values. For every single value i will have random access on the Map. So worst case are 13 000 000* 500 put calls. Now the speed of the map is good but the memory overhead gets quite high. A MultiValueMap<String, Integer> is nothing else then a HashMap/TreeMap<String, <ArrayList<Integer>>. Both HashMap and TreeMap have quite a lot of memory Overhead. I wont be modifying the map once it is done, but I need it to be fast and as small as possible for random access in a program. (I'm storing it on disk and loading it on start, the serialized map file takes up about 600mb but in memory its about 3gb?)

the most memory efficient thing would be, to store the String in sorted string array and have a corresponding two dimensional int array for values. So access would be a binary search on the string array and getting the corresponding values.

Now I have three ways to get there:

  1. I use a sorted MultivalueMap (TreeMap) for the creation phase to store everything.After I'm finished with getting all values, I get the string array by calling map.keyset().toArray(new String[0]); Make a two dimensional int array and get all the values from the multivaluemap. Pro: It's easy to implement, It is still fast during creation. Con: It takes up even more memory during the copying from Map to Arrays.

  2. I use Arrays or maybe ArrayLists from the start and store everything in there Pro: least memory overhead. Con: this would be enormously slow because i would have to sort/copy the Array every time a add a new Key, Also i would need to implement my own (propably even slower) sorting to keep the corresponding int array in the same order like the strings. Hard to implement

  3. I use Arrays and a MultivalueMap as buffer. After the program finished 10% or 20% of the creation phase, I will add the values to the Arrays and keep them in order, then start a new Map. Pro: Propably still fast enough and memory efficient enough. Con: Hard to implement.

None of these solutions really feel right to me. Do you know any other solutions to this problem, maybe a memory efficient (MultiValue)Map implementation?

I know I could be using a database so don't bother posting it as an answer. I want to know how i could do this without using a database.

Instal answered 16/2, 2012 at 21:27 Comment(12)
Quick question: 500 * 4 * 13,000,000 is 26,000,000,000 bytes or +/- 24GB - are you considering storing this data off-heap?Tight
Hi 500 is a worst case estimate most of the strings will have 1 or 2 values, only. Right now I'm running the program with -Xmx12g but I'm storing additional values in another Map. As I sad the the Map takes up around 3g in memory and around 644mb on disk.Instal
Sry I didn't get the off-Heap storing, I just googled it, it sounds interesting.Instal
Do you need to create all the entries simultaneously, or can you build them up gradually. The latter would let you use a simple solution during the creation phase of each entry/row but would have the full memory overhead as you could convert each row to a sorted array once it was complete. Hard to know without more details...Dramatics
I would really follow the database approach, maybe HSQL (in-memory), or maybe BerkeleyDB.Gazpacho
What kind of keys are you using? Is there any kind of pattern? All the same length? Common roots etc.? I'm guessing it's the keys that make up most of the space, am I right? I'm thinking tries.Overtone
@DNA, I will get one key-value pair at a time during creation this gradually builds up the list associated with each key. I only keep distinct values in the lists, but count how often i encountered them in an separate Map.Instal
you said you wouldn't be modifying the map once "it" is done, and then you say you don't like ArrayLists because of key insertion performance. I don't get it (you can sort before adding to the array). Also, how much does disk cost matter to you compared to speed, amount of work to put in, etc..? Why no DB?Supererogatory
the Keys are anchor texts extracted from a wikipedia dump, the integer values are ids of articles that they link to. So ambiguous anchor texts have a lot of values unambiguous long ones just one or few.Instal
@UmNyobe, right now the creation of the map takes with other factors around 2-4 hours. This time can go up (8 hours still would be ok, but with sorting I fear it would increase too much). After it is created i store it on disk for later use. Why no DB? Well: I'm not experienced with them and I don't have the time right now to learn it , I think having everything in memory which is already possible is faster, I just want to reduce the memory consumption so i can run it on another machine that has less memory then the one right now.. thats why switching from maps to arrays /arraylistInstal
@Overtone Yes there are a lot of patterns and a lot of them have common roots so a trie would probably be very effective. Do you know good free trie implementations for java?Instal
@user1083775 I'll post one tomorrow. See my initial answer below.Overtone
A
5

If you switched to Guava's Multimap -- I have no idea if that's possible for your application -- you might be able to use Trove and get

ListMultimap<String, Integer> multimap = Multimaps.newListMultimap(
  new HashMap<String, Collection<Integer>>(),
  new Supplier<List<Integer>>() {
    public List<Integer> get() {
      return new TIntListDecorator();
    }
  });

which will make a ListMultimap that uses a HashMap to map to List values backed by int[] arrays, which should be memory-efficient, though you'll pay a small speed penalty because of boxing. You might be able to do something similar for MultiValueMap, though I have no idea what library that's from.

Apennines answered 16/2, 2012 at 22:22 Comment(0)
H
2

Depending on which Integer values you store in your map, a large amount of your heap memory overhead may be caused by having distinct Integer instances, which take up much more RAM than a primitive int value.

Consider using a Map from String to one of the many IntArrayList implementations floating around (e.g. in Colt or in Primitive Collections for Java), which basically implement a List backed by an int array, instead of a being backed by an array of Integer instances.

Hateful answered 16/2, 2012 at 21:52 Comment(0)
F
2

You can use compressed strings to reduce drastically the memory usage.

Furthermore, there are other more drastic solutions (it would require some reimplementation):

Fsh answered 16/2, 2012 at 22:20 Comment(4)
Hi thanks for this hint didn't know that option, i will try it out and check memory gains.Instal
@user1083775 Great! It would be nice to post the results to us, then we can see how is it with and without the parameter. Also, I've updated my answer with more options, despite the fact that I guess that those wouldn't help that much, but who knows it can help in some way else...Fsh
I tried the parameter, Loading the Map from disk to memory with:-Xmx16g :2,939.7998428344mb with -Xmx16g -XX:+UseCompressedStrings: 2,715.9821014404 did two runs always same result so improves a little i guess the values or index struktures take most of the memoryInstal
@user1083775 Are you running the application in a 64-bit environment? If so, you could try the -XX:+UseCompressedOops, take a look at Java HotSpotVM Options.Fsh
C
2

First, consider the memory taken by the integers. You said that the range will be about 0-4000000. 24 bits is enough to represent 16777216 distinct values. If that is acceptable, you could use byte arrays for the integers, with 3 bytes per integer, and save 25%. You would have to index into the array something like this:

int getPackedInt(byte[] array, int index) {
  int i = index*3;
  return ((array[i] & 0xFF)<<16) + ((array[i+1] & 0xFF) <<8) + (array[i+2] & 0xFF);
}
int storePackedInt(byte[] array, int index, int value) {
  assert value >= 0 && value <= 0xFFFFFF;
  int i = index*3;
  array[i]   = (byte)((value>>16) & 0xFF);
  array[i+1] = (byte)((value>>8)  & 0xFF);
  array[i+2] = (byte)(value & 0xFF);
}

Can you say anything about the distribution of the integers? If many of them will fit in 16 bits, you could use an encoding with a variable number of bytes per number (something like UTF-8 does for representing characters).

Next, consider whether you can save memory on the Strings. What are the characteristics of the Strings? How long will they typically be? Will many strings share prefixes? A compression scheme tailored to the characteristics of your application could save a lot of space (as falsarella pointed out). OR, if many strings will share prefixes, storing them in some type of search trie could be more efficient. (There is a type of trie called "patricia" which might be suitable for this application.) As a bonus, note that searching for Strings in a trie can be faster than searching a hash map (though you'd have to benchmark to see if that is true in your application).

Will the Strings all be ASCII? If so, 50% of the memory used for Strings will be wasted, as a Java char is 16 bits. Again, in this case, you could consider using byte arrays.

If you only need to look Strings up, not iterate over the stored Strings, you could also consider something rather unconventional: hash the Strings, and keep only the hash. Since different String can hash to the same value, there is a chance that a String which was never stored, may still be "found" by a search. But if you use enough bits for the hash value (and a good hash function), you can make that chance so infinitesimally small that it will almost certainly never happen in the estimated lifespan of the universe.

Finally, there is the memory for the structure itself, which holds the Strings and integers. I already suggested using a trie, but if you decide not to do that, nothing will use less memory than parallel arrays -- one sorted array of Strings (which you can do binary search on, as you said), and a parallel array of arrays of integers. After you do a binary search to find an index into the String array, you can use the same index to access the array-of-integer array.

While you are building the structure, if you do decide that a search trie is a good choice, I would just use that directly. Otherwise, you could do 2 passes: one to build up a set of strings (then put them into an array and sort them), and a second pass to add the arrays of integers.

Conney answered 16/2, 2012 at 22:31 Comment(1)
the range is 0- 4 000 000 (currently 3,9 million; nr of artikels in wikipedia can grow). The strings are anchor texts from wikipedia. someone suggested a trie in the comments. I will look into that.Instal
O
2

If there are patterns to your key strings, especially common roots, then a a Trie could be an effective method of storing significantly less data.

Here's the code for a working TrieMap.

NB: The usual advice on using EntrySet to iterate across Maps does not apply to Tries. They are exceptionally inefficient in a Trie so please avoid requesting one if at all possible.

/**
 * Implementation of a Trie structure.
 * 
 * A Trie is a compact form of tree that takes advantage of common prefixes
 * to the keys. 
 * 
 * A normal HashSet will take the key and compute a hash from it, this hash will
 * be used to locate the value through various methods but usually some kind
 * of bucket system is used. The memory footprint resulting becomes something
 * like O(n).
 * 
 * A Trie structure essentuially combines all common prefixes into a single key.
 * For example, holding the strings A, AB, ABC and ABCD will only take enough 
 * space to record the presence of ABCD. The presence of the others will be 
 * recorded as flags within the record of ABCD structure at zero cost.
 * 
 * This structure is useful for holding similar strings such as product IDs or
 * credit card numbers.
 *  
 */
public class TrieMap<V> extends AbstractMap<String, V> implements Map<String, V> {


  /**
   * Map each character to a sub-trie.
   * 
   * Could replace this with a 256 entry array of Tries but this will handle
   * multibyte character sets and I can discard empty maps. 
   * 
   * Maintained at null until needed (for better memory footprint).
   * 
   */
  protected Map<Character, TrieMap<V>> children = null;

  /**
   * Here we store the map contents.
   */
  protected V leaf = null;

  /**
   * Set the leaf value to a new setting and return the old one.
   * 
   * @param newValue
   * @return old value of leaf.
   */
  protected V setLeaf(V newValue) {
    V old = leaf;
    leaf = newValue;
    return old;
  }

  /**
   * I've always wanted to name a method something like this.
   */
  protected void makeChildren () {
    if ( children == null ) {
      // Use a TreeMap to ensure sorted iteration.
      children = new TreeMap<Character, TrieMap<V>>();
    }
  }

  /**
   * Finds the TrieMap that "should" contain the key.
   * 
   * @param key 
   * 
   * The key to find.
   * 
   * @param grow 
   * 
   * Set to true to grow the Trie to fit the key.
   * 
   * @return 
   * 
   * The sub Trie that "should" contain the key or null if key was not found and
   * grow was false.
   */
  protected TrieMap<V> find(String key, boolean grow) {
    if (key.length() == 0) {
      // Found it!
      return this;
    } else {
      // Not at end of string.
      if (grow) {
        // Grow the tree.
        makeChildren();
      }
      if (children != null) {
        // Ask the kids.
        char ch = key.charAt(0);
        TrieMap<V> child = children.get(ch);
        if (child == null && grow) {
          // Make the child.
          child = new TrieMap<V>();
          // Store the child.
          children.put(ch, child);
        }
        if (child != null) {
          // Find it in the child.
          return child.find(tail(key), grow);
        }
      }
    }
    return null;

  }

  /**
   * Remove the head (first character) from the string.
   * 
   * @param s
   * 
   * The string.
   * 
   * @return 
   * 
   * The same string without the first (head) character.
   * 
   */
  // Suppress warnings over taking a subsequence
  private String tail(String s) {
    return s.substring(1, s.length());
  }

  /**
   * 
   * Add a new value to the map.
   * 
   * Time footprint = O(s.length).
   * 
   * @param s
   * 
   * The key defining the place to add.
   * 
   * @param value
   * 
   * The value to add there.
   * 
   * @return
   * 
   * The value that was there, or null if it wasn't.
   * 
   */
  @Override
  public V put(String key, V value) {
    V old = null;

    // If empty string.
    if (key.length() == 0) {
      old = setLeaf(value);
    } else {
      // Find it.
      old = find(key, true).put("", value);
    }

    return old;
  }

  /**
   * Gets the value at the specified key position.
   * 
   * @param o
   * 
   * The key to the location.
   * 
   * @return 
   * 
   * The value at that location, or null if there is no value at that location.
   */
  @Override
  public V get(Object o) {
    V got = null;
    if (o != null) {
      String key = (String) o;
      TrieMap<V> it = find(key, false);
      if (it != null) {
        got = it.leaf;
      }
    } else {
      throw new NullPointerException("Nulls not allowed.");
    }
    return got;
  }

  /**
   * Remove the value at the specified location.
   * 
   * @param o
   * 
   * The key to the location.
   * 
   * @return 
   * 
   * The value that was removed, or null if there was no value at that location.
   */
  @Override
  public V remove(Object o) {
    V old = null;
    if (o != null) {
      String key = (String) o;
      if (key.length() == 0) {
        // Its me!
        old = leaf;
        leaf = null;
      } else {
        TrieMap<V> it = find(key, false);
        if (it != null) {
          old = it.remove("");
        }
      }
    } else {
      throw new NullPointerException("Nulls not allowed.");
    }
    return old;
  }

  /**
   * Count the number of values in the structure.
   * 
   * @return 
   * 
   * The number of values in the structure.
   */
  @Override
  public int size() {
    // If I am a leaf then size increases by 1.
    int size = leaf != null ? 1 : 0;
    if (children != null) {
      // Add sizes of all my children.
      for (Character c : children.keySet()) {
        size += children.get(c).size();
      }
    }
    return size;
  }

  /**
   * Is the tree empty?
   * 
   * @return 
   * 
   * true if the tree is empty.
   * false if there is still at least one value in the tree.
   */
  @Override
  public boolean isEmpty() {
    // I am empty if I am not a leaf and I have no children 
    // (slightly quicker than the AbstaractCollection implementation).
    return leaf == null && (children == null || children.isEmpty());
  }

  /**
   * Returns all keys as a Set.
   * 
   * @return 
   * 
   * A HashSet of all keys.
   * 
   * Note: Although it returns Set<S> it is actually a Set<String> that has been
   * home-grown because the original keys are not stored in the structure 
   * anywhere.
   */
  @Override
  public Set<String> keySet() {
    // Roll them a temporary list and give them a Set from it.
    return new HashSet<String>(keyList());
  }

  /**
   * List all my keys.
   * 
   * @return 
   * 
   * An ArrayList of all keys in the tree.
   * 
   * Note: Although it returns List<S> it is actually a List<String> that has been
   * home-grown because the original keys are not stored in the structure 
   * anywhere.
   * 
   */
  protected List<String> keyList() {
    List<String> contents = new ArrayList<String>();

    if (leaf != null) {
      // If I am a leaf, a null string is in the set.
      contents.add((String) "");
    }

    // Add all sub-tries.
    if (children != null) {
      for (Character c : children.keySet()) {
        TrieMap<V> child = children.get(c);
        List<String> childContents = child.keyList();
        for (String subString : childContents) {
          // All possible substrings can be prepended with this character.
          contents.add((String) (c + subString.toString()));
        }
      }
    }

    return contents;
  }

  /**
   * Does the map contain the specified key.
   * 
   * @param key
   * 
   * The key to look for.
   * 
   * @return 
   * 
   * true if the key is in the Map.
   * false if not.
   */
  public boolean containsKey(String key) {
    TrieMap<V> it = find(key, false);
    if (it != null) {
      return it.leaf != null;
    }
    return false;
  }

  /**
   * Represent me as a list.
   * 
   * @return 
   * 
   * A String representation of the tree.
   */
  @Override
  public String toString() {
    List<String> list = keyList();
    //Collections.sort((List<String>)list);
    StringBuilder sb = new StringBuilder();
    Separator comma = new Separator(",");
    sb.append("{");
    for (String s : list) {
      sb.append(comma.sep()).append(s).append("=").append(get(s));
    }
    sb.append("}");
    return sb.toString();
  }

  /**
   * Clear down completely.
   */
  @Override
  public void clear() {
    children = null;
    leaf = null;
  }

  /**
   * Return a list of key/value pairs.
   * 
   * @return 
   * 
   * The entry set.
   */
  public Set<Map.Entry<String, V>> entrySet() {
    Set<Map.Entry<String, V>> entries = new HashSet<Map.Entry<String, V>>();
    List<String> keys = keyList();
    for (String key : keys) {
      entries.add(new Entry<String,V>(key, get(key)));
    }
    return entries;
  }

  /**
   * An entry.
   * 
   * @param <S>
   * 
   * The type of the key.
   * 
   * @param <V> 
   * 
   * The type of the value.
   */
  private static class Entry<S, V> implements Map.Entry<S, V> {

    protected S key;
    protected V value;

    public Entry(S key, V value) {
      this.key = key;
      this.value = value;
    }

    public S getKey() {
      return key;
    }

    public V getValue() {
      return value;
    }

    public V setValue(V newValue) {
      V oldValue = value;
      value = newValue;
      return oldValue;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof TrieMap.Entry)) {
        return false;
      }
      Entry e = (Entry) o;
      return (key == null ? e.getKey() == null : key.equals(e.getKey()))
              && (value == null ? e.getValue() == null : value.equals(e.getValue()));
    }

    @Override
    public int hashCode() {
      int keyHash = (key == null ? 0 : key.hashCode());
      int valueHash = (value == null ? 0 : value.hashCode());
      return keyHash ^ valueHash;
    }

    @Override
    public String toString() {
      return key + "=" + value;
    }
  }
}
Overtone answered 16/2, 2012 at 23:51 Comment(5)
Regarding the hashCode function. I have started making the null-value placeholder non-zero, and different for each field. That's because in the case that the Key and Value types are the same, I want (null, value) to hash to a different number than (value, null), and for both of them to hash to a different value than (null, null).Kemme
@Kemme You can't really do it when implementing Map. While the hashCode algorithm is pretty stupid (especially when key==value), it's prescribed.Jasun
Oh yeah, this is a Map implementation. I forgot that the key can’t be null. But what about in the general case? If you’re not implementing an interface that you want to maintain quality comparison with other implementations of, there’s. I thing stopping you, right?Kemme
@Kemme That's what I meant: docs.oracle.com/javase/8/docs/api/java/util/…. In general, null is allowed in a Map, but implementations are free to forbid it. I'd use non-zero for null and e.g., keyHash ^ 5*valueHash, if I could, but it's carved in stone.Jasun
Sorry about typos in last comment!Kemme

© 2022 - 2025 — McMap. All rights reserved.