Hi I have the following problem:
I'm storing strings and a corresponding list of integer values in an MultiValueMap<String, Integer>
I'm storing about 13 000 000 million strings and one string can have up to 500 or more values.
For every single value i will have random access on the Map. So worst case are 13 000 000* 500 put calls. Now the speed of the map is good but the memory overhead gets quite high. A MultiValueMap<String, Integer>
is nothing else then a HashMap/TreeMap<String, <ArrayList<Integer>>
. Both HashMap and TreeMap have quite a lot of memory Overhead. I wont be modifying the map once it is done, but I need it to be fast and as small as possible for random access in a program. (I'm storing it on disk and loading it on start, the serialized map file takes up about 600mb but in memory its about 3gb?)
the most memory efficient thing would be, to store the String in sorted string array and have a corresponding two dimensional int array for values. So access would be a binary search on the string array and getting the corresponding values.
Now I have three ways to get there:
I use a sorted MultivalueMap (TreeMap) for the creation phase to store everything.After I'm finished with getting all values, I get the string array by calling
map.keyset().toArray(new String[0]);
Make a two dimensional int array and get all the values from the multivaluemap. Pro: It's easy to implement, It is still fast during creation. Con: It takes up even more memory during the copying from Map to Arrays.I use Arrays or maybe ArrayLists from the start and store everything in there Pro: least memory overhead. Con: this would be enormously slow because i would have to sort/copy the Array every time a add a new Key, Also i would need to implement my own (propably even slower) sorting to keep the corresponding int array in the same order like the strings. Hard to implement
I use Arrays and a MultivalueMap as buffer. After the program finished 10% or 20% of the creation phase, I will add the values to the Arrays and keep them in order, then start a new Map. Pro: Propably still fast enough and memory efficient enough. Con: Hard to implement.
None of these solutions really feel right to me. Do you know any other solutions to this problem, maybe a memory efficient (MultiValue)Map implementation?
I know I could be using a database so don't bother posting it as an answer. I want to know how i could do this without using a database.