Data structure equivalent to Map (in Java) for large datasets

Is there an existing data structure that I can use to assign an integer to an object (in my case an Edge)? I am reading a graph from a file with 10 million vertices and 60 million edges, and I assign a cost to each edge using a map (costs.put(e, cost)).

I create the costs map in this way:

costs = new HashMap<Edge,Integer>();

The exception it throws is:

java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.resize(Unknown Source)
    at java.util.HashMap.addEntry(Unknown Source)
    at java.util.HashMap.put(Unknown Source) 
Sohn answered 31/10, 2012 at 9:12 Comment(8)
I don't think I follow your question. Can you elaborate on what (and why) you are trying to achieve?Lattimore
Are you looking for an alternative to HashMap that is more memory-efficient?Brenza
Yes, there is, and you are using it. Increase your heap size. If you want to implement super-memory-efficient data structures on your own, I suggest studying Lucene's implementation.Freeze
I wanted to know if there is a data structure I can use that does the same thing a map does: assign an integer to an object, in my case.Sohn
If you want to cut memory costs, implement Comparable and use a TreeMapBrenza
@JanDvorak Why would that be more memory-efficient? TreeMap's nodes add more overhead than raw arrays in HashMap, and both rely on Map.Entry instances.Freeze
@MarkoTopolnik hash maps allocate a lot of unused space to be efficient. I don't think the larger nodes of a tree map counter that.Brenza
@JanDvorak That depends on the initial capacity and the load factor. If chosen to carefully match the problem, there is no overhead.Freeze

HashMap is the correct data structure for a basic Map. The problem you are having is that the JVM is not being instructed to reserve enough space to keep the file contents in memory. Start the JVM with a -Xmx flag; for instance, -Xmx1G will allow it to use 1 gigabyte of heap.
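
For example (GraphLoader and graph.txt are placeholder names for your main class and input file), you would launch the program with a bigger heap like this:

java -Xmx1G GraphLoader graph.txt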

Diaster answered 31/10, 2012 at 9:17 Comment(5)
and -XX:MaxPermSize=1000m for increasing your permgen memory limit by 1000 MB.Gainey
@IqbalDjulfri Why would OP waste memory on PermGen?Freeze
Increasing the heap size is the right way forward. In our product we use a custom cache implementation, which uses about 20 GB of heap space - and everything works fine (with the right JVM parameters)Lattimore
@IqbalDjulfri you mean increasing it to 1000 MB, and PermGen does not affect heap space.Couch
I'm sorry for causing unnecessary discussion; it clearly doesn't have anything to do with the question. I just posted it alongside Tim's answer about -Xmx because I clearly remember the first time I had a memory problem with a Java application and how hard it was to find these answers. Thank you :)Gainey

You've got 6e7 edges. A plain Object takes 24 bytes (64-bit HotSpot), so that's 1.44e9 bytes right there (1.5 GB). Now you introduce the most efficient map imaginable, adding only 6e7 references plus 6e7 Integer objects. That's another 2.4e8 bytes for the refs and 1.44e9 bytes for the Integers: another 1.5 GB, the total being now 3 GB—and this is the theoretical lower bound for your problem (modulo caching, see below).

Based on this, I suggest you just extend your Edge class with another int field. This will drastically reduce your memory footprint.

If this is not an option, however, and:

  • all your integers rarely exceed two digits,
  • you are careful never to use new Integer, but Integer.valueOf, autoboxing, etc.,
  • you use Java 7,

you'll automatically benefit from the built-in small integer cache. If the integers assume values from a wider range, but still with a lot of duplication, a custom cache is highly advisable; a sketch of one is shown below.
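
A minimal sketch of such a cache, assuming the cost values repeat often (the IntegerCache class and its intern method are illustrative names, not part of any library):

import java.util.HashMap;
import java.util.Map;

class IntegerCache
{
    private final Map<Integer, Integer> cache = new HashMap<Integer, Integer>();

    // Returns one shared Integer instance per distinct int value.
    Integer intern(int value)
    {
        Integer boxed = cache.get(value);   // the lookup key is autoboxed
        if (boxed == null)
        {
            boxed = Integer.valueOf(value); // box once
            cache.put(boxed, boxed);        // and remember the shared instance
        }
        return boxed;
    }
}

You would then store costs with costs.put(e, integerCache.intern(cost)), so that equal cost values share a single Integer object.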

Freeze answered 31/10, 2012 at 9:24 Comment(1)
"A plain Object takes 24 bytes". Please specify on what JVM & architecture.Drosophila

In addition to changing the JVM's memory settings, you can tweak HashMap's memory management with its initial capacity and load factor.

Javadoc excerpt:

An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
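
For example, sizing the map up front for the roughly 60 million entries mentioned in the question avoids repeated rehashing while the file is loaded (a rough sketch; the figures simply follow the numbers from the question):

int expectedEntries = 60000000;  // ~6e7 edges
float loadFactor = 0.75f;        // HashMap's default load factor
int initialCapacity = (int) (expectedEntries / loadFactor) + 1;
costs = new HashMap<Edge,Integer>(initialCapacity, loadFactor);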

Couch answered 31/10, 2012 at 9:20 Comment(0)

Instead of creating a HashMap

costs = new HashMap<Edge,Integer>();

You can store the cost within the Edge object.

class Edge
{
    Vertex a;
    Vertex b;
    int cost;
}

This way you can save some memory in the system.

Overgrow answered 31/10, 2012 at 9:21 Comment(4)
Yes, but in this case I have to modify a lot of code in my program.Sohn
I believe you should modify the design before the implementation. It makes the product more scalable and maintainable.Overgrow
@RaduStejerean, Map is an interface. Implement a custom Map whose get(Edge key) { return key.cost; }Drosophila
I increased the heap size by adding "-Xmx1G" to the arguments in the run configuration, and it seems to work.Sohn

Back to the original problem: you have Edges that have costs. As your graph is sparse, why not use a sparse matrix? Maybe an Object-to-Integer mapping isn't what you really need and want. You can look at apache.commons.math; I think it has sparse matrices. Also, you need to think about how you access the costs in your algorithms in order to choose the proper sparse format (column-based run-length encoding, row-based RLE, or something else). Or, if you don't care, use any format, but then you should convert it at the beginning of your algorithms.
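
A rough sketch of what this could look like with Commons Math 3, assuming its OpenMapRealMatrix class (it stores entries as doubles and, if I recall correctly, rejects matrices whose total entry count exceeds Integer.MAX_VALUE, so a full 10-million-vertex matrix would need a different indexing scheme); a tiny graph is used here just to show the access pattern:

import org.apache.commons.math3.linear.OpenMapRealMatrix;

// 4 vertices; only the edges that exist take up memory.
OpenMapRealMatrix costMatrix = new OpenMapRealMatrix(4, 4);
costMatrix.setEntry(0, 1, 7.0);          // edge 0 -> 1 has cost 7
costMatrix.setEntry(2, 3, 12.0);         // edge 2 -> 3 has cost 12
double cost = costMatrix.getEntry(0, 1); // missing entries read as 0.0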

Plantar answered 31/10, 2012 at 9:22 Comment(1)
I guess the solution to this problem is a clever (math) way to calculate the cost instead of throwing memory at it. +1Couch

You do realize that this takes a whole lot of RAM, don't you? Try increasing the heap size, and you'll be fine...

And to answer your original question: yes, that is what Maps are for...

Paton answered 31/10, 2012 at 9:17 Comment(0)

You must specify, per project, how much heap space your project needs.

I think you could follow these steps:

Right-click the project - Run As - Run Configurations - Arguments - VM arguments, then add -Xmx2048m
Storfer answered 31/10, 2012 at 9:18 Comment(1)
I have tried that, but it didn't work. It gave me the same exception.Sohn

Perhaps what you are looking for is TObjectIntHashMap from the GNU Trove library. It is similar to HashMap<Edge, Integer>, except that it stores the int as a primitive, potentially saving you some memory. This collection can also be marginally faster when the collection is large (as more of it fits into the CPU cache).

TObjectIntHashMap<Edge> costs = new TObjectIntHashMap<Edge>();

costs.put(e, cost); // cost is stored as a primitive, not its wrapper object.
Gland answered 31/10, 2012 at 9:46 Comment(1)
Thanks, but I managed to increase the heap size and I am not going to change the data structure. At least for now.Sohn
