HashMap alternatives for memory-efficient data storage
I've currently got a spreadsheet-type program that keeps its data in an ArrayList of HashMaps. You'll no doubt be shocked when I tell you that this hasn't proven ideal. The overhead seems to be about 5x the size of the data itself.

This question asks about efficient collections libraries, and the answer was to use Google Collections. My follow-up is "which part?". I've been reading through the documentation but don't feel like it gives a very good sense of which classes are a good fit for this. (I'm also open to other libraries or suggestions.)

So I'm looking for something that will let me store dense spreadsheet-type data with minimal memory overhead.

  • My columns are currently referenced by Field objects, rows by their indexes, and values are Objects, almost always Strings
  • Some columns will have a lot of repeated values
  • Primary operations are to update or remove records based on values of certain fields, and also adding/removing/combining columns
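
For concreteness, here is roughly what the current layout looks like, reconstructed from the description above and the comments below (Field stands in for my actual column class):

    import java.util.*;

    class CurrentLayout {
        static class Field {}  // stand-in for the real column-key class

        // One HashMap per row: every cell pays a full map entry's
        // overhead on top of the reference to its value.
        List<Map<Field, Object>> records = new ArrayList<>();
    }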

I'm aware of options like H2 and Derby but in this case I'm not looking to use an embedded database.

EDIT: If you're suggesting libraries, I'd also appreciate it if you could point me to a particular class or two in them that would apply here. Whereas Sun's documentation usually includes information about which operations are O(1), which are O(N), etc, I'm not seeing much of that in third-party libraries, nor really any description of which classes are best suited for what.

Predicant asked 19/10, 2010 at 19:47 Comment(2)
Here is a tool to help you benchmark the memory footprint of whatever structure you choose: code.google.com/p/memory-measurer, and see some example data I derived by it: code.google.com/p/memory-measurer/wiki/…Aero
Above links got brokenUnleash

So I'm assuming that you have a Map<ColumnName, Column>, where the column is actually something like an ArrayList<Object>.

A few possibilities -

  • Are you completely sure that memory is an issue? If you're just generally worried about size it'd be worth confirming that this will really be an issue in a running program. It takes an awful lot of rows and maps to fill up a JVM.

  • You could test your data set with different types of maps in the collections. Depending on your data, you can also initialize maps with preset size/load factor combinations that may help. I've messed around with this in the past, you might get a 30% reduction in memory if you're lucky.

  • What about storing your data in a single matrix-like data structure (an existing library implementation or something like a wrapper around a List of Lists), with a single map that maps column keys to matrix columns? (Sketched below.)
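
A minimal sketch of that third option, assuming column keys are plain Objects and the rows are dense:

    import java.util.*;

    // One ArrayList per column plus a single map from column key to its
    // list, instead of one HashMap per row. Each cell costs one reference.
    class ColumnTable {
        private final Map<Object, List<Object>> columns = new LinkedHashMap<>();
        private int rowCount = 0;

        void addColumn(Object key) {
            columns.put(key, new ArrayList<>(Collections.<Object>nCopies(rowCount, null)));
        }

        void addRow() {
            for (List<Object> col : columns.values()) col.add(null);
            rowCount++;
        }

        Object get(Object column, int row) { return columns.get(column).get(row); }
        void set(Object column, int row, Object value) { columns.get(column).set(row, value); }
    }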

Speciality answered 19/10, 2010 at 20:7 Comment(4)
Actually each record is a Map<Field,Object>, where the Objects are the values of each field. All the records are contained in an ArrayList. Memory is definitely an issue. Loading a file of 50MB can sometimes exceed 1GB of memory, which leads me to believe that my current implementation is horribly naive.Predicant
I will do some testing with different options; what I'm trying to do here is narrow the field to a few specific classes within different libraries that I can compare.Predicant
@bemace: Are you reusing the same Field objects for each record Map instance?Hegyera
3rd point led me to my eventual solution using Colt's SparseMatrix (see my answer below)Predicant

Some columns will have a lot of repeated values

immediately suggests to me the possible use of the Flyweight pattern, regardless of the solution you choose for your collections.
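
As an illustration, a minimal flyweight pool for repeated cell values might look like the following (Guava's Interner interface, mentioned in another comment on this page, does the same job):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Repeated values all resolve to one canonical instance, so
    // duplicates cost only a reference, not a whole new String.
    class ValuePool {
        private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<>();

        String intern(String value) {
            String canonical = pool.putIfAbsent(value, value);
            return canonical != null ? canonical : value;
        }
    }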

Know answered 19/10, 2010 at 19:49 Comment(1)
While not addressing the main problem, this did nudge me to finally figure out how to flyweight Strings in java properly. Thanks. #3973341Predicant

Trove collections take particular care about the space they occupy (they also have tailored data structures if you stick to primitive types). Take a look here.

Otherwise you can try Apache Commons Collections. Just run your own benchmarks!

In any case, if you've got many references to the same elements, try a suitable pattern (like Flyweight).

Beavers answered 19/10, 2010 at 19:50 Comment(2)
Trove won't work for me because I'm not using primitives. I see HashedMap in Apache collections is a "general purpose alternative" but they don't give any explanation of what's different from regular HashMap. Have any insight there?Predicant
Actually, I see it does mention adding iteration functionality. However, my issue is with performance, not missing features.Predicant

Chronicle Map could have an overhead of less than 20 bytes per entry (see a test proving this). For comparison, java.util.HashMap's overhead varies from 37 to 42 bytes with -XX:+UseCompressedOops and from 58 to 69 bytes without compressed oops (reference).

Additionally, Chronicle Map stores keys and values off-heap, so it doesn't store Object headers, which are not accounted for in the HashMap overhead above. Chronicle Map integrates with Chronicle-Values, a library for generating flyweight implementations of interfaces, the pattern suggested by Brian Agnew in another answer.
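
A hedged sketch of what that looks like with Chronicle Map's builder (method names as in the Chronicle Map 3 docs; the "row:column" composite key here is just an illustration):

    import net.openhft.chronicle.map.ChronicleMap;

    class OffHeapCells {
        public static void main(String[] args) {
            // Off-heap map sized up front; keys and values are
            // serialized outside the Java heap.
            try (ChronicleMap<String, String> cells = ChronicleMap
                    .of(String.class, String.class)
                    .averageKey("42:lastName")
                    .averageValue("a typical cell value")
                    .entries(1_000_000)
                    .create()) {
                cells.put("42:lastName", "Smith");
                System.out.println(cells.get("42:lastName"));
            }
        }
    }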

Sol answered 18/3, 2017 at 22:24 Comment(0)

Assuming all your rows have most of the same columns, you can just use an array for each row, and a Map<ColumnKey, Integer> to look up which column corresponds to which array index. This way you have only 4-8 bytes of overhead per cell.

If Strings are often repeated, you could use a String pool to reduce duplication of strings. Object pools for other immutable types may be useful in reducing memory consumed.

EDIT: You can structure your data as either row based or column based. If it's row based (one array of cells per row), adding or removing a row is just a matter of adding or removing that array. If it's column based, you can have an array per column, which can make handling primitive types much more efficient: you can have one column which is an int[] and another which is a double[]. It's much more common for an entire column to have the same data type than for a whole row.

However, whichever way you structure the data, it will be optimised for either row or column modification; performing an add/remove of the other type will force a rebuild of the entire dataset.

(Something I do is keep row-based data and add columns at the end; if a row isn't long enough, the column reads as a default value, which avoids a rebuild when adding a column. Rather than removing a column, I have a means of ignoring it. A sketch follows.)
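
A sketch of this layout, with hypothetical names (rows as plain Object[], one shared column-index map, columns appended at the end):

    import java.util.*;

    class RowStore {
        private final Map<Object, Integer> columnIndex = new LinkedHashMap<>();
        private final List<Object[]> rows = new ArrayList<>();

        // New columns go at the end, so existing rows need no rebuild.
        void addColumn(Object key) { columnIndex.put(key, columnIndex.size()); }

        void addRow(Object... values) { rows.add(values); }

        Object get(int row, Object column) {
            Object[] r = rows.get(row);
            Integer i = columnIndex.get(column);
            // rows shorter than the column count read as a default (null)
            return (i == null || i >= r.length) ? null : r[i];
        }
    }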

Tarnish answered 19/10, 2010 at 20:43 Comment(2)
If the original poster's values really are dense, this will work great. Object[][] or List<Object[]>. Don't discount the old standbys! Add Field#getNumber() and you're golden. As for duplication of values, guava-libraries 'Interner' interface would seem to fit the bill.Osmium
Not a bad idea, but how do you handle adding and removing rows/columns with that sort of structure?Predicant

Guava does include a Table interface and a hash-based implementation. It seems like a natural fit for your problem. Note that this is still marked as beta.
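
Basic usage looks something like this (HashBasedTable is the hash-based implementation; note the caveat about its memory layout in the comments below):

    import com.google.common.collect.HashBasedTable;
    import com.google.common.collect.Table;

    class TableDemo {
        public static void main(String[] args) {
            // A Table keys each value by a (row, column) pair.
            Table<Integer, String, Object> sheet = HashBasedTable.create();
            sheet.put(0, "name", "Alice");
            sheet.put(1, "name", "Bob");

            System.out.println(sheet.get(0, "name"));   // Alice
            System.out.println(sheet.column("name"));   // live view of the column
        }
    }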

Coronagraph answered 19/10, 2010 at 20:33 Comment(3)
The Guava Table implementations are backed by a Map of Maps. As a result, they won't reduce memory usage.Gearhart
@Jared I would assume that would depend on the implementation of Map used?Predicant
@bemace The primary Map implementations -- HashMap and TreeMap -- have significant memory overhead when there are many of them. I created an array-based table, which is memory-efficient when dense but requires fixed row and column keys, but it hasn't yet been open-sourced in Guava.Gearhart

keeps its data in an ArrayList of HashMaps
Well, this part seems terribly inefficient to me. An empty HashMap already allocates 16 * (pointer size) bytes for its table (16 is the default initial capacity), plus some fields for the map object itself. If you have a lot of sparsely filled rows, this could be a big problem.

One option would be to use a single large hash with a composite key (combining row and column), as sketched below. That doesn't make operations on whole rows very efficient, though.
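
A sketch of the composite-key idea (the Cell class here is hypothetical):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Objects;

    class CompositeKeySheet {
        // One big map keyed by (row, column): only occupied cells cost
        // memory, but whole-row operations require scanning the map.
        static final class Cell {
            final int row;
            final Object column;  // e.g. the Field object

            Cell(int row, Object column) { this.row = row; this.column = column; }

            @Override public boolean equals(Object o) {
                return o instanceof Cell && ((Cell) o).row == row
                        && ((Cell) o).column.equals(column);
            }
            @Override public int hashCode() { return Objects.hash(row, column); }
        }

        private final Map<Cell, Object> cells = new HashMap<>();

        void set(int row, Object column, Object value) { cells.put(new Cell(row, column), value); }
        Object get(int row, Object column) { return cells.get(new Cell(row, column)); }
    }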

Also, since you don't mention adding cells, you can create the hashes with only the necessary internal storage (the initialCapacity parameter).

I don't know much about Google Collections, so I can't help there. Also, if you find any useful optimization, please do post here! It would be interesting to know.

Ignazio answered 19/10, 2010 at 20:6 Comment(1)
I assure you it is terribly inefficient, that's why I'm here :) In my case sparse rows aren't a big issue.Predicant

I've been experimenting with using the SparseObjectMatrix2D from the Colt project. My data is pretty dense but their Matrix classes don't really offer any way to enlarge them, so I went with a sparse matrix set to the maximum size.

It seems to use roughly 10% less memory and loads about 15% faster for the same data, as well as offering some clever manipulation methods. Still interested in other options though.
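
For reference, usage is straightforward; the maximum dimensions below are hypothetical, sized up front because the matrix can't grow:

    import cern.colt.matrix.impl.SparseObjectMatrix2D;

    class ColtSheet {
        private static final int MAX_ROWS = 1_000_000;  // hypothetical limits
        private static final int MAX_COLS = 256;

        public static void main(String[] args) {
            SparseObjectMatrix2D data = new SparseObjectMatrix2D(MAX_ROWS, MAX_COLS);
            data.set(42, 3, "some value");
            System.out.println(data.get(42, 3));
        }
    }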

Predicant answered 29/10, 2010 at 15:10 Comment(0)

For me, Apache Commons Collections did not save any space. Here are two similar heap dumps taken just before an OOME, comparing Java 11's HashMap to Apache Commons' HashedMap: [two heap dump screenshots]

The Apache Commons HashedMap doesn't appear to make any meaningful difference.

Parclose answered 12/8, 2021 at 22:56 Comment(0)

From your description, it seems that instead of an ArrayList of HashMaps you probably want a (Linked)HashMap of ArrayLists (each ArrayList would be a column).

I'd add a two-way map between field name and column number, and some clever getters/setters that never throw IndexOutOfBoundsException.

You can also use an ArrayList<ArrayList<Object>> (basically a jagged, dynamically growing matrix) and keep the mapping to field (column) names outside, as in the sketch below.
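
For example, a "safe" getter over such a jagged structure might look like this (hypothetical sketch):

    import java.util.List;

    class JaggedSheet {
        // Out-of-range row/column indexes read as null instead of
        // throwing IndexOutOfBoundsException.
        static Object cell(List<List<Object>> columns, int col, int row) {
            if (col >= columns.size()) return null;
            List<Object> column = columns.get(col);
            return row < column.size() ? column.get(row) : null;
        }
    }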

Some columns will have a lot of repeated values

I doubt this matters, especially if they are Strings (they can be interned), since your collection would only store references to them.

Macgregor answered 19/10, 2010 at 20:9 Comment(0)

Why don't you try using a cache implementation like EHCache? This turned out to be very effective for me when I hit the same situation.
You can simply store your collection within the EHCache implementation. There are configuration options like:

Maximum bytes to be used from the local heap.

Once the memory used by your application exceeds what is configured in the cache, the cache implementation takes care of writing the data to disk. You can also configure the amount of time after which objects are written to disk, using a Least Recently Used algorithm. You can be sure of avoiding out-of-memory errors with this type of cache implementation; it only increases your application's I/O by a small degree.
This is just a bird's-eye view of the configuration; there are a lot of options to tune it to your requirements. A rough sketch follows.
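
A rough sketch using the classic Ehcache 2.x API (constructor arguments: name, max in-memory elements, overflow to disk, eternal, TTL and TTI in seconds; check the docs for your version):

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;

    class CacheDemo {
        public static void main(String[] args) {
            CacheManager manager = CacheManager.create();
            // Up to 10,000 entries on heap; the rest spill to disk,
            // evicted in least-recently-used order (the default policy).
            Cache sheet = new Cache("sheetData", 10_000, true, false, 0, 300);
            manager.addCache(sheet);

            sheet.put(new Element("42:lastName", "Smith"));
            Element e = sheet.get("42:lastName");
            System.out.println(e == null ? null : e.getObjectValue());

            manager.shutdown();
        }
    }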

Chewink answered 21/7, 2012 at 14:12 Comment(0)
