List of maps: efficient implementations
I have code that creates and uses a collection such as:

List<Map<String, Object>> tableData;

This list of maps gets populated with n maps, each representing one row in a database. Each row is represented as a map from the field name to the object corresponding to that field (the type is irrelevant in this case). Some of the fields might be missing. The number of fields, m, is always much smaller than the number of rows (n ≈ 10000 × m). I need to reuse the same collection a few times to read through all the rows, so I can't just use some kind of lazy iterator.

Is there an efficient data structure to store this? Guava provides a Table collection but that doesn't seem to fit the bill. I am thinking about creating an interface such as:

interface TableData {
  int size();
  Map<String, Object> get(int i);
  // ... (iterators, etc.)
}

And then create an implementation that uses one Map<String, List<Object>>, so that I only instantiate m lists instead of n maps and create maps on the fly only when needed. But I was wondering if there was a more general-purpose data structure out there.
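A minimal sketch of that column-major idea (class and method names are hypothetical, Java 8+): one list per field, padded with null for missing values, with a row map materialized only on demand.

```java
import java.util.*;

// Column-major sketch: one List<Object> per field, so only m lists
// exist instead of n maps; rows become maps only when requested.
class ColumnarTableData {
    private final Map<String, List<Object>> columns = new LinkedHashMap<>();
    private int size = 0;

    int size() { return size; }

    // Append one row; columns absent from the row get a null slot.
    void addRow(Map<String, Object> row) {
        for (String field : row.keySet()) {
            // A column seen for the first time is back-filled with nulls.
            columns.computeIfAbsent(field,
                    k -> new ArrayList<>(Collections.nCopies(size, null)));
        }
        for (Map.Entry<String, List<Object>> e : columns.entrySet()) {
            e.getValue().add(row.get(e.getKey())); // null when field absent
        }
        size++;
    }

    // Materialize row i as a map, created on the fly only when needed.
    Map<String, Object> get(int i) {
        Map<String, Object> rowView = new HashMap<>();
        for (Map.Entry<String, List<Object>> e : columns.entrySet()) {
            Object v = e.getValue().get(i);
            if (v != null) rowView.put(e.getKey(), v); // skip missing fields
        }
        return rowView;
    }
}
```

Note that each get(i) allocates a fresh map; a lighter variant would return a read-only Map view backed directly by the column lists.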

Thanks

Carlow answered 28/10, 2013 at 16:37 Comment(2)
Could you please give a reason why a Guava Table, such as HashBasedTable, will not work for you? Your in-memory table size is not super-big, and the overhead of having a HashMap vs an ArrayList per row is a constant factor quite close to 3, according to stackoverflow.com/questions/1526596. Could you provide a rough estimate of your memory constraints?Asdic
A factor of 3 is not negligible if the total size of the data is 1-200 MB. I'm trying to explore all the options. Also, how much more efficient is HashBasedTable?Carlow

I ran some tests (not conclusive by any means but very indicative) to establish the memory footprint of different List<Map<String, Object>> implementations. The baseline is Java's ArrayList<> with the elements being instances of Guava's ImmutableMap.

The implementations I compared to are the following:

  1. Implementation based on a Map<String,List<Object>> using a HashMap and ArrayLists;
  2. Implementation based on a List<Object[]> using an ArrayList;
  3. Guava's HashBasedTable<Integer,String,Object>;
  4. Guava's ArrayTable<Integer,String,Object>;

My test consisted of generating n random rows, each having m columns and a "fill factor" of k, where the fill factor is defined as the probability that a row contains a value for a given column. For simplicity, the values are random strings of length l generated using Apache Commons RandomStringUtils.

But let's get to the results. Having n = 200000, m = 50, l = 10 and k in (1.0, 0.75, 0.5) I got the following memory footprints as a percentage of the baseline:

    | k = 1.0  | k = 0.75 | k = 0.5  |
----------------------------------------
1.  |     71 % |     71 % |     71 % |
2.  |     71 % |     72 % |     73 % |
3.  |    111 % |    107 % |    109 % |
4.  |     71 % |     73 % |     76 % |

I tried reducing n to 20000 with about the same results.

I found the results above quite interesting. First of all, it looks like there isn't much room for improvement beyond 70% of the baseline. Second, I was pleasantly surprised to find out that Guava's ArrayTable is as good as the two implementations proposed in this question. I'll keep digging, but I'm leaning towards solution 1.

Thanks

Carlow answered 29/10, 2013 at 13:59 Comment(1)
Nice research. This is helpful to me.Superfluid

First please make sure you really need to optimize.

Assuming that on the average no more than maybe 50% of the columns are missing, List<Object[]> is the clear winner:

class TableDataImpl implements TableData {
    private List<Object[]> data;
    private Map<String, Integer> columnNameToIndexMap;

    public Map<String, Object> get(int i) {
        return new ArrayMap(data.get(i));
    }

    private class ArrayMap implements Map<String, Object> {

        private Object[] row;

        ArrayMap(Object[] row) {
            this.row = row;
        }

        public Object get(Object key) { // note: Map.get takes Object, not String
            Integer index = columnNameToIndexMap.get(key);
            if (index == null) return null;
            return row[index];
        }

        // all the other Map stuff... a lot of code!
    }
}
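Most of "the other Map stuff" can be inherited rather than written by hand: java.util.AbstractMap derives everything from entrySet(). A hedged sketch along those lines (names hypothetical, null slots treated as absent fields):

```java
import java.util.*;

// Sketch: a read-only Map view over an Object[] row, leaning on
// AbstractMap so only get() and entrySet() need hand-written code.
class ArrayRowMap extends AbstractMap<String, Object> {
    private final Object[] row;
    private final Map<String, Integer> columnNameToIndexMap;

    ArrayRowMap(Object[] row, Map<String, Integer> columnNameToIndexMap) {
        this.row = row;
        this.columnNameToIndexMap = columnNameToIndexMap;
    }

    @Override
    public Object get(Object key) { // Map.get takes Object, not String
        Integer index = columnNameToIndexMap.get(key);
        return index == null ? null : row[index];
    }

    @Override
    public Set<Entry<String, Object>> entrySet() {
        Set<Entry<String, Object>> entries = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> e : columnNameToIndexMap.entrySet()) {
            Object v = row[e.getValue()];
            if (v != null) { // treat null slots as missing fields
                entries.add(new SimpleImmutableEntry<>(e.getKey(), v));
            }
        }
        return entries;
    }
}
```

size(), containsKey(), iteration and so on all fall out of entrySet() for free; the trade-off is that entrySet() here builds a fresh set on each call.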

I wouldn't call it simple, so make sure you really need to optimize.

Otherwise, assuming that on average no more than maybe 95% of the columns are missing, a slightly more complicated construction should do: for each row, use a home-grown BitSet (long[]) for storing which columns exist. This way you'd waste only a single bit rather than a whole entry (32 or 64 bits) in the Object[].
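A hedged sketch of that bit-set construction, with java.util.BitSet standing in for the home-grown long[] (class name hypothetical): one bit per column records presence, and a compact array stores only the present values, located via the rank of the bit.

```java
import java.util.*;

// Sketch of the bit-set row: a BitSet marks which columns are present,
// and a compact Object[] stores only the present values, in column order.
class SparseRow {
    private final BitSet present;   // bit i set <=> column i has a value
    private final Object[] values;  // only the present values

    SparseRow(Object[] fullRow) {
        present = new BitSet(fullRow.length);
        int count = 0;
        for (int i = 0; i < fullRow.length; i++) {
            if (fullRow[i] != null) { present.set(i); count++; }
        }
        values = new Object[count];
        for (int i = 0, j = 0; i < fullRow.length; i++) {
            if (fullRow[i] != null) values[j++] = fullRow[i];
        }
    }

    // Value for column index i, or null if absent. The rank (number of
    // set bits before i) gives the slot in the compact array.
    Object get(int i) {
        if (!present.get(i)) return null;
        int rank = present.get(0, i).cardinality();
        return values[rank];
    }
}
```

The rank computation here is O(m) per lookup; a production version would likely cache popcounts per long word, which is what the home-grown long[] buys you.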

This is even more complicated, so make sure you really need to optimize.

Assuming that many rows share the same set of columns, you could store a reference to a shared columnNameToIndexMap within each row.

Stipe answered 28/10, 2013 at 19:45 Comment(1)
The first idea is an interesting approach. One collection (plus a map of field names) with n arrays. I'm going to explore it.Carlow

Well, if it is important to have all of the table data in memory at once, it isn't going to make much of a difference which direction you store the data structures (as a List of Maps or a Map of Lists). The List of Maps is obviously much more intuitive, so I'd keep that.

If you are concerned about the efficiency of Object creation and cleanup, I'd suggest using an Object pool. Here's a basic idea of how it might work:

public class TableRowPool {

    private static final int INITIAL_CAPACITY = 10000;

    private Queue<Map<String, Object>> mapObjects;

    public TableRowPool() {
        mapObjects = new LinkedList<Map<String, Object>>();
        for(int i = 0; i < INITIAL_CAPACITY; i++) {
            mapObjects.add(new HashMap<String, Object>());
        }
    }

    public Map<String, Object> getTableRowObject() {
        if(mapObjects.isEmpty()) {
            mapObjects.add(new HashMap<String, Object>());
        }
        return mapObjects.remove();
    }

    public void returnTableRowObject(Map<String, Object> obj) {
        mapObjects.add(obj);
    }

}

The LinkedList performs well as a queue, so object retrieval will be fast. It also is fast at appending new objects, if you want it to dynamically grow. You may need to change the data structures depending on whether it needs to be thread-safe, however.

To use the object pool, you would do something like the following:

//Load data
while((row = getResultSetRow()) != null) {
    Map<String, Object> rowObj = tableRowPool.getTableRowObject();
    //Fill in data
    myRows.add(rowObj);
}

//... Do all your business logic ...

//Cleanup
for(Map<String, Object> rowObj : myRows) {
    tableRowPool.returnTableRowObject(rowObj);
}
myRows = null;
Keciakeck answered 28/10, 2013 at 17:4 Comment(6)
Thanks for the answer. I did once explore the possibility of using an object pool. I find this approach unsafe as it puts a big burden on the client: to return the object and never use it again. That's a big responsibility that in my experience is easy to violate. Moreover, it breaks the separation-of-concerns principle by forcing the client to know about the implementation of the collection.Carlow
I am concerned with object creation but most of all with memory usage. The overhead of keeping so many collections in memory can overwhelm the application at times and throw OOM errors. If I keep the old collection, I'll have one list with n maps, each with an average of m elements in it, while if I create a map of lists, now I have m lists of size n in one map which is way less collections to keep in memory. If the size of the table and the number of fields were known the collection could be simplified even more (see Guava ArrayTable).Carlow
I would assume it's the actual data you are keeping in memory, rather than the data structures, that is going to be the memory hog. Yes, you can get some savings (you could create a pre-sized 2D array when you get the result set if you know the number of rows, and then have a single map of column name to array index). It may not be reasonable to try to have all of the rows in memory, however. Depending on the query, it often isn't all that expensive to just issue it again and run your business logic on each row as you iterate them.Keciakeck
For the pre-sized array idea, it would be something like this. Object[][] data = new Object[rows][columns]; and Map<String, Integer> columnMap = new HashMap<String, Integer>(). Then you iterate the column names and add them to columnMap. Next you iterate the rows and assign array values like this: data[i][columnMap.get(dbRow.columnName())] = dbRow.value();.Keciakeck
Hey Chill, thanks for the reply. Unfortunately the data doesn't come from a database and the number of rows is not known beforehand.Carlow
You can do the same thing with Lists of Object[]. That being said, sometimes it really isn't a good idea to try to cram everything into memory. As @snegi noted, your data may change and outgrow what you can handle. Or you may not be able to deploy your code to a slightly inferior machine. I think you should seriously consider why you need to have all the rows of data in memory at the same time.Keciakeck

If I had data so large that I was afraid of getting an OOM error, then instead of finding an optimal data structure to keep this data I would look at how I could use SIMD parallelism or something like Map-Reduce. However much you optimize your data structure, you can always run out of memory. E.g. if you do find an optimal data structure that works on a particular machine configuration, it might still not work on a machine with slightly less RAM.

But if you still want to stick to your current approach, why can't you normalize the data so that the missing fields are represented by null? When you read the data and create a row, just add null for the absent fields. That way you at least do not need a key-value data structure like a HashMap, and you can just use a List<List<Object>>.
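A sketch of that null-padding idea (names hypothetical): a single shared header gives each field an index, and every row is a fixed-length list padded with null for the absent fields.

```java
import java.util.*;

// Sketch of the null-padded layout: one shared header list plus
// List<List<Object>> rows, where null stands in for a missing field.
class NullPaddedTable {
    static final List<String> HEADER = Arrays.asList("id", "name", "email");

    // Look up a field by resolving its index in the shared header.
    static Object get(List<List<Object>> rows, int row, String field) {
        int idx = HEADER.indexOf(field);
        return idx < 0 ? null : rows.get(row).get(idx);
    }

    public static void main(String[] args) {
        List<List<Object>> rows = new ArrayList<>();
        rows.add(Arrays.asList(1, "Ada", null));   // email missing -> null
        rows.add(Arrays.asList(2, null, "x@y.z")); // name missing -> null
        System.out.println(get(rows, 0, "name"));  // Ada
    }
}
```

In production the header lookup would be a Map<String, Integer> rather than a linear indexOf; the point is simply that the field names are stored once, not once per row.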

Porthole answered 28/10, 2013 at 18:26 Comment(1)
That's an interesting idea that I had already thought of but that way I still have n collections instead of only m.Carlow