Immutable Map implementation for huge maps
Asked Answered
M

3

10

If I have an immutable Map which I might expect (over a very short period of time - like a few seconds) to be adding/removing hundreds of thousands of items from, is the standard HashMap a bad idea? Let's say I want to pass 1Gb of data through the Map in <10 seconds in such a way that the maximum size of the Map at any once instant is only 256Mb.

I get the impression that the map keeps some kind of "history" but I will always be accessing the last-updated table (i.e. I do not pass the map around) because it is a private member variable of an Actor which is updated/accessed only from within reactions.

Basically I suspect that this data structure may be (partly) at fault for issues I am seeing around JVMs going out of memory when reading in large amounts of data in a short time.

Would I be better off with a different map implementation and, if so, what is it?

Marquet answered 19/2, 2010 at 16:49 Comment(0)
M
20

Ouch. Why do you have to use an immutable map? Poor garbage collector! Immutable maps generally require (log n) new objects per operation in addition to (log n) time, or they really just wrap mutable hash maps and layer changesets on top (which slows things down and can increase the number of object creations).

Immutability is great, but this does not seem to me like the time to use it. If I were you, I'd stick with scala.collection.mutable.HashMap. If you need concurrent access, wrap the Java util.concurrent one instead.

You also might want to increase the size of the young generation in the JVM: -Xmn1G or more (assuming you're running with -Xmx3G). Also, use the throughput (parallel) garbage collector.

Monograph answered 19/2, 2010 at 17:6 Comment(6)
Yes - I've changed it to use the mutable Map but I though the whole point of FP was that immutability was great! This app should easily run in less than 256Mb of memory from a "how much data does it really need at any one point in time" perspective.Marquet
How great immutability is depends on the application. If you are running an application with, say, trees of message threads being sent around to a bunch of clients, immutability is a godsend--you just send the current tree and you don't have to worry about the data structure itself changing out from under you. (You do still have to catch cases like the client asking to add a comment to a thread that is deleted by the time they respond.) For high-throughput work in private data structures that are turning over very rapidly, immutability provides few advantages and demands a lot of overhead.Monograph
yeah, that'swhat I'm finding out! So how do Haskell or Clojure manage in these situations? Don't they enforce immutability?Marquet
In FP one would prefer a map based on an ordered, balanced tree, rather than one based on a hashtable - exactly because it's hard to get a persistent version of that. From my understanding, FP people are quick to give up a hashtable's O(1) lookup for a balanced tree's O(logn), with a reasoning basically amounting to "who cares about that much performance, it doesn't matter, I like immutability more". Well, some people just prefer performance :)Classmate
AFAICT, Clojure uses Java's collections when it needs mutable data structures. Haskell uses the best available immutable algorithms and then considers an occasional log N as a worthwhile sacrifice for purity.Monograph
Haskell has escape hatches for when you can't get the performance you need from immutable data structures. For example, Haskell's Data.HashTable operations are in the IO monad (which essentially means that they're not purely functional).Joust
C
8

That would be awful. You say you always want to access the last-updated table, that means you only need an ephemeral data structure, there is no need to pay the cost for a persistent data structure - it's like trading time and memory to gain completely arguable "style points". You are not building your karma by using blindly persistent structures when they are not called for.

Also, a hashtable is a particularly difficult structure to make persistent. In other words, "very, very slow" (basically it is usable when reads greatly outnumber writes - and you seem to talk about many writes).

By the way, a ConcurrentHashMap wouldn't make sense in this design, given that the map is accessed from a single actor (that's what I understand from the description).

Classmate answered 19/2, 2010 at 17:29 Comment(4)
You're right - I have no requirement for immutability or concurrency. My map is a cache private to a single actorMarquet
I assume, since you are using actors, you want to take advantage of concurrency opportunities. This map/actor, which now sounds like a potential bottleneck, might be a good such opportunity (irrelevant from actors though), i.e. you might benefit by making the map a shared ConcurrentHashMap (not owned by any single actor) and let writers proceed concurrently, if possible/reasonable.Classmate
The data is the map does not need to be shared between actors - there are about 30 actor-instances, each with their own maps (of data relevant only to them)Marquet
Too often I hear the solution to immutable map performance issues in Scala is "go mutable"... You can't talk about "style points" when it comes to sticking to immutable data structures in functional programming, it's much more fundamental and important than that. Either we can do it and there is really a different programming paradigm, or we have to second-guess ourselves every time, and therefore there is no real viable new programming paradigm here.Charles
P
4

Scala's so-called(*) immutable Map is broken beyond basic usage up to Scala 2.7. Don't trust me, just look up the number of open tickets for it. And the solution is just "it will be replaced with something else on Scala 2.8" (which it did).

So, if you want an immutable map for Scala 2.7.x, I'd advise looking for it in something other than Scala. Or just use TreeHashMap instead.

(*) Scala's immutable Map isn't really immutable. It is a mutable data structure internally, which requires lot of synchronization.

Papeterie answered 19/2, 2010 at 17:39 Comment(3)
Where do I find an alternative? Is TreeMap OK? I can hardly find one in JavaMarquet
@oxbow I think people were using HashTreeMap as an alternative indeed, now that you mention it.Papeterie
Scala's immutable map seems to have been fixed recently. The code is a trie now, the old logging scheme seems to be gone.Memorable

© 2022 - 2024 — McMap. All rights reserved.