Scala compiler optimization for immutability

Does the Scala compiler optimize memory usage by removing references to vals that are used only once within a block?

Imagine an object holding, in aggregate, some huge data, reaching a size where cloning the data or derivatives of it may well exhaust the maximum memory available to the JVM or the machine.

A minimal code example, but imagine a longer chain of data transforms:

val huge: HugeObjectType = ??? // stand-in for some huge data
val derivative1 = huge.map(_.x)
val derivative2 = derivative1.groupBy(....)

Will the compiler, e.g., leave huge marked as eligible for garbage collection after derivative1 has been computed, or will it keep it alive until the wrapping block is exited?

Immutability is nice in theory, and I personally find it addictive. But for big data objects that can't be stream-processed item by item on current-day operating systems, I would claim that it is inherently impedance-mismatched with reasonable memory utilization in a big data application on the JVM, unless compilers optimize for cases like this one.

Biogeography answered 22/11, 2015 at 9:25 Comment(4)
I don't think the compiler can do a lot about GC (which is runtime). Check #6275897 – Fernanda
It would be very helpful for answers to address the overall question rather than (only) GC limitations... – Biogeography
The compiler doesn't mark anything "eligible for garbage collection". Things are garbage collected when there are no more references to them, not because the compiler instructs it. That's why the right answer is about GC and its limitations. – Universal
So then Scala programs written in the immutable pattern generally expressed above are inherently inadequate for processing big data in any (memory-)efficient manner. Are you sure de-referencing cannot in any way be hinted at in the bytecode? What if the chain of code in the example were, e.g., split into functions? – Biogeography

First of all: the actual freeing of unused memory happens whenever the JVM GC deems it necessary. So there is nothing scalac can do about this.

The only thing that scalac could do would be to set references to null not just when they go out of scope, but as soon as they are no longer used.

Basically, scalac would have to rewrite the block along these lines (not valid source Scala, since vals cannot be reassigned; think of it as what the emitted bytecode would do):

val huge: HugeObjectType
val derivative1 = huge.map(_.x)
huge = null // inserted by scalac
val derivative2 = derivative1.groupBy(....)
derivative1 = null // inserted by scalac

According to this thread on scala-internals, it currently does not do this, nor does the latest HotSpot JVM come to the rescue. See the post by scalac hacker Grzegorz Kossakowski and the rest of that thread.

For a method that is being optimised by the JVM JIT compiler, the JIT compiler will null references as soon as possible. However, a main method that is executed only once will never be fully optimised by the JVM.
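
As a concrete illustration of why method boundaries matter (a sketch of my own, not from the thread; the object, types and helpers are illustrative stand-ins): each intermediate value lives only in the stack frame of its own helper method, so it becomes unreachable as soon as that helper returns, even in an interpreted, run-once main method.

object Pipeline {
  final case class Item(x: Int)

  def loadHuge(): Vector[Item] = ??? // stand-in for materializing the huge data

  def step1(huge: Vector[Item]): Vector[Int] = huge.map(_.x)
  def step2(d1: Vector[Int]): Map[Int, Vector[Int]] = d1.groupBy(_ % 10)

  def main(args: Array[String]): Unit = {
    // `huge` exists only as step1's argument and is unreachable
    // as soon as step1 returns.
    val result = step2(step1(loadHuge()))
    println(result.size)
  }
}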

The thread linked above contains a pretty detailed discussion of the topic and all the tradeoffs.

Note that in typical big data computing frameworks such as Apache Spark, the values you work with are not direct references to the data but lazy descriptions of computations. So in these frameworks the lifetime of references is usually not a problem.
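
For instance, a minimal sketch using Apache Spark's RDD API (assuming Spark is on the classpath; the data and transforms are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object SparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sketch").setMaster("local[*]"))

    // Each RDD below is a lazy description of a computation, not the
    // materialized data, so holding references to intermediates is cheap.
    val huge        = sc.parallelize(1 to 10000000)
    val derivative1 = huge.map(_ * 2)             // lazy
    val derivative2 = derivative1.groupBy(_ % 10) // lazy
    println(derivative2.count())                  // work happens only here

    sc.stop()
  }
}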

For the example given above, all intermediate values are used exactly once. So an easy solution is to just define all intermediate results as defs.

def huge: HugeObjectType = ??? // a def is re-evaluated at each use; stand-in body
def derivative1 = huge.map(_.x)
def derivative2 = derivative1.groupBy(....)
val result = derivative2.<some other transform>

A different yet very potent approach is to use iterators. Chaining functions like map and filter over an iterator processes elements one by one, so no intermediate collections are ever materialized, which fits this scenario very well. This will not help with functions like groupBy, but it may significantly reduce memory allocation for the former functions and similar ones. Credit to Simon Schafer from the above-mentioned thread.
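
For example, a minimal sketch of the iterator approach (the collection and field names are illustrative stand-ins):

final case class Record(x: Int)

val huge: Vector[Record] = ??? // stand-in for the huge collection

// `.iterator` makes the chain lazy: each element flows through map and
// filter one at a time, so no intermediate collection is materialized.
val result: Int = huge.iterator
  .map(_.x)
  .filter(_ > 0)
  .sum // a strict terminal operation forces the traversal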

Bandmaster answered 22/11, 2015 at 12:5 Comment(3)
Thank you for addressing both the gist and the detail of the question so clearly, and for providing the currently most relevant Google group thread! I added the part about streaming computation with iterators to the answer before marking it accepted. Defs have been heavily criticized in the other answer; I'm not entirely sure how much that criticism weighs against avoiding memory sprawl with big data... – Biogeography
Not entirely sure what constitutes "a main method that is executed only once" in this context. I assume we're not talking about just the application's single main function... any good link for a definitive definition, anyone? – Biogeography
I seriously doubt that the additional class file will make any difference in a situation like this. And yes, defs will be evaluated on each use. That's the point. – Kinross

derivative1 is going to be garbage collected once it falls out of scope (and there are no other references to it). To ensure that happens as soon as possible, do this:

val huge: HugeObjectType = ??? // stand-in for the huge data
val derivative2 = {
    val derivative1 = huge.map(_.x)
    derivative1.groupBy(....)
}

This is also better from a code readability perspective, since it is obvious that the sole reason for derivative1's existence is derivative2, and that it is not used any more after the closing brace.
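
For longer chains, the same idea extends by nesting one block per intermediate (a sketch; the types and transforms are illustrative):

final case class Record(x: Int)
val huge: Vector[Record] = ??? // stand-in for the huge data

val result = {
  val derivative2 = {
    val derivative1 = huge.map(_.x)
    derivative1.groupBy(_ % 10)
  } // derivative1 can be collected from here on
  derivative2.map { case (k, v) => k -> v.size }
} // derivative2 can be collected from here on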

Fond answered 22/11, 2015 at 16:54 Comment(4)
The previous answer mentions defs; is there any difference at all, other than syntax and your take on such lifespan-laden coding style? I'd gladly adopt this style for some cases... not necessarily for longer chains of manipulations, though... – Biogeography
defs are dangerous: they get recomputed on each access, and they cause additional .class files to be created. I suggest you instead adopt this style of limiting the scope. It has the additional advantage of putting the programmer's mind at ease: you can be sure you're done with that value and that it is not going to be secretly used 100 lines down the execution path. – Fond
Good to realize that. Still, this would be one block per transformation in the general case of a longer chain of data transformations. The general objective has been to keep memory consumption more or less constant throughout the chain of transformations, garbage collection assumed. Each introduced block achieves that goal for only one transformation... so still useful, although a little horrible in terms of code style in long chains of data transformations. – Biogeography
By the way, why would I really care about a few extra class files that much? :) – Biogeography
