How fragile is escape analysis in HotSpot in simple cases, such as the iterator in a for-each loop?
Suppose I have a java.util.Collection that I want to loop over. Normally I'd do this:

for(Thing thing : things) do_something_with(thing);

But suppose that this is in some core utility method that is used all over the place, and in most places, the collection is empty. Then ideally we would prefer not to impose an iterator allocation on every single caller just to perform a no-op loop, and we could rewrite things like this:

if(things.isEmpty()) return;
for(Thing thing : things) do_something_with(thing);

An even more extreme option, if things is a List, would be to use a C-style for loop.

But wait, Java's escape analysis should eliminate this allocation, at least after the C2 compiler gets around to this method. So there should be no need for this "nano-optimization". (I won't even dignify it with the term micro-optimization; it's a bit too small for that.) Except...

I keep hearing that escape analysis is "fragile" but no one seems to ever talk about what in particular can mess it up. Intuitively, I would think that more complicated control flow would be the main thing to fear, which means iterators in a for-each loop should be RELIABLY eliminated, since there the control flow is simple.

The standard response here is to try running an experiment, but unless I know the variables in play, it's pretty hard to trust any conclusions I might be tempted to draw from such an experiment.

Indeed, here's a blog post where someone tried such an experiment, and 2 out of 3 profilers gave the wrong results:

http://psy-lob-saw.blogspot.com/2014/12/the-escape-of-arraylistiterator.html

I know a lot less about obscure JVM wizardry than the author of that blog post, and am likely to be far more easily misled.

Cesarean answered 1/6, 2021 at 16:30 Comment(6)
Wait, that's not what I expected at all. Why should it matter whether do_something_with gets inlined or not? That method never gets its hands on the iterator, so it can't affect whether the iterator escapes or not. What gives?Cesarean
Sorry for confusion, I've removed my comment as it is inaccurate for modern HotSpot.V2
BTW, a recent discussion on current limitations of Escape Analysis and the proposal how to improve it.V2
E.g. an easy way to break scalarization in your example is to feed the method with different collection types. If things is always ArrayList, scalarization will likely succeed. But if sometimes it is ArrayList, and sometimes Collections.emptyList(), the allocation won't be eliminated.V2
Re: partial escape analysis: exciting! I thought that C2 was rumored to be incapable of such things and the only hope was Graal. Fingers crossed. Re: fluctuating types: yeah, that sounds like expecting way too much from the compiler. In my actual use case, the type is statically known. (In particular, it's CopyOnWriteArrayList, whose iterators are even simpler than those of ArrayList.) If the first example you thought of requires something that crazy, then I suppose that HotSpot's escape analysis must be pretty robust after all. Or am I trying to infer too much from your example?Cesarean
I've summarized the comments in the answer, added more details and links.V2
Scalar Replacement is indeed a kind of optimization that you can never be absolutely certain about, because it depends on too many factors.

First, an allocation may be eliminated only when all uses of the instance are inlined into one compilation unit. In the case of an iterator, this means that the iterator constructor and the hasNext and next calls (including their nested calls) must all be inlined.

// For example, CopyOnWriteArrayList's COWIterator.next(), simplified:
public E next() {
    if (!hasNext())
        throw new NoSuchElementException();
    return (E) snapshot[cursor++];
}

However, inlining itself is a fragile optimization in HotSpot, since it relies on many heuristics and limits. For example, it may happen that the iterator.next() call is not fully inlined into the loop because the maximum inlining depth was reached, or because the outer compilation is already too big.
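Rather than guessing, you can ask HotSpot to print its inlining decisions (a sketch; MyLoop stands in for whichever class contains your hot loop):

```shell
# Print JIT inlining decisions; grep the output for the iterator methods
# and look for "inline (hot)" versus failure reasons such as "too large"
# or "inlining too deep".
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining MyLoop

# Note: flags that report scalar replacement directly (e.g.
# -XX:+PrintEliminateAllocations) are only available in debug builds
# of the JVM.
```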

Second, scalar replacement does not happen if a reference conditionally receives different values.

for(Thing thing : things) do_something_with(thing);

In your example, if things is sometimes ArrayList and sometimes Collections.emptyList(), the iterator will be allocated on the heap. For elimination to happen, the type of the iterator must always be the same.
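As a concrete sketch of that call pattern (class and method names here are mine, and simply running this proves nothing about scalar replacement; it only shows the kind of call-site type pollution described above):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ProfilePollution {
    // Hot method: the concrete iterator type observed at the implicit
    // things.iterator() call site depends on what the callers pass in.
    static int sum(List<Integer> things) {
        int total = 0;
        for (int t : things) total += t; // allocates an Iterator unless scalar-replaced
        return total;
    }

    public static void main(String[] args) {
        List<Integer> full = new ArrayList<>(List.of(1, 2, 3));
        List<Integer> empty = Collections.emptyList();
        int total = 0;
        // Mixing ArrayList and Collections.emptyList() makes the iterator()
        // call site polymorphic, which can defeat scalar replacement once
        // the method is JIT-compiled.
        for (int i = 0; i < 10_000; i++) {
            total += sum(full);
            total += sum(empty);
        }
        System.out.println(total); // 60000
    }
}
```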

There are more examples in a great talk on Scalar Replacement by Ruslan Cheremin (it's in Russian, but YouTube's subtitles translation feature to the rescue).

Another recommended read is Aleksey Shipilёv's blog post, which also demonstrates how to use JMH to verify whether Scalar Replacement happens in a particular scenario or not.

In short, in simple cases like yours, there is a high chance that allocation elimination will work as expected. There may be some marginal cases though, as I mentioned above.

There is a recent discussion on the hotspot-compiler-dev mailing list regarding a Partial Escape Analysis proposal. If implemented, it could significantly extend the applicability of the Scalar Replacement optimization.

V2 answered 2/6, 2021 at 1:4 Comment(2)
Thanks for the talk! I can't really read Russian but my spoken Russian is pretty okay. Will try watching that soon. So it seems that if the type of the collection and hence the iterator is statically known, and further if I'm confident that the relevant calls are in-lined, then the odds are in my favor.Cesarean
Finally watched most of the talk; did not realize that Arrays.asList() breaks escape analysis in Java 8... luckily it's fixed in 9. Yet more proof that inner classes should be static by default :)Cesarean
Your approach does not work. The correct approach is this:

  • Unless you are a performance expert (which is hard to become), do not make assumptions about what kind of code performs well vs. performs poorly, and maintain skepticism when analysing profiler reports. This is not particularly useful advice (it boils down to: A profiler report may be lying to you!), but it is what it is. Effectively, either be a performance expert, or accept that there's not much you can do about it. Sucks, but, don't shoot the messenger.
  • Write idiomatic java code. It is easiest to maintain and most likely to get optimized by hotspot.
  • Reduction of algorithmic complexity is useful and should always be the first thing you check. To some extent, an optimization that reduces algorithmic complexity gets to ignore the first rule. You do not need to be particularly up to date on the vagaries of JVMTI or Flight Recorder and how profilers work to conclude that an algorithmic rewrite is worthwhile and is going to significantly improve performance.
  • Do not trust pithy rules of thumb, no matter how many people repeat them. Do not look for easy-to-apply patterns like 'replace all for-each loops by prepending an if-block that tests for empty first'; these are essentially never correct and usually reduce performance.
  • Be aware that bad performance advice is rampant. As a general principle in life and logical reasoning, you should never treat the ubiquitous presence of some argument, otherwise bereft of proof or research, as making it more likely to be true (that is, after all, a logical fallacy!), and this counts double for performance!

More in-depth musing

Presumably, you're not going to trust the above maxims just because I'm telling you to trust them. I'll try to take you through some falsifiable reasoning to show you why the above maxims are correct.

In particular, this idea of checking for empty first seems extremely misguided.

Let's first translate the hyperbolic and therefore rather unhelpful well-known maxim premature optimization is the root of all evil into something more tangible:

Do not make your code an ugly, caveat-ridden mess of weirdness because of an imagined performance issue.

Why can't I go by often-heard maxims?

Do not go by "people" here. Because "people" are notorious for being utterly wrong on performance, time and time again. If you can find widespread, pithy, and entirely proof- and research-free statements that X is good or bad for performance, you can rest assured that this means absolutely nothing whatsoever. Your average Twitter writer or whatnot is a clueless idiot in this regard. Proof, ample research, or credentials are an absolute requirement for taking things seriously, preferably 2 or 3 of those. There are lists of well-known performance falsehoods (commonly held beliefs about how to improve JVM performance that do not help whatsoever and often actually hurt), and if you search for these falsehoods you can find entire hordes of folks who espouse them, thus proving that you just cannot trust anything based solely on the fact that you "keep hearing it".

Note also that for just about every imaginable line of java code, you can come up with 100+ plausible if somewhat exotic ideas on how to make the code less obvious but seemingly 'more performant'. Clearly then you can't apply all 100 variants to every line in the entire project, so the road you were planning on taking here ("I do not quite trust that profiler, I find it plausible escape analysis will fail to eliminate this iterator allocation, so, just to be safe I will add an if that checks for empty first"), ends in a disaster where even the simplest task becomes a many-lined, seemingly excessively redundant soup. And performance will be worse on average, so it's a lose-lose scenario.

Here's a simple example to drive the point home, and you can watch those presentations by Doug for more of this sort of thing:

List<String> list = ... retrieve thousands of entries ...;
String[] arr1 = list.toArray(new String[list.size()]);
String[] arr2 = list.toArray(new String[0]);

It is quite plausible that the arr1 line is faster, right? It avoids creating a new array that is then immediately eligible for garbage collection. And yet, turns out, arr2 is faster because hotspot recognizes this pattern and will optimize out the zeroing-out of that array (not a thing you can do in java, but perfectly possible in machine code of course), because it knows that all bytes are overwritten regardless.
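A runnable version of that comparison (the class name ToArrayDemo is mine; this only demonstrates that the two idioms are equivalent in result, while the performance claim is the one made above):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ToArrayDemo {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        for (int i = 0; i < 5_000; i++) list.add("s" + i);

        // "Presized" variant: avoids creating one throwaway array, and so
        // intuitively looks faster.
        String[] arr1 = list.toArray(new String[list.size()]);

        // Zero-length variant: HotSpot can elide the defensive zeroing of
        // the freshly allocated array, so this is typically the faster idiom
        // despite intuition.
        String[] arr2 = list.toArray(new String[0]);

        System.out.println(Arrays.equals(arr1, arr2)); // true
    }
}
```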

Why should I write idiomatic java code?

Keep in mind, hotspot is a system that tries to identify patterns, and applies optimizations to these patterns. There are an infinite number of patterns one could theoretically optimize. Thus, hotspot is designed to search for useful patterns: take a given pattern, and weigh [odds that this appears in your average java project * how often it would show up in performance-crucial code paths * amount of performance gain we can realize for it]. The takeaway is that you should write idiomatic java code. If you write bizarro java code nobody else writes, hotspot is vastly more likely to fail to optimize it, because the authors of hotspot are people too, and they optimize for common cases, not for weirdness. SOURCE: Douglas Hawkins, JVM performance engineer at Azul (for example, this Devoxx presentation); many other JVM performance engineers have said similar things.

In passing, you get code that is easy to maintain, and easy to explain - because other java coders will read it and find familiar ground.

Seriously, become a performance expert, that is the only way?

Mostly. But, hey, CPU and memory are quite cheap, and hotspot rarely makes algorithmic improvements. As in, hotspot will rarely turn an O(n^2) algorithm into, say, an O(n) one: if you graph 'size of input' against 'time taken to run the algorithm' and the curve looks like y = x^2, hotspot is not going to turn that into a y = x linear affair. That's rare to impossible; the improvements tend to be constant factors, so throwing some more CPU cores and/or RAM at it is just as effective, generally.

Also, of course, algorithmic wins always dwarf whatever hotspot and micro/nano-optimizations could ever do for you.

Thus: just write code that looks good, is easy to test, is written in an idiomatic fashion, and uses the right, most efficient algorithms, and it'll run fast. If it isn't fast enough, throw more CPU or RAM at it. If it still isn't fast enough, spend 10 years becoming an expert.

"Lets add an empty check, yaknow, just in case!" doesn't fit in that plan.

Adamant answered 1/6, 2021 at 17:26 Comment(24)
Thank you for your detailed answer. Or more precisely, your detailed non-answer :) Unfortunately, the limit on comment length is a bit short, so I will split my reply into two comments. Let me explain a bit more context. I am modifying existing code, and during code review someone reasonably pointed out that I might be introducing a performance regression by doing it the obvious way (no extra empty check). (see next comment for remainder)Cesarean
I thought about it, and pointed out that this is probably not a concern since escape analysis should handle it, but he expressed some skepticism that this can be counted on to work in practice. But now you counter-argue that perhaps the non-idiomatic performs even worse, since it's an unusual pattern the optimizer doesn't see often. But in this case, that can't realistically be true. To see why, consider the worst that an extra if statement could do here. (see next comment for rest.)Cesarean
Perhaps it tickles some pathological corner case in C2 and it gives up, so this method will only be compiled by C1. Suppose that costs me a factor of two slow-down. Pretty bad. But the cost of an extra allocation in a tight loop can be far worse, if it ends up triggering more frequent stop-the-world collections. (We have GC troubles often enough as it is, so this is not a theoretical concern.) (see next comment, again)Cesarean
In my opinion, this is not really comparable to your example of toArray because that's a pure CPU-time issue: both algorithms will allocate an array of length n; the question is which will fill it faster. As it happens, I already know this particular factoid, because I have read many of Aleksey Shipilëv's blog posts. (Certainly he qualifies as a performance expert.) I'm not a performance expert, and realistically will probably never become one. But I would likely to become MORE of an expert than I am now. Which means I must learn. (see next comment.)Cesarean
someone reasonably pointed out that I might be introducing a performance regression by doing it the obvious way - my point is, this is not reasonable. You can make a plausible sounding claim in that vein for just about every line of java code written anywhere on the planet. In crass parlance: "Proof, or GTFO", is the right approach here.Adamant
Your idea of 'hey, some basic analysis indicates that surely this will not hurt' is also incorrect: You can come up with hundreds of 'eh, can't hurt' interventions, and via death-by-a-thousand-cuts, the sum of it all does hurt performance, let alone the disaster it'll cause in your code's readability and maintainability. On its face, if (!list.isEmpty()) for (String item : list) {} is hard-to-read code, because that if construct seems redundant. It requires a comment explaining why you did that, and thus off we are to why this is not a good idea.Adamant
Re: hard to read: I agree! It is ugly, verbose, and non-intuitive! It needs a comment so the next person doesn't stare at it in confusion, wondering why it's there, and whether it's safe to remove. I would love an excuse to get rid of it! But I'm not going to pick a fight during code review insisting that it should be gotten rid of, unless I know I'm right. (On one occasion, a few months ago, I did pick such a fight: I was confident that a trivial getter would be in-lined. The reviewer was convinced.) (see next comment)Cesarean
In this case, I don't know that I'm right, because, as you pointed out, I'm no expert, and this is more complicated than "will a tiny getter be inlined". But I suspect that in this case there is a simple answer to this particular question, so I decided to ask.Cesarean
@MarkVY The blogpost you linked to, though, shows that [A] the answer is not simple, and [B] it rather clearly shows that basic for-eaching over empty collections does not incur the penalty of allocating the iterator object if hotspotted. You already had this answer before posting, hence why I decided to go for the more general principle of 'are there easy rules of thumb / should I mess up my code based on imagined / plausible hearsay?' (to which the answer is: There are none, and definitely do not do that).Adamant
So, you are right, with correct but complicated reasoning: If 'plausible but lacking proof performance reasons' are valid to keep code ugly, then you can defend just about every bizarre construction. You might as well stop code review entirely, what's the point? "But, performance!" becomes the catch-all end of any debate. It becomes a magic spell. Say it, and the reviewer just poofs out of existence.Adamant
@MarkVY “introducing a performance regression by doing it the obvious way” is a funny phrase, as it is the absolute opposite of sane recommendations. The runtime optimizer is a pattern matching engine and doing things in a non-obvious way bears the risk of losing optimizations because the JVM may fail to recognize the pattern. Therefore, any deviation from the straight-forward way of doing something requires lots of cross checking. Further, the impact of a temporary object might be lower than the costs of an additional things.isEmpty() check.Sideway
Sure, but the cost of the isEmpty() is relatively easy to get worst-case bounds on. the cost of many additional objects may be a stop-the-world GC pause, as the answer you just linked to pointed out: allocating lots of objects means you exhaust your TLAB sooner, and the rate of TLAB allocation affects how often you perform GCs. Hence this question: under which conditions can I rely on this particular optimization.Cesarean
In general I think it is well accepted that having an optimization you can rely on SHOULD change your programming style. For instance, in Scheme you have guaranteed tail call elimination, whereas in Python you don't. So in Scheme that means tail recursion is "free" and can be used as a primitive for implementing other looping constructs. In Python that would simply fail. The same principle applies here. The difference is that collecting garbage on the heap is merely costly, instead of impossible, like in the case of collecting un-needed stack frames.Cesarean
@MarkVY You're letting the tail wag the dog. It's not: "We start with well known optimizations you can rely on, then this shapes the code style" (as in: "We know scheme has TCR, this has resulted in idiomatic scheme involving recursion to which TCR can be applied"). You've got that reversed, at least for java. The right reasoning is: "This is idiomatic java. Therefore, optimisation will ensure that this is tackled". For example, if, somehow, a ton of existing java code wrote TCRable recursion? Then the JDK would evolve to deal with it well. But, TCR is not idiomatic, thus, don't write that.Adamant
Sure, but the cost of the isEmpty() is relatively easy to get worst-case bounds on - you still aren't getting it. There are an infinite number of plausible stories you can tell about performance. You have to take a step back and realize that the process of "Plausible sounding performance story with an easily applied change to the code that makes my code slightly worse but addresses the plausible story" inevitably results in slower code that is style-wise a disastrous, unmaintainable soup. Only then can you come back to this specific case and realize it's not a good idea at all.Adamant
Indeed, there are an infinite number of stories. I'm focusing on this one because I wrote the "obvious" code and during code review the reviewer chose to bring up this one in particular. Are you suggesting I dismiss his concern with "Proof or GTFO"? Surely not. He could just as easily dismiss my dismissal: "prove that this is NOT an issue". Then we are at an impasse. The correct thing to do in such situation is NOT resort to mantras such as "write idiomatic code". The correct thing to do here is try to learn more about the optimizer so we need not argue from a position of ignorance.Cesarean
Are you suggesting I dismiss his concern with "Proof or GTFO"? Yes, exactly. Harsh but it's the only way to deal with 'plausible performance stories' - there are a lot, following them leads to death by a thousand cuts, and most of these plausible stories are entirely incorrect, so a highly skeptical take is the correct approach. You can't prove a negative, so if that's how your reviewer responds, it's them being a well-intended idiot. The burden of proof lies with them, not you. If they really wanna go there, well, that blog post you linked to, that'd be a good start for this specific case.Adamant
I can finally verbalize what I agree and disagree with about your position. If, hypothetically, I were writing code that is meant to be portable across many JVMs, then this might be a reasonable approach. But if I'm writing code that targets a specific JVM running on a specific CPU then assuming I care about performance at all, then I need to become as much of a performance expert as possible, across as many layers of my "stack" as possible, ranging from knowing when escape analysis is likely to succeed to whether a branch mis-prediction is more costly than a cache miss. (see next comment).Cesarean
And you were kind enough to warn that I am meddling with forces beyond my ken, and should back off before I get hurt. Or in your words: either spend several years becoming an expert, or just write idiomatic code. The trouble is, there's no such thing as a "performance expert", just as there's no such thing as a "tall person". Some people are taller than others; if someone is 8 feet tall we can safely say they qualify as a "tall person". But there's no sharp cutoff. Similarly, the way to become a performance expert is to gradually learn more and more. (see next comment)Cesarean
Yes, it will take a long time to learn "enough" to be a "true expert". And until someone reaches that point, they should not trust their own "intuitions" about performance, nor should they trust any measurements they have made, because they might not have corrected for the right things. But that's no reason to discourage someone from trying to learn more! And that's what your response unfortunately amounts to: "do not bother trying to acquire the knowledge you seek, for you know not how to wield it anyway". (see next comment)Cesarean
But if everyone followed such advice, there would be a lot fewer experts. Because the only way to become one would be to commit to spending a few years before one is allowed to learn anything about performance at all. Surely that's not your intention?Cesarean
I downvoted your answer, which I basically never do on this site. I feel weird doing so, because you clearly spent a lot of time and effort in the answer itself and in the comments trying to get it through my thick skull. But I feel the attitude behind your answer is actively harmful, because it basically discourages further learning, which is not what a question-and-answer site should be doing. The JVM is not magic beyond mortal ken. Java has a weak performance model relative to assembly, but it doesn't take 10 years before one can have any opinions about what's faster or slower.Cesarean
The intent was the opposite, actually: The proper shortcut to well-performing code is to just write idiomatic java and not try to half-arse it. There is no easy way to become a performance expert, no 'just apply these 30 rote rules of thumb and that is good enough'. No, only in-depth understanding (as Heinlein would say, 'grokking' large parts of how VMs and various data structures work internally, and an excellent understanding of the vagaries of profilers) would let you move beyond 'do not trust your gut or hearsay, just write idiomatic code'. That shouldn't discourage you; on the contrary!Adamant
After all, often code, even written entirely idiomatically, either flat out won't work if it doesn't perform better given the use cases, or would cost dearly if it isn't improved. The only useful road forward then is to take a lot of time and figure it all out, such as by understanding the post you linked to. It's worthwhile to invest the time, because the alternative (some reviewer tosses out: Hey, you can save an allocation here, surely that'll fix our problems because I read somewhere allocations are bad) just doesn't work, not even close. There is no alternative to the expertise.Adamant
