What can explain the huge performance penalty of writing a reference to a heap location?

While investigating the subtler consequences of generational garbage collectors on application performance, I have hit a quite staggering discrepancy in the performance of a very basic operation – a simple write to a heap location – with respect to whether the value written is primitive or a reference.

The microbenchmark

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 1, time = 1)
@Measurement(iterations = 3, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(2)
public class Writing
{
  static final int TARGET_SIZE = 1024;

  static final    int[] primitiveArray = new    int[TARGET_SIZE];
  static final Object[] referenceArray = new Object[TARGET_SIZE];

  int val = 1;
  @GenerateMicroBenchmark
  public void fillPrimitiveArray() {
    final int primitiveValue = val++;
    for (int i = 0; i < TARGET_SIZE; i++)
      primitiveArray[i] = primitiveValue;
  }

  @GenerateMicroBenchmark
  public void fillReferenceArray() {
    final Object referenceValue = new Object();
    for (int i = 0; i < TARGET_SIZE; i++)
      referenceArray[i] = referenceValue;
  }
}

The results

Benchmark              Mode Thr    Cnt  Sec         Mean   Mean error    Units
fillPrimitiveArray     avgt   1      6    1       87.891        1.610  nsec/op
fillReferenceArray     avgt   1      6    1      640.287        8.368  nsec/op

Since the whole loop is more than 7 times slower, the write itself is probably more than 10 times slower. What could possibly explain such a slowdown?
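
(In per-element terms, dividing the numbers above by the array size: 87.891 ns / 1024 ≈ 0.086 ns per primitive write versus 640.287 ns / 1024 ≈ 0.625 ns per reference write, loop overhead included in both.)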

The speed of writing out the primitive array is more than 10 writes per nanosecond. Perhaps I should ask the flip side of my question: what makes primitive writing so fast? (BTW, I've checked: the times scale linearly with array size.)

Note that this is all single-threaded; specifying @Threads(2) will increase both measurements, but the ratio will be similar.


A bit of background: the card table and the associated write barrier

An object in the Young Generation could happen to be reachable only from an object in the Old Generation. To avoid collecting live objects, the YG collector must know about any references that were written to the Old Generation area since the last YG collection. This is achieved with a sort of "dirty flag table", called the card table, which has one flag for each block of 512 bytes of heap.
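
As a rough mental model of that mapping (not real JVM code: the actual card table is a native byte array inside the VM, and heapBase, heapSizeBytes and addr are purely illustrative names), the address-to-card translation looks like this:

/** Conceptual model only: one dirty flag per 512-byte block of heap. */
class CardTableModel
{
  static final int CARD_SHIFT = 9;      // 2^9 = 512 bytes covered by each card
  final long heapBase;                  // hypothetical start address of the heap
  final boolean[] cards;

  CardTableModel(long heapBase, long heapSizeBytes) {
    this.heapBase = heapBase;
    this.cards = new boolean[(int) (heapSizeBytes >>> CARD_SHIFT)];
  }

  /** What the write barrier conceptually does after every reference store to addr. */
  void markDirty(long addr) {
    cards[(int) ((addr - heapBase) >>> CARD_SHIFT)] = true;
  }
}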

The "ugly" part of the scheme comes when we realize that each and every write of a reference must be accompanied by a card table invariant-maintaining piece of code: the location in the card table which guards the address being written to must be marked as dirty. This piece of code is termed the write barrier.

In specific machine code, this looks as follows:

lea   edx, [edi+ebp*4+0x10]   ; calculate the heap location to write
mov   [edx], ebx              ; write the value to the heap location
shr   edx, 9                  ; calculate the offset into the card table
mov   [ecx+edx], ah           ; mark the card table entry as dirty

And this is all it takes for the same high-level operation when the value written is primitive:

mov   [edx+ebx*4+0x10], ebp   ; write the primitive value; no card mark needed

The write barrier appears to contribute "just" one more write, but my measurements show that it causes an order-of-magnitude slowdown. I can't explain this.

UseCondCardMark just makes it worse

There is a rather obscure JVM flag, -XX:+UseCondCardMark, which is supposed to avoid the card table write when the entry is already marked dirty. This matters primarily in some degenerate cases where a lot of card table writing causes false sharing between threads via CPU cache lines. Anyway, I tried with that flag on:

with  -XX:+UseCondCardMark:
Benchmark              Mode Thr    Cnt  Sec         Mean   Mean error    Units
fillPrimitiveArray     avgt   1      6    1       89.913        3.586  nsec/op
fillReferenceArray     avgt   1      6    1     1504.123       12.130  nsec/op
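
In terms of the CardTableModel sketch above, the flag replaces the unconditional markDirty store with a load followed by a conditional store, roughly like this (again only a sketch, not the actual VM code):

  /** With -XX:+UseCondCardMark: load the card first, store only if it is not already dirty. */
  void markDirtyConditionally(long addr) {
    int i = (int) ((addr - heapBase) >>> CARD_SHIFT);
    if (!cards[i])
      cards[i] = true;
  }

Whether the extra load and branch pay off depends on the workload; here they clearly do not.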
Raquelraquela answered 3/2, 2014 at 9:44 Comment(15)
Interesting. Could you verify that there are no differences in bounds checking or loop unrolling or whatever?Endaendall
@Endaendall There is in fact a much larger difference which I missed because I was probably looking at the OSR version instead of the regular one: the entire loop is actually replaced by a single call, probably to the native code behind Arrays.fill. Seems HotSpot was able to prove everything it needed to do that.Raquelraquela
Cool, but it raises a new question: why doesn't the same optimization work for objects? I'd hope for Arrays.fill to be smart about the write barrier and write only a few card table bytes (possibly using another Arrays.fill).Endaendall
@Endaendall Just checked---explicit fill calls perform almost the same as their loop counterparts, even though the reference version of loop isn't converted into a call. Explicit fill is actually slower than the loop for the primitive case! If anything, this is even more mysterious. I understand that the write barrier must be immediate (for concerns of concurrency), but it should still be optimized to write only when actually moving to the next card.Raquelraquela
Even with a rather dumb implementation it could be at worst twice as slow: write all the barriers first and then fill the array. My results differ: using Arrays.fill is slightly faster for references and about the same otherwise.Endaendall
@Endaendall Yes, actually it rather surprised me that the write barrier doesn't happen before the write. There could be a concern we are not aware of...Raquelraquela
I guess that safepoints can explain it. Even the concurrent GCs have some short phase depending on them, so you don't need to worry what happens in-between.Endaendall
Why are referenceArray and primitiveArray static? This will affect results with @Threads(2).Armbruster
@AlekseyShipilev That was actually my point: to test what happens when collisions occur between threads. But concurrency issues are not my main concern here. I would be glad to have a clear answer in the case of a single thread.Raquelraquela
@AlekseyShipilev In fact, an even more important reason to have them static is to ensure the arrays themselves rest in the Old Generation. I want to avoid the possibility of write barriers being optimized away.Raquelraquela
@MarkoTopolnik My best guess is that the write barriers prevent effective loop pipelining. I saw your thread at hotspot-compiler-dev@, let me see if I have time to dig into that.Armbruster
@MarkoTopolnik, well, if you run the benchmark long enough, the arrays will inevitably promote. Or, you can have the @Setup(Iteration) to generate lots of garbage to force more GC pressure.Armbruster
@MarkoTopolnik, I think there is one thing you are missing: storing a reference into an array requires a type check (the array store check), which is not performed for primitives. Such a check should not be free. docs.oracle.com/javase/7/docs/api/java/lang/…Emerald
Maybe this could help the investigation: books.google.co.uk/…Emerald
@Emerald HotSpot has an easy time optimizing away the type check, as I have convinced myself by looking at the assembly output. Of course, even with the type check, it would be very cheap, a lot cheaper than another write to the card table.Raquelraquela

Quoting the authoritative answer provided by Vladimir Kozlov on the hotspot-compiler-dev mailing list:

Hi Marko,

For primitive arrays we use handwritten assembler code which use XMM registers as vectors for initialization. For object arrays we did not optimize it because it is not common case. We can improve it similar to what we did for arracopy but we decided leave it for now.

Regards,
Vladimir

I also wondered why the optimized code is not inlined, and got an answer to that as well:

The code is not small, so we decided to not inline it. Look on MacroAssembler::generate_fill() in macroAssembler_x86.cpp:

http://hg.openjdk.java.net/hsx/hotspot-main/hotspot/file/54f0c207dc35/src/cpu/x86/vm/macroAssembler_x86.cpp


My original answer:

I missed an important bit in the machine code, apparently because I was looking at the On-Stack Replacement version of the compiled method instead of the one used for subsequent calls. It turns out that HotSpot was able to prove that my loop amounts to what a call to Arrays.fill would have done and replaced the entire loop with a call instruction to such code. I can't see that function's code, but it probably uses every possible trick, such as MMX instructions, to fill a block of memory with the same 32-bit value.

This gave me the idea to measure explicit Arrays.fill calls as well. The results held another surprise:

Benchmark                  Mode Thr    Cnt  Sec         Mean   Mean error    Units
fillPrimitiveArray         avgt   1      5    2      155.343        1.318  nsec/op
fillReferenceArray         avgt   1      5    2      682.975       17.990  nsec/op
loopFillPrimitiveArray     avgt   1      5    2      156.114        0.523  nsec/op
loopFillReferenceArray     avgt   1      5    2      682.209        7.047  nsec/op

The results with a loop and with a call to fill are identical. If anything, this is even more confusing than the results which motivated the question. I would have at least expected fill to benefit from the same optimization ideas regardless of value type.
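
For reference, the fill-based benchmark methods are not shown here; they would amount to little more than the following sketch (assuming the same fields as in the Writing class above and an import of java.util.Arrays):

@GenerateMicroBenchmark
public void fillPrimitiveArray() {           // Arrays.fill version
  Arrays.fill(primitiveArray, val++);
}

@GenerateMicroBenchmark
public void fillReferenceArray() {           // Arrays.fill version
  Arrays.fill(referenceArray, new Object());
}

@GenerateMicroBenchmark
public void loopFillPrimitiveArray() {       // explicit loop, as in the question
  final int primitiveValue = val++;
  for (int i = 0; i < TARGET_SIZE; i++)
    primitiveArray[i] = primitiveValue;
}

@GenerateMicroBenchmark
public void loopFillReferenceArray() {       // explicit loop, as in the question
  final Object referenceValue = new Object();
  for (int i = 0; i < TARGET_SIZE; i++)
    referenceArray[i] = referenceValue;
}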

Raquelraquela answered 3/2, 2014 at 10:38 Comment(7)
There seems to be a remarkable amount of error in the timings for the reference array. Perhaps more runs would be worthwhile?Commune
@user2357112: The benchmarking framework (JMH) is supposed to take care of it all.Endaendall
@Commune I've repeated with 5 iterations of 2 seconds each, and with more warmup. The error is less and the results have converged to practically identical (loop == fill, but the gap between primitive and reference remains stable). See updated answer.Raquelraquela
@Endaendall The repetition count is configurable, and I am using a very low setting for experimenting. It is good to use greater values to confirm when posting, though.Raquelraquela
Sorry, I was simply assuming you used some fairly conservative settings, as I usually do. Actually, I do nothing since Caliper's defaults are conservative enough (whenever I tried to be even more conservative I just confirmed the results).Endaendall
@Endaendall JMH's defaults are overly conservative: 20 iterations of 5 seconds each, for every tested method. That's why I always give my values.Raquelraquela
AFAIK the loop gets recognized as a fill and results in identical code. This would explain why the results are pretty exactly the same.Endaendall
