While investigating the subtler consequences of generational garbage collectors on application performance, I have hit a quite staggering discrepancy in the performance of a very basic operation – a simple write to a heap location – with respect to whether the value written is primitive or a reference.
The microbenchmark
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 1, time = 1)
@Measurement(iterations = 3, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(2)
public class Writing
{
static final int TARGET_SIZE = 1024;
static final int[] primitiveArray = new int[TARGET_SIZE];
static final Object[] referenceArray = new Object[TARGET_SIZE];
int val = 1;
@GenerateMicroBenchmark
public void fillPrimitiveArray() {
final int primitiveValue = val++;
for (int i = 0; i < TARGET_SIZE; i++)
primitiveArray[i] = primitiveValue;
}
@GenerateMicroBenchmark
public void fillReferenceArray() {
final Object referenceValue = new Object();
for (int i = 0; i < TARGET_SIZE; i++)
referenceArray[i] = referenceValue;
}
}
The results
Benchmark Mode Thr Cnt Sec Mean Mean error Units
fillPrimitiveArray avgt 1 6 1 87.891 1.610 nsec/op
fillReferenceArray avgt 1 6 1 640.287 8.368 nsec/op
Since the whole loop is almost 8 times slower, the write itself is probably more than 10 times slower. What could possibly explain such a slowdown?
The speed of writing out the primitive array is more than 10 writes per nanosecond. Perhaps I should ask the flip-side of my question: what makes primitive writing so fast? (BTW I've checked, the times scale linearly with array size.)
Note that this is all single-threaded; specifying @Threads(2)
will increase both measurements, but the ratio will be similar.
A bit of background: the card table and the associated write barrier
An object in the Young Generation could happen to be reachable only from an object in the Old Generation. To avoid collecting live objects, the YG collector must know about any references that were written to the Old Generation area since the last YG collection. This is achieved with a sort of "dirty flag table", called the card table, which has one flag for each block of 512 bytes of heap.
The "ugly" part of the scheme comes when we realize that each and every write of a reference must be accompanied by a card table invariant-maintaining piece of code: the location in the card table which guards the address being written to must be marked as dirty. This piece of code is termed the write barrier.
In specific machine code, this looks as follows:
lea edx, [edi+ebp*4+0x10] ; calculate the heap location to write
mov [edx], ebx ; write the value to the heap location
shr edx, 9 ; calculate the offset into the card table
mov [ecx+edx], ah ; mark the card table entry as dirty
And this is all it takes for the same high-level operation when the value written is primitive:
mov [edx+ebx*4+0x10], ebp
The write barrier appears to contribute "just" one more write, but my measurements show that it causes an order-of-magnitude slowdown. I can't explain this.
UseCondCardMark
just makes it worse
There is a quite obscure JVM flag which is supposed to avoid the card table write if the entry is already marked dirty. This is important primarily in some degenerate cases where a lot of card table writing causes false sharing between threads via CPU caches. Anyway, I tried with that flag on:
with -XX:+UseCondCardMark:
Benchmark Mode Thr Cnt Sec Mean Mean error Units
fillPrimitiveArray avgt 1 6 1 89.913 3.586 nsec/op
fillReferenceArray avgt 1 6 1 1504.123 12.130 nsec/op
call
, probably to native code ofArrays.fill
. Seems HotSpot was able to prove everything it needed to do that. – RaquelraquelaArrays.fill
to be smart with the write barrier and write a few bytes only (possibly using anotherArray.fill
). – Endaendallfill
calls perform almost the same as their loop counterparts, even though the reference version of loop isn't converted into a call. Explicitfill
is actually slower than the loop for the primitive case! If anything, this is even more mysterious. I understand that the write barrier must be immediate (for concerns of concurrency), but it should still be optimized to write only when actually moving to the next card. – RaquelraquelaArrays
is slightly faster for references and about the same otherwise. – EndaendallreferenceArray
andprimitiveArray
static? This will affect results with@Threads(2)
. – Armbruster@Setup(Iteration)
to generate lots of garbage to force more GC pressure. – Armbruster