Your JIT compiler auto-vectorized your loop, storing 4 ints per asm iteration.

But it made asm that's overcomplicated and missed a lot of optimizations. I wonder if this is maybe only first-stage code-gen, before the JIT compiler gets around to fully optimizing?
Your code doesn't return nums, so it's destroyed right after being created. After inlining, your function should optimize down to no instructions at all. Or as a stand-alone function, it should be just a ret. Allocating memory and then letting it get garbage collected is not an observable side-effect that an optimizer needs to preserve.

Then, if new succeeds, nums.length will be 10. So the code could be as simple as:
# %rbx holds a pointer to the actual data storage for nums[]
vbroadcastss -0x????(%rip),%xmm0 # broadcast-load the 0x42 constant into xmm0
vmovdqu %xmm0, (%rbx) # nums[0..3] : 16 bytes
vmovdqu %xmm0, 16(%rbx) # nums[4..7] : 16 bytes
vmovq %xmm0, 32(%rbx) # nums[8..9] : 8 bytes
Fully unrolling the loop makes the most sense here; setting up loop counters and so on takes more instructions and code size than a couple of stores. Especially when the size isn't a multiple of the vector width, the last partial vector has to be handled specially anyway.
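For contrast, a non-unrolled version would need something like the following loop machinery (just a sketch; it assumes %xmm0 already holds the broadcast 0x42 and %rbx is the array base, as above), and it still needs the separate partial store for nums[8..9]:

mov %rbx, %rdi               # running pointer (copy, leaving %rbx intact)
lea 32(%rbx), %rdx           # end of the two full 16-byte vectors
.Lfill:                      # do {
vmovdqu %xmm0, (%rdi)
add $16, %rdi
cmp %rdx, %rdi
jb .Lfill                    # } while (ptr < end);
vmovq %xmm0, 32(%rbx)        # nums[8..9]: the partial vector still needs its own store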
BTW, if your size had been 11 instead of 10, you could either do 8-byte + 4-byte stores for the tail, or a 16-byte store that partially overlapped: e.g. 16-byte vmovdqu stores to (%rbx), 16(%rbx), and 28(%rbx), with the last one covering nums[7..10]. A final unaligned vector that ends at the end of the array is a common strategy when manually vectorizing (or in glibc's small-buffer handling for memcpy), but even ahead-of-time compilers don't seem to use it.
Other obvious missed optimizations (both sketched below):

- vmovq load + vpunpcklqdq to broadcast. With AVX available, vbroadcastss is by far the best way to broadcast a 32-bit constant from memory: one instruction, with no ALU uop required. Maybe the JIT compiler doesn't actually know about new AVX instructions?

- mov %r10d,%r8d + add $-3,%r8d: this should obviously be a single lea -3(%r10), %r8d.
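Roughly what those two fixes look like (a sketch; the register choices just follow the quoted asm):

vbroadcastss constant(%rip), %xmm0   # one pure load uop, replacing the vmovq load + vpunpcklqdq shuffle
lea -3(%r10), %r8d                   # one instruction replacing mov %r10d,%r8d + add $-3,%r8d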
It's not clear what the starting value of %ebp is supposed to be; if the JVM is slicing off chunks of a buffer somewhere so RBX isn't the base of the array, then maybe EBP isn't 0 before the scalar loop? IDK why the loop bound of the scalar loop is in a register instead of an immediate, though.
It's odd to put static data in the same page as code (-0xda(%rip) is still in the same page). There's no big penalty, but it means the same page will need to be in the iTLB and dTLB, so you're covering less total code + data than if you used separate pages. Not a huge deal with 2M hugepages, though. The shared 2nd-level TLB is a victim cache (IIRC), so the iTLB miss that populates it probably won't help the vmovq load get a TLB hit. It will probably do a 2nd page walk.
I don't know why even good ahead-of-time C compilers like gcc and clang overcomplicate this so much, for a loop over an array with unknown alignment and length.
void set42(int *nums, unsigned long int len) {
for (unsigned long int i=0 ; i<len ; i++ ) {
*nums++ = 0x42;
}
}
This is what I'd do by hand, for 128-bit vectors without loop unrolling (and optimistically assuming that it's not worth reaching an alignment boundary, like your JIT, and like clang and gcc8 and higher):
# x86-64 System V calling convention: int*nums in RDI, len in RSI
set42:
cmp $4, %rsi
jb .Lsmall_count
lea -16(%rdi, %rsi,4), %rdx # pointer to end-16, the start of the final vector store
vbroadcastss constant(%rip), %xmm0
.p2align 4
.Lvector: # do {
vmovdqu %xmm0, (%rdi)
add $16, %rdi # nums += 4 elements
cmp %rdx, %rdi
jb .Lvector # while(nums < end-16);
# only reached for sizes >= 16 bytes so we can always store a full possibly-overlapping final vector
# for len = 16, this results in 2 stores to the same address, but that's cheaper than extra branches even if len=16 is common
vmovdqu %xmm0, (%rdx) # final potentially-overlapping vector
ret
.Lsmall_count:
test %rsi,%rsi
jz .Ldone
# some compilers will fully unroll this with a chain of branches
# maybe worth doing if small inputs are common
.Lscalar: # do {
movl $0x42, (%rdi)
add $4, %rdi # *num++ = 0x42;
dec %rsi
jnz .Lscalar # }while(--len);
# a more sophisticated cleanup strategy using SIMD is possible, e.g. 8-byte stores,
# but I haven't bothered.
.Ldone:
ret
Notice that for len>=4, there's one fall-through branch at the top, then only the loop branch. Total overhead is 1 macro-fused cmp/jcc, 1 broadcast load, and 1 lea. The loop is 3 uops with a non-indexed addressing mode.
AFAIK, compilers don't know how to use a possibly-overlapping last vector effectively. It's much better than scalar cleanup most of the time. Notice that for len=4 (16 bytes), we do the same vector store twice. But for len=8 (32 bytes), the loop exits after the first iteration, so we still only do 2 total stores. In other words, for any exact multiple of the vector width other than 1 vector, we don't do an overlapping store. Branching the same way for len=4 and len=8 is actually nice for branch prediction.
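A quick worked trace of those two cases (using a made-up base address of 0x1000 purely for illustration):

# len = 4 (16 bytes): rdx = 0x1000 + 16 - 16 = 0x1000
#   loop body: store 16 bytes at 0x1000, rdi -> 0x1010; 0x1010 < 0x1000 is false, so exit
#   final store at rdx = 0x1000              (the same 16 bytes stored twice)
# len = 8 (32 bytes): rdx = 0x1000 + 32 - 16 = 0x1010
#   loop body: store 16 bytes at 0x1000, rdi -> 0x1010; 0x1010 < 0x1010 is false, so exit
#   final store at rdx = 0x1010              (2 stores total, no overlap)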
Even good ahead-of-time C compilers make this super-complicated, as you can see on the Godbolt compiler explorer. Some of clang's complexity comes from unrolling more; clang6.0 unrolls a huge number of times. (I chose the compiler versions and options that led to the least complicated code. gcc7.3 and clang6.0 emit much larger functions for this.)
gcc7 and earlier go scalar until an alignment boundary, then use aligned vector stores. This can be good if you expect the pointer to be misaligned often, but saving instructions and making the already-aligned case even cheaper is better when the pointer usually is aligned and the penalty for misaligned stores is low.
vmovq load + vpunpcklqdq is a sub-optimal way to broadcast! If you're going to load a constant from memory, you should use vbroadcastss (mem), %xmm0. That costs the same as a vmovd scalar load; the broadcasting is free inside the load port on all CPUs that have AVX. – Holily

So the constant is only 0xda bytes before this code? Normally there's no advantage to keeping code and data in the same page, because modern x86 uses separate iTLB and dTLB. With AVX2, I'd have put 0x42 in a register and used vmovd %eax, %xmm0 / vpbroadcastd %xmm0, %xmm0 (or vpshufd), instead of loading a constant from memory. And replicating 0x42 twice in memory makes no sense, unless their JIT optimizer just sucks at shuffles. I guess we do have to account for the fact that a JIT has to compile fast and doesn't take the time to make more optimal code. – Holily

The scalar loop already uses an immediate 0x42, so it could have used r8d: mov $0x42, %r8d / vmovd %r8d, %xmm0. A more obviously bad decision: an indexed addressing mode for the store defeats micro-fusion on Sandybridge/Ivybridge, although the loop will still run at 1 iteration per clock there. – Holily

nums.length should be a compile-time constant if the new succeeds; then you can fill the array with 2 vmovdqu xmm + 1 vmovq xmm stores (fully unrolling the loop). I wonder if it helps to use 10 as the loop bound? Or since nums isn't returned, the whole function should just be a ret, or just nothing after inlining. – Holily

Do the vmovd %r8d, %xmm0 before you clobber r8d with mov %r10d,%r8d. r8d isn't used in the scalar loop, and it's clobbered after (at least on the path that leads to the vector loop). Oh, I see yet another missed optimization: mov %r10d,%r8d / add $0xfffffffd,%r8d should be an LEA. WTF is wrong with the peephole optimizer to not spot that? The whole block looks overcomplicated. – Holily

Tagged jvm-hotspot based on the guess that you're using that JIT compiler. Please confirm that guess, or edit to correct it. – Holily

vmovd anywhere :) – Synaeresis