What is vmovdqu doing here?

I have a Java loop that looks like this:

public void testMethod() {
    int[] nums = new int[10];
    for (int i = 0; i < nums.length; i++) {
        nums[i] = 0x42;
    }
} 

The assembly I get is this:

0x00000001296ac845: cmp    %r10d,%ebp
0x00000001296ac848: jae    0x00000001296ac8b4
0x00000001296ac84a: movl   $0x42,0x10(%rbx,%rbp,4)
0x00000001296ac852: inc    %ebp               
0x00000001296ac854: cmp    %r11d,%ebp
0x00000001296ac857: jl     0x00000001296ac845  

0x00000001296ac859: mov    %r10d,%r8d
0x00000001296ac85c: add    $0xfffffffd,%r8d
0x00000001296ac860: mov    $0x80000000,%r9d
0x00000001296ac866: cmp    %r8d,%r10d
0x00000001296ac869: cmovl  %r9d,%r8d
0x00000001296ac86d: cmp    %r8d,%ebp
0x00000001296ac870: jge    0x00000001296ac88e
0x00000001296ac872: vmovq  -0xda(%rip),%xmm0                                                    
0x00000001296ac87a: vpunpcklqdq %xmm0,%xmm0,%xmm0
0x00000001296ac87e: xchg   %ax,%ax

0x00000001296ac880: vmovdqu %xmm0,0x10(%rbx,%rbp,4)  
0x00000001296ac886: add    $0x4,%ebp          
0x00000001296ac889: cmp    %r8d,%ebp
0x00000001296ac88c: jl     0x00000001296ac880  

If my understanding is correct, the first block of assembly is the one which does nums[i] = 0x42;. In the third block there's vmovdqu, which the documentation describes as follows:

The vmovdqu instruction moves values from an integer vector to an unaligned memory location.

However, I still don't fully understand what vmovdqu is doing in context of my loop.

What exactly is the third block of assembly code doing?

The complete code is available here: https://pastebin.com/cT5cJcMS

Shinny answered 16/5, 2018 at 16:36 Comment(12)
@Sneftel ah, I elided it for sake of brevity. I'll add the actual assembly code. :)Shinny
This does look like the result of auto-vectorization, but you should really post the whole code, several important parts are missing.Cavalcade
@MatteoItalia Updated the post with link to complete code :)Shinny
vmovq-load + vpunpcklqdq is a sub-optimal way to broadcast! If you're going to load a constant from memory, you should use vbroadcastss (mem), %xmm0. That costs the same as a vmovd scalar load; the broadcasting is free inside the load port on all CPUs that have AVX.Holily
Also, is it really loading data from only 0xda bytes before this code? Normally there's no advantage to keeping code and data in the same page, because modern x86 uses separate iTLB and dTLB. With AVX2, I'd have put 0x42 in a register and used vmovd %eax, %xmm0 / vpbroadcastd %xmm0, %xmm0 (or vpshufd), instead of loading a constant from memory. And replicating 0x42 twice in memory makes no sense, unless their JIT optimizer just sucks at shuffles. I guess we do have to account for the fact that a JIT has to compile fast and doesn't take the time to make more optimal code.Holily
@PeterCordes But it's only a single access. It makes sense that the compiler decided to not allocate a register for it. Also note that the compiler decided to not use any of the callee-saved registers. I'm not sure why the compiler didn't choose VBROADCAST though.Synaeresis
@HadiBrais: (The compiler does use RBX and RBP already, so yeah presumably there's register pressure and RAX/RCX/RDX/... are busy). Loading static data is a valid choice for constants in vectors, because (unlike ARM or PowerPC) x86 lacks any instructions that put immediates into vectors. But it does need 0x42 inside the scalar loop, so it could have used r8d. mov $0x42, %r8d / vmovd %r8d, %xmm0. A more obviously bad decision: an indexed addressing mode for the store defeats micro-fusion on Sandybridge/Ivybridge, although the loop will still run at 1 iteration per clock there.Holily
The worst thing is that nums.length should be a compile-time constant if the new succeeds, then you can fill the array with 2 vmovdqu xmm + 1 vmovq xmm (fully unroll the loop). I wonder if it helps to use 10 as the loop bound? Or since nums isn't returned, the whole function should just be a ret, or just nothing after inlining.Holily
@PeterCordes But it's using r8d in the loop itself. So it's not free.Synaeresis
@HadiBrais: I meant do the vmovd %r8d, %xmm0 before you clobber r8d with mov %r10d,%r8d. r8d isn't used in the scalar loop, and it's clobbered after (at least on the path that leads to the vector loop). Oh, I see yet another missed optimization: mov %r10d,%r8d / add $0xfffffffd,%r8d should be an LEA. WTF is wrong with the peephole optimizer to not spot that? The whole block looks overcomplicated.Holily
Tagged this jvm-hotspot based on the guess that you're using that JIT compiler. Please confirm that guess, or edit to correct it.Holily
@PeterCordes I don't think there is a peephole optimizer and probably no instruction scheduler to move vmovd anywhere :)Synaeresis

The optimizer has chosen to vectorize your loop, setting 4 values per "iteration". (The instructions preceding the vmovdqu are fairly opaque, but presumably it's splatting 0x42 into all lanes of XMM0.) The "unaligned" variant is necessary because the array is not guaranteed to be SIMD-aligned in memory (after all, it's storing int32s, not int32x4s).
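
In C intrinsics terms, here is a rough sketch of what one iteration of that vectorized loop boils down to (a hedged illustration using SSE2 intrinsics; the helper name is made up, not anything the JIT actually emits):

#include <emmintrin.h>                  /* SSE2 intrinsics */

/* Illustrative helper: one iteration of the vector loop. */
void store_four_0x42(int *p)            /* p does not need to be 16-byte aligned */
{
    __m128i v = _mm_set1_epi32(0x42);   /* splat 0x42 into all four 32-bit lanes */
    _mm_storeu_si128((__m128i *)p, v);  /* unaligned 16-byte store: the vmovdqu */
}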

Allowable answered 16/5, 2018 at 16:46 Comment(7)
Ok, could you tell me what the need for the first loop was? (first block of assembly) :)Shinny
The vectorized loop is only able to set exactly four elements at a time. The first block picks up the "spares", so that the remaining number of elements is a multiple of 4.Allowable
So if I understand correctly, the first block would probably run 2 iterations and the remaining 8 elements would be processed by vmovdqu in chunks of 4. :) So in total the loop would get over in 4 iterations instead of 10?Shinny
Exactly. Note that invoking all that machinery, just to set ten elements, is very silly. It's only doing it because it's not quite smart enough to figure out that nums.length will always be 10.Allowable
So would this be considered loop unrolling?Shinny
I wouldn't call it that. The body is not replicated multiple times. Vectorization is a better word.Scorecard
@LittleChild: vectorization is loop unrolling + doing multiple iterations in parallel with SIMD. You'd normally only call it "unrolling" if you unroll even more and do multiple vectors per asm loop iteration.Holily

Your JIT compiler auto-vectorized your loop, storing 4 ints per asm iteration.

But it made asm that's overcomplicated and missed a lot of optimizations. I wonder if this is maybe only first-stage code-gen before the JIT compiler decided to fully optimize?

Your code doesn't return nums, so it's destroyed right after being created. After inlining, your function should optimize down to no instructions at all. Or as a stand-alone function, should be just a ret. Allocating memory and then letting it get garbage collected is not an observable side-effect that an optimizer needs to preserve.

Then, if new succeeds, nums.length will be 10. So the code could be as simple as

# %rbx holds a pointer to the actual data storage for nums[]
vbroadcastss  -0x????(%rip),%xmm0      # broadcast-load the 0x42 constant into xmm0
vmovdqu       %xmm0,   (%rbx)          # nums[0..3] : 16 bytes
vmovdqu       %xmm0, 16(%rbx)          # nums[4..7] : 16 bytes
vmovq         %xmm0, 32(%rbx)          # nums[8..9] : 8 bytes

Fully unrolling the loop makes the most sense here; setting up loop counters and so on takes more instructions and code-size than a couple stores. Especially when the size isn't a multiple of the vector width, the last partial vector has to be handled specially anyway.

BTW, if your size had been 11 instead of 10, you could either do 8 + 4 byte stores, or a 16-byte store that partially overlapped, e.g. 16-byte vmovdqu stores to (%rbx), 16(%rbx), and 28(%rbx) covering nums[7..10]. A final unaligned vector that ends at the end of the array is a common strategy when manually vectorizing (or in glibc's small-buffer handling for memcpy), but even ahead-of-time compilers don't seem to use it.
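
Expressed as C intrinsics, a hedged sketch of that overlapping-final-vector idea might look like this (SSE2; fill_overlap is a made-up name, and it assumes len >= 4):

#include <stddef.h>
#include <emmintrin.h>

/* Fill len ints with 0x42 using only full 16-byte stores; the last store ends
 * exactly at nums[len] and may overlap the previous one (no scalar cleanup). */
static void fill_overlap(int *nums, size_t len)     /* assumes len >= 4 */
{
    __m128i v = _mm_set1_epi32(0x42);
    size_t i;
    for (i = 0; i + 4 <= len; i += 4)
        _mm_storeu_si128((__m128i *)(nums + i), v);
    _mm_storeu_si128((__m128i *)(nums + len - 4), v);  /* final, possibly overlapping */
}

For len = 11 this does exactly the three stores described above: offsets 0, 16, and 28.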


Other obvious missed optimizations:

  • vmovq load + vpunpcklqdq to broadcast. With AVX available, vbroadcastss is by far the best way to broadcast a 32-bit constant from memory. One instruction with no ALU uop required. Maybe the JIT compiler doesn't actually know about new AVX instructions?

  • mov %r10d,%r8d + add $-3,%r8d: this should obviously be an lea -3(%r10), %r8d.

It's not clear what the starting value of %ebp is supposed to be; if the JVM is slicing off chunks of a buffer somewhere so RBX isn't the base of the array, then maybe EBP isn't 0 before the scalar loop? IDK why the loop bound of the scalar loop is in a register instead of an immediate, though.

It's odd to put static data in the same page as code (-0xda(%rip) is still in the same page). There's no big penalty, but it means the same page will need to be in the iTLB and dTLB, so you're covering less total code + data than if you used separate pages. Not a huge deal with 2M hugepages, though. The shared 2nd-level TLB is a victim cache (IIRC), so the iTLB miss that populates it probably won't help the vmovq load get a TLB hit. It will probably do a 2nd page walk.


I don't know why even good ahead-of-time C compilers like gcc and clang overcomplicate this so much, for a loop over an array with unknown alignment and length.

void set42(int *nums, unsigned long int len) {
    for (unsigned long int i=0 ; i<len ; i++ ) {
        *nums++ = 0x42;
    }
}

This is what I'd do by hand, for 128-bit vectors without loop unrolling (and optimistically assuming that it's not worth reaching an alignment boundary, like your JIT, and like clang and gcc8 and higher):

# x86-64 System V calling convention: int*nums in RDI,  len in RSI
set42:
    cmp           $4, %rsi
    jb          .Lsmall_count

    lea          -16(%rdi, %rsi,4), %rdx    # pointer to end-16, the start of the final vector store
    vbroadcastss constant(%rip), %xmm0
 .p2align 4
 .Lvector:                          # do {
    vmovdqu      %xmm0, (%rdi)
    add          $16, %rdi          # nums += 4 elements
    cmp          %rdx, %rdi
    jb           .Lvector           # while(nums < end-16);

    # only reached for sizes >= 16 bytes so we can always store a full possibly-overlapping final vector
    # for len = 16, this results in 2 stores to the same address, but that's cheaper than extra branches even if len=16 is common
    vmovdqu      %xmm0, (%rdx)   # final potentially-overlapping vector
    ret

.Lsmall_count:
    test         %rsi,%rsi
    jz          .Ldone
   # some compilers will fully unroll this with a chain of branches
   # maybe worth doing if small inputs are common
  .Lscalar:                       # do {
    movl        $0x42, (%rdi)
    add         $4, %rdi          # *nums++ = 0x42;
    dec         %rsi
    jnz         .Lscalar          # }while(--len);
  # a more sophisticated cleanup strategy using SIMD is possible, e.g. 8-byte stores,
  # but I haven't bothered.

.Ldone:
    ret

Notice that for len>=4, there's one fall-through branch at the top, then only the loop branch. Total overhead is 1 macro-fused cmp/jcc, 1 broadcast load, and 1 lea. The loop is 3 uops with a non-indexed addressing mode.

AFAIK, compilers don't know how to use a possibly-overlapping last vector effectively. It's much better than scalar cleanup most of the time. Notice that for len=4 (16 bytes), we do the same vector store twice. But for len=8 (32 bytes), the loop exits after the first iteration, so we still only do 2 total stores. i.e. with any exact multiple of the vector width other than 1 vector, we don't do an overlapping store. Branching the same way for len=4 and len=8 is actually nice for branch prediction.


Even good ahead-of-time C compilers make this super-complicated, as you can see on the Godbolt compiler explorer. Some of clang's complexity comes from unrolling more; clang6.0 unrolls a huge number of times. (I chose the compiler versions and options that led to the least complicated code. gcc7.3 and clang6.0 emit much larger functions for this.)

gcc7 and earlier go scalar until an alignment boundary, then use aligned vector stores. This can be good if you expect the pointer to be misaligned often, but saving those setup instructions makes the already-aligned case even cheaper, which is the better trade-off when the pointer usually is aligned and the penalty for misaligned stores is low.
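
For reference, a hedged C-intrinsics sketch of that gcc7-style strategy (the function name is invented; it assumes nums has the natural 4-byte alignment of an int*):

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>

static void fill_align_first(int *nums, size_t len)
{
    __m128i v = _mm_set1_epi32(0x42);
    size_t i = 0;
    while (i < len && ((uintptr_t)(nums + i) & 15) != 0)  /* scalar until 16-byte aligned */
        nums[i++] = 0x42;
    for (; i + 4 <= len; i += 4)
        _mm_store_si128((__m128i *)(nums + i), v);         /* aligned stores (vmovdqa) */
    for (; i < len; i++)                                    /* scalar tail */
        nums[i] = 0x42;
}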

Holily answered 17/5, 2018 at 9:25 Comment(0)

The compiler has unrolled the loop to enable vectorization.

// r10d holds the length of the array and ebp holds the loop index.
0x00000001296ac845: cmp    %r10d,%ebp
// This branch is only taken when the loop index `i` is larger or equal to `nums.length`.
0x00000001296ac848: jae    0x00000001296ac8b4
// Performs a single iteration.
0x00000001296ac84a: movl   $0x42,0x10(%rbx,%rbp,4)
// Increment the loop index.
0x00000001296ac852: inc    %ebp
// r11d contains some constant. This is just to ensure that the number of any remaining iterations is a multiple of 4.
0x00000001296ac854: cmp    %r11d,%ebp
// This branch is NOT taken (falls through) only when either zero iterations are left or when the number of remaining iterations is a multiple of 4.
0x00000001296ac857: jl     0x00000001296ac845  
// These instructions make sure that the loop index does not overflow.
0x00000001296ac859: mov    %r10d,%r8d
0x00000001296ac85c: add    $0xfffffffd,%r8d
0x00000001296ac860: mov    $0x80000000,%r9d
0x00000001296ac866: cmp    %r8d,%r10d
0x00000001296ac869: cmovl  %r9d,%r8d
// The next two instructions check whether there are any remaining iterations.
0x00000001296ac86d: cmp    %r8d,%ebp
0x00000001296ac870: jge    0x00000001296ac88e
// If we reach here, the number of remaining iterations must be a multiple of 4.
// Initialize xmm0 with 4 copies of 0x42.
0x00000001296ac872: vmovq  -0xda(%rip),%xmm0                                                    
0x00000001296ac87a: vpunpcklqdq %xmm0,%xmm0,%xmm0
// This is a NOP just to align the loop on a 64-byte cache line boundary for performance.
0x00000001296ac87e: xchg   %ax,%ax
// Vectorized 4 iterations of the loop.
0x00000001296ac880: vmovdqu %xmm0,0x10(%rbx,%rbp,4)  
0x00000001296ac886: add    $0x4,%ebp          
0x00000001296ac889: cmp    %r8d,%ebp
0x00000001296ac88c: jl     0x00000001296ac880
// All iterations have been executed at this point.
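
Put back into source-level terms, here is a hedged sketch of the structure described above (SSE2 intrinsics; fill_like_jit is a made-up name, not what HotSpot actually compiles from):

#include <emmintrin.h>

static void fill_like_jit(int *nums, int len)
{
    int i = 0;
    /* scalar pre-loop (first block): single iterations until the
       remaining count is a multiple of 4 */
    for (; i < len && (len - i) % 4 != 0; i++)
        nums[i] = 0x42;

    __m128i v = _mm_set1_epi32(0x42);       /* the vmovq + vpunpcklqdq broadcast */
    /* vectorized loop (third block): each vmovdqu stores 4 ints */
    for (; i < len; i += 4)
        _mm_storeu_si128((__m128i *)(nums + i), v);
}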
Synaeresis answered 16/5, 2018 at 23:24 Comment(3)
I assume the JIT would only try to align loop branches to 8 or 16 bytes. It just happens that it's also a 64-byte boundary; it would be extremely unusual and wasteful to align loops to 64 bytes in general. (I sometimes do that in a microbenchmark just to make doubly sure there aren't any confounding factors from the front-end, but normal compilers don't.)Holily
@PeterCordes Yeah I'm not sure whether the 64-byte alignment was intentional or not.Synaeresis
I'm close enough to sure it isn't intentional :P Loop buffers on modern CPUs make alignment for small loops unimportant. (Although Skylake with updated microcode doesn't have a loop buffer, and neither does Kaby Lake, so splitting a tiny loop into two uop cache lines can make it issue at half speed. gcc aligns to at most 16 bytes, though, and then only if it will take less than 10 bytes of padding, or something like that.)Holily
