Is there any way to get Node.JS and V8 to automatically vectorize simple loops?
I'm currently working as an author and contributor to SSIM.js and jest-image-snapshot on behalf of Not a Typical Agency. Much of this work involves creating new algorithms to compare images in JavaScript. After a lot of research and testing with AssemblyScript and WebAssembly, I've found that I can often get higher-performing and more portable solutions with pure JS than I do with either of those two technologies. However, that only happens after performing an extensive review of the generated assembly and many experiments.

What I'd like to understand is whether there's a way to get Node.JS/libV8 to automatically vectorize sections of code. As an example, I have a two-pass loop that calculates prefix sums for each pixel in an image, horizontally and then vertically. Setting aside the horizontal prefix sum (which can be challenging to vectorize for a real performance improvement even in hand-written assembly), the vertical prefix sum should be super simple to optimize. Here's an example:

    // Vertical pass: fold each row into the row below it, one row at a time.
    for (let h = 0; h + 1 < height; ++h) {
      for (let w = 0; w < width; ++w) {
        let above = pSumArray[h * width + w];
        let current = pSumArray[(h + 1) * width + w];
        pSumArray[(h + 1) * width + w] = above + current;
      }
    }

This takes all of the preexisting horizontal prefix sum calculations, and adds adjacent lines in the image going forward, one line at a time, all the way until it reaches the end.
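For context, the horizontal pass skipped over above can be sketched in plain JS like this (a hypothetical reconstruction, not my actual code; the function name and in-place layout are assumptions):

```javascript
// Horizontal prefix-sum pass (sketch): each element accumulates the sum of
// everything to its left within its own row, in place.
function horizontalPrefixSum(pSumArray, width, height) {
  for (let h = 0; h < height; ++h) {
    const row = h * width;
    for (let w = 1; w < width; ++w) {
      pSumArray[row + w] += pSumArray[row + w - 1];
    }
  }
  return pSumArray;
}
```

After both passes, each cell holds the sum of the rectangle from the origin to that cell, which is what makes constant-time window sums possible for SSIM-style comparisons.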

The assembler output for the vertical loop looks something like this:

0x11afd3a4adb1   1d1  c4a17b1044c10f vmovsd xmm0,[rcx+r8*8+0xf]
0x11afd3a4adb8   1d8  4c8bc2         REX.W movq r8,rdx
0x11afd3a4adbb   1db  4503c6         addl r8,r14
0x11afd3a4adbe   1de  0f80ce020000   jo 0x11afd3a4b092  <+0x4b2>
0x11afd3a4adc4   1e4  443bc7         cmpl r8,rdi
0x11afd3a4adc7   1e7  0f83d1020000   jnc 0x11afd3a4b09e  <+0x4be>
0x11afd3a4adcd   1ed  c4a17b104cc10f vmovsd xmm1,[rcx+r8*8+0xf]
0x11afd3a4add4   1f4  c5fb58c1       vaddsd xmm0,xmm0,xmm1
0x11afd3a4add8   1f8  c4a17b1144c10f vmovsd [rcx+r8*8+0xf],xmm0

If you look closely, you can see it's performing all of the load, store, and add operations with 'sd' (scalar double) instructions. This means it's only working with one double at a time. The documentation on these instructions is in Intel's instruction set reference.

This is relevant in this application since it's a near certainty that width is a multiple of 2, if not 16. As a result, we could be performing twice as many adds at a time on a machine that only supports SSE2 instructions, and up to 8 adds at a time if the machine supports AVX-512. Even though Node.js and libv8 seem to do a decent job of runtime CPU detection, I still haven't found a way to get them to automatically vectorize a loop like this. I've tried a handful of strategies, including specific conditionals (e.g. is width divisible by 8) combined with loop unrolling (e.g. array[0] = above0 + current0; array[1] = above1 + current1; array[2] = above2 + current2; ...), but none have yielded successful results.
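For reference, the kind of unrolling experiment described above might look like the following sketch (hypothetical; the function name is made up, and it assumes width is divisible by 4):

```javascript
// Hand-unrolled vertical prefix-sum pass: four independent adds per inner
// iteration, in the hope that the JIT notices the pattern and emits packed
// SIMD instructions. Assumes width % 4 === 0.
function verticalPrefixSumUnrolled(pSumArray, width, height) {
  for (let h = 0; h + 1 < height; ++h) {
    const above = h * width;
    const current = (h + 1) * width;
    for (let w = 0; w < width; w += 4) {
      pSumArray[current + w]     += pSumArray[above + w];
      pSumArray[current + w + 1] += pSumArray[above + w + 1];
      pSumArray[current + w + 2] += pSumArray[above + w + 2];
      pSumArray[current + w + 3] += pSumArray[above + w + 3];
    }
  }
  return pSumArray;
}
```

In my testing, variations like this still compiled to scalar vaddsd sequences rather than packed adds.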

Any help someone can provide me on this topic would be greatly appreciated.

Thanks so much,

Dan

Angelus answered 12/9, 2020 at 20:10 Comment(10)
It's checking for overflow of the index at every step (with jo after the addl). Is that the outer-loop logic? And you're not showing the actual inner loop that is just contiguous array += array? The outer loop is clearly pretty far from doing enough loop-invariant analysis to even hoist the overflow check. But maybe the inner loop asm is simpler. – Carny
Note that in asm or C you can SIMD-vectorize a prefix sum / horizontal scan. But if you don't have good support for manually vectorized code (as apparently in your case with JS), you don't have that option. At least doing multiple rows in parallel will hide some of the FP latency. – Carny
@PeterCordes There were so many jumps in the code, and it's unlabeled by code line, so I'm not entirely sure what's going on. Sometimes I put magic numbers in there just so I can see what line it's on. I'll add a gist or the full code somewhere else so you can experience the pain... ;-) With respect to your second comment, I know a horizontal scan can be done in SIMD -- it's just not something I would expect a compiler to properly vectorize. To the extent you can vectorize it, it offers better performance, but not as much as one would hope. [It was like 1ms for every 2073600 bytes.] – Angelus
@PeterCordes here is the full code and assembler output. You'll notice it's not exactly the same as the code pasted above, but that's because I wanted a simpler example to show. In either case, you should be able to find the answer to your jo/addl question. – Angelus
So that vmovsd / vaddsd / vmovsd sequence is actually inside a pretty large loop body from 1b0 to 210, with a lot of branching and integer moves. That's way worse than you'd expect for just looping over contiguous doubles incrementing r8 by 1. The add/jo and cmp/jnc you showed are both just overflow checks, not part of the useful loop structure. No wonder it was confusing. But yeah, no wonder V8 is nowhere near vectorizing this. Unless this is not final optimized JIT output, but just a first pass? IIRC V8 TurboFan waits for profile feedback to find hot code to fully optimize. – Carny
@PeterCordes if you look closely, there is more than one assembly listing in there. I would scroll from the bottom up. – Angelus
I'm not that interested in wading through some huge mess of code you chose to link. It looks like the same code your snippet was taken from, and it has an add r8, 1 which makes sense for the w++. If it's not the right code, link the line number or something if there's an actual optimized loop that's more interesting. – Carny
@PeterCordes that's the right line. There's no more optimization than that, and that's after it ran through an end-to-end test with 181 images. Thanks for taking a look. – Angelus
You could speed up your AssemblyScript code, btw. Just rewrite it using unchecked, as: for (let h = 0, stride = 0; h < height - 1; ++h) { for (let w = 0; w < width; ++w) { unchecked(pSumArray[stride + width + w] += pSumArray[stride + w]); } stride += width; } – Isomeric
I tested that, believe it or not, and V8 turned out to be faster. It was really surprising, but V8 makes the same unchecked optimization if you use a standard JavaScript Array (not typed). It'll detect whether the code ever reaches out of bounds, and if it determines it doesn't, it yields behavior like what you're suggesting. – Angelus

V8 doesn't do any auto-vectorization currently. (Source: I'm a V8 developer.) This may or may not change in the future; people are toying with ideas every now and then, but I'm not aware of any specific plans.

Wasm-SIMD is getting close to general availability (it's currently in "origin trial" experimental limited release phase), which will allow you to use SIMD instructions via WebAssembly. (Wasm being much lower level than JavaScript means that it generally gives you better control over what instruction sequence will be generated. The Wasm-SIMD instruction set in particular has been chosen such that it maps pretty well onto common hardware instructions.)

Endive answered 12/9, 2020 at 21:29 Comment(5)
I've been playing a lot with Wasm and experimenting with Wasm SIMD, but I'm not convinced it's worth the effort to use. To put this in perspective, I'm evaluating the best, most portable, and most maintainable solution we can include in our open-source projects. – Angelus
This gives us three effective language choices: WAT itself, AssemblyScript, and C++ (Emscripten). I don't know enough about WAT to make it worthwhile, so I haven't been able to test it that way. AssemblyScript would seem like a natural option since it's supposedly derived from TypeScript, but it's really difficult to work with and the language conventions aren't up to snuff. And if we switch to C++ and _mm intrinsics, we're in a different language already, and the intrinsics aren't guaranteed to be the most runtime-efficient for a given CPU (or even use the native ones). – Angelus
Unfortunately, all of this is compounded by the fact that WebAssembly SIMD's set of operations is a subset of what I can get elsewhere and perform portably. See libsimdpp for a really awesome (runtime-detecting) instruction set. So if I'm stuck with C++, why not just build a C++ extension/module for Node? – Angelus
If the difference between the Wasm-SIMD instruction set and the (possibly dynamically detected) full hardware capabilities really matters to you, then a C++ extension/module may well be the way to go for now. You can always re-evaluate when future versions of Wasm-SIMD provide more instructions, depending on your experiences and needs. – Endive
You're right. I just want it to work and to be easy, and unfortunately it hasn't been. I was really impressed when the scalar implementation shown in the gist performed faster with V8 than it did with AssemblyScript -- by like 33-40%. That led me to believe that V8 might have some automatic vectorization somewhere. I think it's worth retrying with Emscripten -- at least for some things like converting rgb2yuv or image resampling. I just don't know if the SIMD ops there will work the same everywhere, even if wasm-feature-detect says SIMD is there. – Angelus