I see that Duff's device is just a way to do loop unrolling in C.
https://en.wikipedia.org/wiki/Duff%27s_device
I am not sure why it is still useful nowadays. Shouldn't the compiler be smart enough to do loop unrolling on its own?
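For reference, the classic form from the linked Wikipedia article: `to` is a memory-mapped output register (hence never incremented), and the `switch` jumps into the middle of the unrolled `do{}while` loop to handle the `count % 8` leftover copies without a separate cleanup loop:

```c
/* Duff's Device, as given in the Wikipedia article; assumes count > 0. */
void send(short *to, short *from, int count)
{
    int n = (count + 7) / 8;      /* number of do{}while passes */
    switch (count % 8) {          /* jump into the loop body */
    case 0: do { *to = *from++;   /* each case falls through */
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}
```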
Compilers are good at loop unrolling, but sometimes "obvious" optimizations are suppressed when the compiler can't prove they're correct. In the Duff's Device case, the target was a memory-mapped register and the source was an arbitrary pointer. Today, the memory-mapped register would likely have to be tagged as `volatile`, and it's not clear whether the compiler could determine that the source and destination pointers never alias. Either of these might inhibit the optimization.
Things like `memcpy` (which is similar to, but different from, Duff's device) are often "special" functions known to the compiler that may have multiple hand-optimized variants built in. Expecting the compiler to generate `memcpy` from "first principles" may not produce as highly optimized a version as you might expect.
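As a minimal sketch of those two constraints (the function and parameter names here are illustrative, not from Duff's original code):

```c
/* Sketch: copying to a memory-mapped register.  The volatile
   qualifier makes every store observable, so the compiler can't
   coalesce, reorder, or vectorize them; restrict asserts that the
   source aliases nothing else, which the compiler otherwise
   couldn't prove on its own. */
void send_to_register(volatile short *reg, const short *restrict from, int count)
{
    while (count-- > 0)
        *reg = *from++;   /* each store must actually be emitted, in order */
}
```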
Duff's Device is not just about loop unrolling but about how to handle the excess copies without an extra loop. This saves code space, which is probably less of an issue now. Whether compilers do the equivalent thing when unrolling loops, I don't know.
Is it useful? Possibly, in certain rare cases. That was arguably true when Duff's Device was originally invented as well.
Loop unrolling is useful in asm, and sometimes in C when the compiler doesn't do it for you.
But Duff's Device specifically, nesting a `do{}while` inside a `switch{}` to do an indirect jump to any of 8 loop entry points, is usually not useful. It's possibly useful if you're optimizing for a machine that can increment pointers for free, e.g. pre- or post-increment addressing as part of a load (like ARM or PowerPC, but not x86 or RISC-V). Having multiple entry points into a loop means each pointer increment still has to get done separately, instead of `add rdi, 64` for `ptr += 8;` and using addressing modes like `[rdi+8]`, `[rdi+16]` to address `ptr[1]`, `ptr[2]`, etc.
It also constrains how the unrolled loop can be optimized, e.g. the compiler can't schedule the loads early to hide more load latency. This wasn't a thing on early computers where CPUs weren't pipelined. This can also impact auto-vectorization.
So it's generally not great even for an ISA that could increment pointers for free, except maybe on simple microcontrollers. Optimal asm for CPUs usually involves some code outside the loop to check if the iteration count is greater than `unroll_count`, otherwise use a fallback loop, especially when the total count is usually large so the remainder isn't a huge deal. Extra code outside the loop is usually better than trying to have 4 or 8 possible loop entry points that you select with a jump table or a chain or tree of compare/branch.
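A sketch of that pattern, with an illustrative unroll factor of 4 and hypothetical names:

```c
#include <stddef.h>

void copy_unrolled(short *dst, const short *src, size_t n)
{
    size_t i = 0;
    /* Main loop: 4 elements per iteration, one index update total. */
    for (; i + 4 <= n; i += 4) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    /* Fallback loop for the remaining 0..3 elements. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```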
Whether you can get the compiler to unroll nicely for you or not depends on the code and the compiler. Clang usually does a pretty good job, including turning `sum += arr[i]` into multiple accumulators (the asm equivalent of `sum0 += arr[i+0]; sum1 += arr[i+1];` etc.), especially with higher-latency operations like floating-point. GCC is normally terrible at that, unrolling but still adding into the same one accumulator (or single SIMD vector of accumulators), even with `-O3 -ffast-math -funroll-loops`, so manual vectorization and unrolling are more necessary with GCC for something like an FP dot product. But manually unrolling with multiple scalar `float` accumulators in C can be counter-productive if it stops the compiler from seeing the pattern and auto-vectorizing.

(GCC `-O3` doesn't enable loop unrolling by default, only with PGO: `-fprofile-generate` / `-fprofile-use`.)
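To make the multiple-accumulator idea concrete, a manually unrolled FP dot product might look like this sketch (assuming, for brevity, that `n` is a multiple of 4; a real version needs remainder handling as discussed above):

```c
#include <stddef.h>

float dot(const float *a, const float *b, size_t n)
{
    /* Four independent accumulators create four separate FP-add
       dependency chains that can overlap, hiding add latency. */
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);   /* pairwise final reduction */
}
```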
Related: tricks for manually vectorizing and handling an odd number of elements. That's a very similar problem to manually unrolling, except that doing 4, 8, or 16 elements or whatever for the price of 1 has implications for cleanup strategies to still use vectors. Especially with narrow elements and wide vectors, so the remainder is potentially a lot of elements for a scalar fallback; you might do some cleanup with narrower vectors.
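One such trick, for copy-like operations where re-processing a few elements is harmless, is to handle the tail with one final full-width block that ends exactly at `n`, overlapping the last full block instead of falling back to scalar code. A sketch, assuming `n >= 16` and relying on compilers lowering a fixed-size `memcpy` to a single vector load/store:

```c
#include <string.h>

void copy_overlapping_tail(char *dst, const char *src, size_t n)
{
    size_t i;
    /* Main loop: whole 16-byte blocks. */
    for (i = 0; i + 16 <= n; i += 16)
        memcpy(dst + i, src + i, 16);
    /* Cleanup: one more 16-byte block ending exactly at n; it may
       overlap the last full block, which is fine for a plain copy. */
    if (i < n)
        memcpy(dst + n - 16, src + n - 16, 16);
}
```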
Duff’s device and the other tricks we old guys used to shave machine cycles off functions, interrupts, and overall code are still relevant when the code runs for CPU-years.
If you don’t have time to do it right, it seems you always find time to do it over. Yeah.