I see that Duff's device is just a way to do loop unrolling in C.
https://en.wikipedia.org/wiki/Duff%27s_device
I am not sure why it is still useful nowadays. Shouldn't the compiler be smart enough to do loop unrolling on its own?
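For reference, the classic form from the linked Wikipedia article: `to` is a memory-mapped output register (hence never incremented), and the `switch` jumps into the middle of the unrolled `do{}while` loop to handle the `count % 8` leftover copies without a separate cleanup loop:

```c
/* Duff's Device, as given in the Wikipedia article; assumes count > 0. */
void send(short *to, short *from, int count)
{
    int n = (count + 7) / 8;      /* number of do{}while passes */
    switch (count % 8) {          /* jump into the loop body */
    case 0: do { *to = *from++;   /* each case falls through */
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}
```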
Compilers are good at loop unrolling, but sometimes "obvious" optimizations are suppressed when the compiler can't prove they're correct. In the Duff's Device case, the target was a memory-mapped register and the source was an arbitrary pointer. Today, the memory-mapped register would likely have to be tagged as `volatile`, and it's not clear whether the compiler could determine that the source and destination pointers never alias. Either of these might inhibit the optimization.
Things like `memcpy` (which is similar to, but different from, Duff's device) are often "special" functions known to the compiler that may have multiple hand-optimized variants built in. Expecting the compiler to generate `memcpy` from "first principles" may not produce as highly optimized a version as you might expect.
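As a minimal sketch of those two constraints (the function and parameter names here are illustrative, not from Duff's original code):

```c
/* Sketch: copying to a memory-mapped register.  The volatile
   qualifier makes every store observable, so the compiler can't
   coalesce, reorder, or vectorize them; restrict asserts that the
   source aliases nothing else, which the compiler otherwise
   couldn't prove on its own. */
void send_to_register(volatile short *reg, const short *restrict from, int count)
{
    while (count-- > 0)
        *reg = *from++;   /* each store must actually be emitted, in order */
}
```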
Duff's Device is not just about loop unrolling but about how to handle the excess copies without an extra loop. This saves code space, which is probably less of an issue now. Whether compilers do the equivalent thing when unrolling loops, I don't know.
Is it useful? Possibly, in certain rare cases. That was arguably true when Duff's Device was originally invented as well.
Loop unrolling is useful in asm, and sometimes in C when the compiler doesn't do it for you.
But Duff's Device specifically, nesting a `do{}while` inside a `switch{}` to do an indirect jump to any of 8 loop entry points, is usually not useful. It's possibly useful if you're optimizing for a machine that can increment pointers for free, e.g. pre- or post-increment addressing as part of a load (like ARM or PowerPC, but not x86 or RISC-V). Having multiple entry points into a loop means each pointer increment still has to get done separately, instead of `add rdi, 64` for `ptr += 8;` and using addressing modes like `[rdi+8]`, `[rdi+16]` to address `ptr[1]`, `ptr[2]`, etc.
It also constrains how the unrolled loop can be optimized, e.g. the compiler can't schedule the loads early to hide more load latency. This wasn't a thing on early computers where CPUs weren't pipelined. This can also impact auto-vectorization.
So it's generally not great even for an ISA that could increment pointers for free, except maybe on simple microcontrollers. Optimal asm for CPUs usually involves some code outside the loop to check if the iteration count is greater than `unroll_count`, otherwise use a fallback loop, especially when the total count is usually large so the remainder isn't a huge deal. Extra code outside the loop is usually better than trying to have 4 or 8 possible loop entry points that you select with a jump table or a chain or tree of compare/branch.
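A sketch of that pattern, with an illustrative unroll factor of 4 and hypothetical names:

```c
#include <stddef.h>

void copy_unrolled(short *dst, const short *src, size_t n)
{
    size_t i = 0;
    /* Main loop: 4 elements per iteration, one index update total. */
    for (; i + 4 <= n; i += 4) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    /* Fallback loop for the remaining 0..3 elements. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```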
Whether you can get the compiler to unroll nicely for you or not depends on the code and the compiler. Clang usually does a pretty good job, including turning `sum += arr[i]` into multiple accumulators (the asm equivalent of `sum0 += arr[i+0]; sum1 += arr[i+1];` etc.), especially with higher-latency operations like floating-point. GCC is normally terrible at that, unrolling but still adding into the same one accumulator (or single SIMD vector of accumulators), even with `-O3 -ffast-math -funroll-loops`, so manual vectorization and unrolling are more necessary with GCC for something like an FP dot product. But manually unrolling with multiple scalar `float` accumulators in C can be counter-productive if it stops the compiler from seeing the pattern and auto-vectorizing.

(GCC `-O3` doesn't enable loop unrolling by default, only with PGO: `-fprofile-generate` / `-fprofile-use`.)
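To make the multiple-accumulator idea concrete, a manually unrolled FP dot product might look like this sketch (assuming, for brevity, that `n` is a multiple of 4; a real version needs remainder handling as discussed above):

```c
#include <stddef.h>

float dot(const float *a, const float *b, size_t n)
{
    /* Four independent accumulators create four separate FP-add
       dependency chains that can overlap, hiding add latency. */
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);   /* pairwise final reduction */
}
```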
Related: tricks for manually vectorizing and handling an odd number of elements. That's a very similar problem to manually unrolling, except that doing 4, 8, or 16 elements or whatever for the price of 1 has implications for cleanup strategies to still use vectors. Especially with narrow elements and wide vectors, so the remainder is potentially a lot of elements for a scalar fallback; you might do some cleanup with narrower vectors.
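One such trick, for copy-like operations where re-processing a few elements is harmless, is to handle the tail with one final full-width block that ends exactly at `n`, overlapping the last full block instead of falling back to scalar code. A sketch, assuming `n >= 16` and relying on compilers lowering a fixed-size `memcpy` to a single vector load/store:

```c
#include <string.h>

void copy_overlapping_tail(char *dst, const char *src, size_t n)
{
    size_t i;
    /* Main loop: whole 16-byte blocks. */
    for (i = 0; i + 16 <= n; i += 16)
        memcpy(dst + i, src + i, 16);
    /* Cleanup: one more 16-byte block ending exactly at n; it may
       overlap the last full block, which is fine for a plain copy. */
    if (i < n)
        memcpy(dst + n - 16, src + n - 16, 16);
}
```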
Duff’s device and the other tricks we old guys used to shave machine cycles off functions, interrupts, and overall code are still relevant when the code runs for CPU-years.
If you don’t have time to do it right, it seems you always find time to do it over. Yeah.