How does Duff's device work?

Asked 5/2, 2009 at 1:6 Answered 29/8, 2020 at 15:55

180

I've read the article on Wikipedia on the Duff's device, and I don't get it. I am really interested, but I've read the explanation there a couple of times and I still don't get it how the Duff's device works.

What would a more detailed explanation be?

Washington answered 5/2, 2009 at 1:6 Comment(0)

270

There are some good explanations elsewhere, but let me give it a try. (This is a lot easier on a whiteboard!) Here's the Wikipedia example with some notations.

Let's say you're copying 20 bytes. The flow control of the program for the first pass is:

int count;                        // Set to 20
{
    int n = (count + 7) / 8;      // n is now 3.  (The "while" is going
                                  //              to be run three times.)

    switch (count % 8) {          // The remainder is 4 (20 modulo 8) so
                                  // jump to the case 4

    case 0:                       // [skipped]
             do {                 // [skipped]
                 *to = *from++;   // [skipped]
    case 7:      *to = *from++;   // [skipped]
    case 6:      *to = *from++;   // [skipped]
    case 5:      *to = *from++;   // [skipped]
    case 4:      *to = *from++;   // Start here.  Copy 1 byte  (total 1)
    case 3:      *to = *from++;   // Copy 1 byte (total 2)
    case 2:      *to = *from++;   // Copy 1 byte (total 3)
    case 1:      *to = *from++;   // Copy 1 byte (total 4)
           } while (--n > 0);     // N = 3 Reduce N by 1, then jump up
                                  //       to the "do" if it's still
    }                             //        greater than 0 (and it is)
}

Now, start the second pass, we run just the indicated code:

int count;                        //
{
    int n = (count + 7) / 8;      //
                                  //

    switch (count % 8) {          //
                                  //

    case 0:                       //
             do {                 // The while jumps to here.
                 *to = *from++;   // Copy 1 byte (total 5)
    case 7:      *to = *from++;   // Copy 1 byte (total 6)
    case 6:      *to = *from++;   // Copy 1 byte (total 7)
    case 5:      *to = *from++;   // Copy 1 byte (total 8)
    case 4:      *to = *from++;   // Copy 1 byte (total 9)
    case 3:      *to = *from++;   // Copy 1 byte (total 10)
    case 2:      *to = *from++;   // Copy 1 byte (total 11)
    case 1:      *to = *from++;   // Copy 1 byte (total 12)
           } while (--n > 0);     // N = 2 Reduce N by 1, then jump up
                                  //       to the "do" if it's still
    }                             //       greater than 0 (and it is)
}

Now, start the third pass:

int count;                        //
{
    int n = (count + 7) / 8;      //
                                  //

    switch (count % 8) {          //
                                  //

    case 0:                       //
             do {                 // The while jumps to here.
                 *to = *from++;   // Copy 1 byte (total 13)
    case 7:      *to = *from++;   // Copy 1 byte (total 14)
    case 6:      *to = *from++;   // Copy 1 byte (total 15)
    case 5:      *to = *from++;   // Copy 1 byte (total 16)
    case 4:      *to = *from++;   // Copy 1 byte (total 17)
    case 3:      *to = *from++;   // Copy 1 byte (total 18)
    case 2:      *to = *from++;   // Copy 1 byte (total 19)
    case 1:      *to = *from++;   // Copy 1 byte (total 20)
           } while (--n > 0);     // N = 1  Reduce N by 1, then jump up
                                  //       to the "do" if it's still
    }                             //       greater than 0 (and it's not, so bail)
}                                 // continue here...

20 bytes are now copied.

Note: The original Duff's Device (shown above) copied to an I/O device at the to address. Thus, it wasn't necessary to increment the pointer *to. When copying between two memory buffers you'd need to use *to++.

Hurlee answered 5/2, 2009 at 2:25 Comment(5)

How can the case 0: clause be skipped and continue checking for the other clauses which are inside the do while loop that is the argument of the skipped clause? If the only clause that is outside the do while loop is skipped, why the switch doesn't end there? – Diarrhoea 17/9, 2015 at 14:26

Don't look at the braces so hard. Don't look at the do so much. Instead, look at the switch and the while as old-fashioned computed GOTO statments or assembler jmp statements with an offset. The switch does some math and then jmps to the right place. The while does a boolean check and then blindly jmps to right about where the do was. – Hurlee 17/9, 2015 at 18:40

If this is so good , why doesn't everyone use this? Are there any drawbacks? – Zodiac 25/2, 2019 at 5:14

@Zodiac Readability. – Disc 8/1, 2020 at 10:0

@Zodiac modern compilers are able to utilize something similar called loop unrolling. – Breastbeating 18/9, 2020 at 22:58

118

The explanation in Dr. Dobb's Journal is the best that I found on the topic.

This being my AHA moment:

for (i = 0; i < len; ++i) {
    HAL_IO_PORT = *pSource++;
}

becomes:

int n = len / 8;
for (i = 0; i < n; ++i) {
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
}

n = len % 8;
for (i = 0; i < n; ++i) {
    HAL_IO_PORT = *pSource++;
}

becomes:

int n = (len + 8 - 1) / 8;
switch (len % 8) {
    case 0: do { HAL_IO_PORT = *pSource++;
    case 7: HAL_IO_PORT = *pSource++;
    case 6: HAL_IO_PORT = *pSource++;
    case 5: HAL_IO_PORT = *pSource++;
    case 4: HAL_IO_PORT = *pSource++;
    case 3: HAL_IO_PORT = *pSource++;
    case 2: HAL_IO_PORT = *pSource++;
    case 1: HAL_IO_PORT = *pSource++;
               } while (--n > 0);
}

Hoke answered 5/2, 2009 at 1:15 Comment(6)

The crucial fact here, and which made Duff's device incomprehensible to me for the longest time, is that by a quirk of C, after the first time it reaches the while, it jumps back and executes all the statements. Thus even if len%8 was 4, it will execute the case 4, case 2, case 2, and case 1, and then jump back and execute all the cases from the next loop onwards. This is the part that needs explaining, the way the loop and the switch statement "interact". – Pyroligneous 24/1, 2012 at 12:24

The Dr. Dobbs article is good however aside from the link the answer does not add anything. See Rob Kennedy's answer below which actually provides an important point about the remainder of the transfer size being handled first followed by zero or more transfer blocks of 8 bytes. In my opinion that is the key to understanding this code. – Playwriting 20/4, 2013 at 23:3

Am I missing something, or in the second code snippet len % 8 bytes will not be copied? – Engadine 30/3, 2014 at 21:54

I was stuck, forgetting that If you don't write a break statement at the end of a case's statement list, C (or any other language) will go on executing the statements. So if you're wondering why duff's device works at all, this is a crucial part of it – Pissed 29/7, 2014 at 4:24

As I read Dr. Dobbs article Duff's device is nothing more than a way to avoid the postprocessing cleanup loop for sizes which are not a multiple of the unrolling. So I guess it saves a few instructions for devices for devices which are memory constrained. But it does not seem to handle arbitrary unrolling sizes. You still have to write each case. It would be really useful to have a method where you could say "unroll n times". – Caneghem 3/9, 2014 at 21:51

@newbie: That was true, but I've now pasted the postprocessing loop in from the article. (I suspect it was mistakenly omitted because the two loops are separated by prose in the article.) – Delegation 27/9, 2018 at 3:10

There are two key things to Duff's device. First, which I suspect is the easier part to understand, the loop is unrolled. This trades larger code size for more speed by avoiding some of the overhead involved in checking whether the loop is finished and jumping back to the top of the loop. The CPU can run faster when it's executing straight-line code instead of jumping.

The second aspect is the switch statement. It allows the code to jump into the middle of the loop the first time through. The surprising part to most people is that such a thing is allowed. Well, it's allowed. Execution starts at the calculated case label, and then it falls through to each successive assignment statement, just like any other switch statement. After the last case label, execution reaches the bottom of the loop, at which point it jumps back to the top. The top of the loop is inside the switch statement, so the switch is not re-evaluated anymore.

The original loop is unwound eight times, so the number of iterations is divided by eight. If the number of bytes to be copied isn't a multiple of eight, then there are some bytes left over. Most algorithms that copy blocks of bytes at a time will handle the remainder bytes at the end, but Duff's device handles them at the beginning. The function calculates count % 8 for the switch statement to figure what the remainder will be, jumps to the case label for that many bytes, and copies them. Then the loop continues to copy groups of eight bytes.

Thirdrate answered 5/2, 2009 at 1:54 Comment(2)

This explanation makes more sense. the key for me to understand that the remainder is copied first then the rest in blocks of 8 bytes which is unusual since as mentioned most of the time, you would copy in blocks of 8 bytes and then copy the remainder. doing the remainder first is the key to understanding this algorithm. – Playwriting 20/4, 2013 at 22:55

+1 for mentioning the crazy placement/nesting of switch / while loop. Impossible to imagine coming from a language like Java... – Brython 3/2, 2014 at 7:6

The point of duffs device is to reduce the number of comparisons done in a tight memcpy implementation.

Suppose you want to copy 'count' bytes from b to a, the straight forward approach is to do the following:

  do {                      
      *a = *b++;            
  } while (--count > 0);

How many times do you need to compare count to see if it's a above 0? 'count' times.

Now, the duff device uses a nasty unintentional side effect of a switch case which allows you to reduce the number of comparisons needed to count / 8.

Now suppose you want to copy 20 bytes using duffs device, how many comparisons would you need? Only 3, since you copy eight bytes at a time except the ~~last~~ first one where you copy just 4.

UPDATED: You don't have to do 8 comparisons/case-in-switch statements, but it's reasonable a trade-off between function size and speed.

Boltonia answered 5/2, 2009 at 1:17 Comment(5)

Note that duff's device is not limited to 8 duplications in the switch statement. – Sells 5/2, 2009 at 1:29

why can't you just use instead of --count, count = count-8? and use a second loop to deal with the remainder? – Washington 5/2, 2009 at 1:36

Hhafez, you can use a second loop to deal with the remainder. But now you've twice as much code to accomplish the same thing with no speed increase. – Thirdrate 5/2, 2009 at 1:39

Johan, you have it backward. The remaining 4 bytes are copied on the first iteration of the loop, not the last. – Thirdrate 5/2, 2009 at 1:40

Not sure that it's a side effect, or nasty, or unintentional. I think it's just how switch works – Planetary 3/3, 2021 at 15:19

When I read it for the first time, I autoformatted it to this

void dsend(char* to, char* from, count) {
    int n = (count + 7) / 8;
    switch (count % 8) {
        case 0: do {
                *to = *from++;
                case 7: *to = *from++;
                case 6: *to = *from++;
                case 5: *to = *from++;
                case 4: *to = *from++;
                case 3: *to = *from++;
                case 2: *to = *from++;
                case 1: *to = *from++;
            } while (--n > 0);
    }
}

and I had no idea what was happening.

Maybe not when this question was asked, but now Wikipedia has a very good explanation

The device is valid, legal C by virtue of two attributes in C:

Relaxed specification of the switch statement in the language's definition. At the time of the device's invention this was the first edition of The C Programming Language which requires only that the controlled statement of the switch be a syntactically valid (compound) statement within which case labels can appear prefixing any sub-statement. In conjunction with the fact that, in the absence of a break statement, the flow of control will fall-through from a statement controlled by one case label to that controlled by the next, this means that the code specifies a succession of count copies from sequential source addresses to the memory-mapped output port.

The ability to legally jump into the middle of a loop in C.

Clomp answered 28/8, 2010 at 12:9 Comment(1)

Are you missing incrementing of the to pointer? – Underdeveloped 2/1 at 21:11

1: Duffs device is a particular implementation of loop unrolling. Loop unrolling is an optimisation technique applicable if you have an operation to perform N times in a loop - you can trade program size for speed by executing the loop N/n times and then in the loop inlining (unrolling) the loop code n times e.g. replacing:

for (int i=0; i<N; i++) {
    // [The loop code...] 
}

with

for (int i=0; i<N/n; i++) {
    // [The loop code...]
    // [The loop code...]
    // [The loop code...]
    ...
    // [The loop code...] // n times!
}

Which works great if N % n == 0 - no need for Duff! If that is not true then you have to handle the remainder - which is a pain.

2: How does Duffs device differ from this standard loop unrolling?
Duffs device is just a clever way of dealing with the remainder loop cycles when N % n != 0. The whole do / while executes N / n number of times as per standard loop unrolling (because the case 0 applies). On the ~~last~~ first run through the loop the case kicks in and we run the loop code the 'remainder' number of times - the remaining runs through the loop run 'normally'.

Vergievergil answered 20/6, 2013 at 8:11 Comment(1)

I got interest in Duffs device this following this question: https://mcmap.net/q/104020/-switch-case-weird-scoping so thought Id have a go at clarifing Duff - not sure if it is any improvement on existing answers... – Vergievergil 20/6, 2013 at 8:15

Though I'm not 100% sure what you're asking for, here goes...

The issue that Duff's device addresses is one of loop unwinding (as you'll no doubt have seen on the Wiki link you posted). What this basically equates to is an optimisation of run-time efficiency, over memory footprint. Duff's device deals with serial copying, rather than just any old problem, but is a classic example of how optimisations can be made by reducing the number of times that a comparison needs to be done in a loop.

As an alternative example, which may make it easier to understand, imagine you have an array of items you wish to loop over, and add 1 to them each time... ordinarily, you might use a for loop, and loop around 100 times. This seems fairly logical and, it is... however, an optimisation can be made by unwinding the loop (obviously not too far... or you may as well just not use the loop).

So a regular for loop:

for(int i = 0; i < 100; i++)
{
    myArray[i] += 1;
}

becomes

for(int i = 0; i < 100; i+10)
{
    myArray[i] += 1;
    myArray[i+1] += 1;
    myArray[i+2] += 1;
    myArray[i+3] += 1;
    myArray[i+4] += 1;
    myArray[i+5] += 1;
    myArray[i+6] += 1;
    myArray[i+7] += 1;
    myArray[i+8] += 1;
    myArray[i+9] += 1;
}

What Duff's device does is implement this idea, in C, but (as you saw on the Wiki) with serial copies. What you're seeing above, with the unwound example, is 10 comparisons compared to 100 in the original - this amounts to a minor, but possibly significant, optimisation.

Darnel answered 5/2, 2009 at 1:19 Comment(2)

You're missing the key part. It's not just about loop unwinding. The switch statement jumps into the middle of the loop. That's what makes the device look so confusing. Your loop above always performs a multiple of 10 copies, but Duff's performs any number. – Thirdrate 5/2, 2009 at 1:37

That's true - but I was attempting to simplify the description for the OP. Perhaps I didn't clear that up enough! :) – Darnel 5/2, 2009 at 8:20

Here's a non-detailed explanation which is what I feel to be the crux of Duff's device:

The thing is, C is basically a nice facade for assembly language (PDP-7 assembly to be specific; if you studied that you would see how striking the similarities are). And, in assembly language, you don't really have loops - you have labels and conditional-branch instructions. So the loop is just a part of the overall sequence of instructions with a label and a branch somewhere:

        instruction
label1: instruction
        instruction
        instruction
        instruction
        jump to label1 if some_condition

and a switch instruction is branching/jumping ahead somewhat:

        evaluate expression into register r
        compare r with first case value
        branch to first case label if equal
        compare r with second case value
        branch to second case label if equal
        etc....
first_case_label: 
        instruction
        instruction
second_case_label: 
        instruction
        instruction
        etc...

In assembly it's easily conceivable how to combine these two control structures, and when you think of it that way, their combination in C doesn't seem so weird anymore.

Medic answered 11/3, 2016 at 22:40 Comment(0)

This is an answer I posted to another question about Duff's Device that got some upvaotes before the question was closed as a duplicate. I think it provides a bit of valuable context here on why you should avoid this construct.

"This is Duff's Device. It's a method of unrolling loops that avoids having to add a secondary fix-up loop to deal with times when the number of loop iteration isn't know to be an exact multiple of the unrolling factor.

Since most answers here seem to be generally positive about it I'm going to highlight the downsides.

With this code a compiler is going to struggle to apply any optimization to the loop body. If you just wrote the code as a simple loop a modern compiler should be able to handle the unrolling for you. This way you maintain readability and performance and have some hope of other optimizations being applied to the loop body.

The Wikipedia article referenced by others even says when this 'pattern' was removed from the Xfree86 source code performance actually improved.

This outcome is typical of blindly hand optimizing any code you happen to think might need it. It prevents the compiler from doing its job properly, makes your code less readable and more prone to bugs and typically slows it down. If you were doing things the right way in the first place, i.e. writing simple code, then profiling for bottlenecks, then optimizing, you'd never even think to use something like this. Not with a modern CPU and compiler anyway.

It's fine to understand it, but I'd be surprised if you ever actually use it."

Nurse answered 23/4, 2020 at 6:52 Comment(0)

Just experimenting, found another variant getting along without interleaving switch statement and do-while-loop:

int n = (count + 1) / 8;
switch (count % 8)
{
    LOOP:
case 0:
    if(n-- == 0)
        break;
    putchar('.');
case 7:
    putchar('.');
case 6:
    putchar('.');
case 5:
    putchar('.');
case 4:
    putchar('.');
case 3:
    putchar('.');
case 2:
    putchar('.');
case 1:
    putchar('.');
default:
    goto LOOP;
}

Technically, the goto still implements a loop, but this variant might be slightly more readable.

Jurisprudence answered 20/7, 2016 at 13:24 Comment(3)

Where is your terminating condition? – Drida 13/3, 2020 at 9:53

"without interleaving switch and loop" - and then saying "switch () LOOP: case 0:" without blinking. How is that not interleaving? – Dolley 23/10, 2020 at 20:54

@Dolley Was meant as 'language construct of a loop', i. e. interleaving switch and do {} while, which appears strange and possibly difficult to understand. Admitted, wording is not too precise, as the goto construct still technically implements a loop, just intended to find a more readable variant of. – Jurisprudence 6/11, 2020 at 20:45

Here is a working example for 64-bit memcpy with Duff's Device:

#include <iostream>
#include <memory>

inline void __memcpy(void* to, const void* from, size_t count)
{
    size_t numIter = (count  + 56) / 64;  // gives the number of iterations;  bit shift actually, not division
    size_t rest = count & 63; // % 64
    size_t rest7 = rest&7;
    rest -= rest7;

    // Duff's device with zero case handled:
    switch (rest) 
    {
        case 0:  if (count < 8)
                     break;
                 do { *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 56:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 48:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 40:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 32:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 24:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 16:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 8:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
                } while (--numIter > 0);
    }

    switch (rest7)
    {
        case 7: *(((unsigned char*)to)+6) = *(((unsigned char*)from)+6);
        case 6: *(((unsigned short*)to)+2) = *(((unsigned short*)from)+2); goto case4;
        case 5: *(((unsigned char*)to)+4) = *(((unsigned char*)from)+4);
        case 4: case4: *((unsigned long*)to) = *((unsigned long*)from); break; 
        case 3: *(((unsigned char*)to)+2) = *(((unsigned char*)from)+2);
        case 2: *((unsigned short*)to) = *((unsigned short*)from); break;
        case 1: *((unsigned char*)to) = *((unsigned char*)from);
    }
}

void main()
{
    static const size_t NUM = 1024;

    std::unique_ptr<char[]> str1(new char[NUM+1]);  
    std::unique_ptr<char[]> str2(new char[NUM+1]);

    for (size_t i = 0 ; i < NUM ; ++ i)
    {
        size_t idx = (i % 62);
        if (idx < 26)
            str1[i] = 'a' + idx;
        else
            if (idx < 52)
                str1[i] = 'A' + idx - 26;
            else
                str1[i] = '0' + idx - 52;
    }

    for (size_t i = 0 ; i < NUM ; ++ i)
    {
        memset(str2.get(), ' ', NUM); 
        __memcpy(str2.get(), str1.get(), i);
        if (memcmp(str1.get(), str2.get(), i) || str2[i] != ' ')
        {
            std::cout << "Test failed for i=" << i;
        }

    }

    return;
}

It handles zero-length case (in original Duff's Device there is assumption num>0). Function main() contains simple test cases for __memcpy.

Perfumer answered 29/8, 2020 at 15:55 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags