How do I implement an efficient 32 bit DivMod in 64 bit code

Asked 28/11, 2013 at 16:11 Answered 29/11, 2013 at 14:26

I want to use a DivMod function that operates exclusively on 32 bit operands. The implementation in the RTL returns values in 16 bit variables. Its declaration is:

procedure DivMod(Dividend: Cardinal; Divisor: Word; var Result, Remainder: Word);

So, I cannot use that since my inputs may overflow the return values.

The naive Pascal implementation looks like this:

procedure DivMod(Dividend, Divisor: Cardinal; out Quotient, Remainder: Cardinal);
begin
  Quotient := Dividend div Divisor;
  Remainder := Dividend mod Divisor;
end;

This works splendidly but performs the division twice. Since the function is called by part of my code that is in a performance bottleneck, I would like to perform the division once only. To that end I am using Serg's 32 bit DivMod from this question: Is there a DivMod that is *not* Limited to Words (<=65535)?

procedure DivMod(Dividend, Divisor: Cardinal; out Quotient, Remainder: Cardinal);
asm
        PUSH EBX
        MOV  EBX,EDX
        XOR  EDX,EDX
        DIV  EBX
        MOV  [ECX],EAX
        MOV  EBX,Remainder
        MOV  [EBX],EDX
        POP  EBX
end;

This works perfectly.

But now I would like a version of the function for 64 bit code. Note that I still want to operate on 32 bit operands, and return 32 bit values.

Should I re-write the function using 64 bit assembler, or is it sufficient to use the DivMod overload from the RTL that operates on, and returns, 64 bit values?

Specifically I would like to know if there is a performance benefit in writing 64 bit code that does 32 bit operations. Is that even possible? Or would I simply end up re-implementing the DivMod overload with UInt64 parameters? If it is worth implementing a bespoke 64 bit asm version, how would I go about doing it, noting that the operands and operations are 32 bit.

I think that it would look like this, but I am no expert and likely have got something wrong:

procedure DivMod(Dividend, Divisor: Cardinal; out Quotient, Remainder: Cardinal);
asm
        MOV   EAX,ECX   // move Dividend to EAX
        MOV   ECX,EDX   // move Divisor to ECX
        XOR   EDX,EDX   // zeroise EDX
        DIV   ECX       // divide EDX:EAX by ECX
        MOV   [R8],EAX  // save quotient
        MOV   [R9],EDX  // save remainder
end;

Stillman answered 28/11, 2013 at 16:11 Comment(23)

Seems ok. About 33% faster than the purepascal version. – Vacla 28/11, 2013 at 17:19

@LURD Do you know if DIV ECX in the code above performs a 32 bit operation? Did I understand that right? Does it do unsigned divide of the 64 bit value EDX:EXA by ECX? And is it worth doing the 32 bit operation rather than the full 64 bit operation? – Stillman 28/11, 2013 at 17:22

Replaced 64-bit tag with x86-64 tag, as the question is specific to x86-64, not to 64-bit architectures in general. – Korry 28/11, 2013 at 17:26

@Korry Thank you. I'm not familiar with tagging conventions for those tags. – Stillman 28/11, 2013 at 17:27

@DavidHeffernan, I'm in deep water with 64 bit asm. Usually I look what the disassembly shows and see if there is a better way to handle the cpu registers. In this case I just did some profiling and tested the routine with some values and they worked. – Vacla 28/11, 2013 at 17:38

@DavidHeffernan: your code looks fine and yes, dividing by ECX will perform exactly the same as in Win32. If you only use this code internally, you might be able to remove the prolog and epilog by putting a .NOFRAME directive at the top. Do not use that for "public" functions, though. – Semiotics 28/11, 2013 at 17:50

I hadn't realized that Embarcadero finally got around to supporting inline assembly for x64. Fancy that. – Garwin 28/11, 2013 at 18:23

@Garwin Supported since XE2 – Stillman 28/11, 2013 at 18:30

@DavidHeffernan No way! I've had XE2 since it came out... I guess I just got used to the fact that it wasn't there and, being Embarcadero, resigned myself to the fact that it could well be forever before it ever happened (think undo in forms designer!). I suppose it pays to read the news now and again. – Garwin 28/11, 2013 at 18:48

@RudyVelthuis The compiler is smart enough to figure out, for this function, that no stack frame is needed. – Stillman 28/11, 2013 at 19:11

Does this need to be an entirely general function (iow; is your use case restricted to any sort of set of divisors or dividends)? – Garwin 28/11, 2013 at 19:19

@Garwin as it happens I'm implementing functions to convert from integer to string and so the divisor is always 10 – Stillman 28/11, 2013 at 19:30

@DavidHeffernan Integer division can be made a lot faster if you know the divisor at compile time (by means of magic constants and trading div for cheap mul and shr). Check out : libdivide.com Also section 9.2.4 of intel.com/content/dam/www/public/us/en/documents/manuals/… Some compilers do this automatically - delphi it seems not. – Garwin 28/11, 2013 at 19:57

@Garwin I wonder if it still works out faster when you need both quotient and remainder – Stillman 28/11, 2013 at 20:7

@DavidHeffernan I feel almost certainly the answer is yes, but this is untested. If I run into some boredom fun time I'll maybe have a go and see what I come up with. At worst it's one sub and one mul - at best I feel it could be even less. – Garwin 28/11, 2013 at 20:11

@DavidHeffernan I just did a quick test. For a known divisor of 10, even the naive approach of q := Trunc(i * 0.1); r := i - 10*q is 33% faster than a single q := i div 10;. This is without any trickery or optimization whatsoever. (64-bit doubles are guaranteed to accurately represent all 32-bit integers) – Garwin 28/11, 2013 at 21:19

@Garwin 0.1 is not representable. Your code may be faster but it is wrong. Surely. – Stillman 28/11, 2013 at 21:33

@DavidHeffernan It will not be wrong in the 11th digit, which is all that matters. – Garwin 28/11, 2013 at 21:48

@Garwin What if the closest float to 0.1 happens to be less than 0.1? As it happens it is not, but you got lucky. – Stillman 28/11, 2013 at 21:49

@DavidHeffernan I'll write a test and see. In any case, I know for a fact that libdivide is a pure integer library and is considerably faster than even the bit I've used above. If it's a real bottleneck, it's worth looking into. – Garwin 28/11, 2013 at 21:56

@DavidHeffernan Conclusively, then, I can say the above is accurate for all positive 32-bit integers. If 0.1 was less than 0.1, it would be a simple matter of adding a bit to the mantissa and using the first double higher than 0.1. No luck needed, just a bit of determination. ;) – Garwin 28/11, 2013 at 22:24

Is the delphi compiler really too dumb to combine mod and div with a common divisor? – Paronym 29/11, 2013 at 14:38

@Paronym Yes it is. – Stillman 29/11, 2013 at 14:40

I dug a bit deeper. I think it would be perfectly reasonably to implement this on top of the UInt64 version. That would look like this:

procedure DivMod(Dividend, Divisor: Cardinal; out Quotient, Remainder: Cardinal);
var
  Quotient64, Remainder64: UInt64;
begin
  DivMod(Dividend, Divisor, Quotient64, Remainder64);
  Quotient := Quotient64;
  Remainder := Remainder64;
end;

I don't think the performance would be very significantly affected in comparison to the most optimal asm version.

However, I believe that the x64 asm code in the question is correct. The MOV instructions are all fine with 32 bit operands. And the DIV is also as described in the comment in the asm code. The Intel documentation for DIV r/m32 says:

Unsigned divide EDX:EAX by r/m32, with result stored in EAX ← Quotient, EDX ← Remainder.

And let's take a look at what the Delphi compiler does with this code:

var
  a, b, c, d: Cardinal;
....
a := 666;
b := 42;
c := a div b;
d := a mod b;

The code that is produced is:

    
Project39.dpr.14: a := 666;
0000000000423A68 C7450C9A020000   mov [rbp+$0c],$0000029a
Project39.dpr.15: b := 42;
0000000000423A6F C745082A000000   mov [rbp+$08],$0000002a
Project39.dpr.16: c := a div b;
0000000000423A76 8B450C           mov eax,[rbp+$0c]
0000000000423A79 33D2             xor edx,edx
0000000000423A7B F77508           div dword ptr [rbp+$08]
0000000000423A7E 894504           mov [rbp+$04],eax
Project39.dpr.17: d := a mod b;
0000000000423A81 8B450C           mov eax,[rbp+$0c]
0000000000423A84 33D2             xor edx,edx
0000000000423A86 F77508           div dword ptr [rbp+$08]
0000000000423A89 895500           mov [rbp+$00],edx

I don't have any expectation that the 32 bit divide will be more efficient than a 64 bit divide, but that doesn't really matter. It seems more natural to perform the 32 bit operation with 32 bit operands.

Stillman answered 28/11, 2013 at 18:14 Comment(2)

"Widened" DivMod prototype is ought to be (QWord, LongWord, &LongWord, &LongWord) – Intransigeance 28/11, 2013 at 23:10

@Free For my needs, 32 bit operands are what is needed. QWORD dividend fits with the hardware's capabilities. – Stillman 28/11, 2013 at 23:27

For the special case of always dividing by 10 (per comments) you can do something like this :

procedure DivMod10(num : Cardinal; var q, r : Cardinal); inline;
var
  rl : uInt64;
begin
  rl := UInt64(3435973837)*num;
  q := rl shr 35;
  r := num - q*10;
end;

The algorithm varies depending on the denominator, but the source for determining it and the magic numbers can be found in libdivide. This is tested accurate for all unsigned 32-bit integers and is about 3 times faster than using div (and provides the remainder).

Benchmark (optimizations on):

  t0 := GetTickCount;
  for I := 1 to 999999999 do begin
    DivMod10(i, q, r);
  end;
  ShowMessage(IntToStr(GetTickCount - t0));  // result :  1809

  t0 := GetTickCount;
  for I := 1 to 999999999 do begin
    q := i div 10;
  end;
  ShowMessage(IntToStr(GetTickCount - t0));  // result :  5336

Test :

for I := 1 to High(Cardinal) do begin
  DivMod10(i,q,r);
  if q <> (i div 10) then WriteLn(IntToStr(i));
  // no mismatch found
end;

Garwin answered 29/11, 2013 at 14:26 Comment(5)

+1 Nice. Thanks. I'll look into this. Of course the way you get the remainder is a neat way to avoid the second divide in the pure pascal version too. I mean: Quotient := Dividend div Divisor; Remainder := Dividend - Quotient*Divisor; – Stillman 29/11, 2013 at 14:31

@DavidHeffernan this is the straight arithmetic algorithm. Throwing a DLL wrapper around libdivide would provide a few others that use SSE/SSE2 (unswitched vector algorithm for div 10 is even faster than this!). This implementation was quick to port but there are others. – Garwin 29/11, 2013 at 15:26

Just to be clear, I'm not going to accept this answer because it doesn't address directly the question that I asked. It does produce a nice solution to the problem that I am trying to solve mind you. Hope you understand. Anyway, it's been fun!! – Stillman 30/11, 2013 at 16:4

Interestingly, the divide free version is slower in 32 bit code! That's presumably because 64 bit multiply is expensive, what with it not being supported on hardware. – Stillman 30/11, 2013 at 16:26

@DavidHeffernan Yes, I agree it does not answer the general question (although the full libdivide library does, albeit in C/C++, provide a full, optimized general runtime solution). I also noticed it was slower in 32-bit, as you say, because 64-bit IMUL is not available natively. Re-tooling it for SSE/SSE2 would regain the speed for a 32-bit compile (I'm guessing, even with the obligatory stack frame overhead for asm). Libdivide does this automatically (although for compile-time known divisors it can still be faster to hardcode the magic number rather than calculate it on the fly). – Garwin 30/11, 2013 at 17:14