Checking parameters of multiplication by constant in 64 bit

For my BigInteger code, output turned out to be slow for very large BigIntegers. So now I use a recursive divide-and-conquer algorithm, which still needs 2'30" to convert the currently largest known prime to a decimal string of more than 22 million digits (but only 135 ms to turn it into a hexadecimal string).

I still want to reduce the time, so I need a routine that can divide a NativeUInt (i.e. UInt32 on 32 bit platforms, UInt64 on 64 bit platforms) by 100 very fast. So I use multiplication by a constant. This works fine in 32 bit code, but I am not 100% sure for 64 bit.

So my question: is there a way to check the reliability of the results of multiplication by constant for unsigned 64 bit values? I checked the 32 bit values by simply trying with all values of UInt32 (0..$FFFFFFFF). This took approx. 3 minutes. Checking all UInt64s would take much longer than my lifetime. Is there a way to check if the parameters used (constant, post-shift) are reliable?

I noticed that DivMod100() always failed for a value like $4000004B if the chosen parameters were wrong (but close). Are there special values or ranges to check for 64 bit, so I don't have to check all values?
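Short of exhaustive search, one pragmatic smoke test is to probe only the values where floor division changes, i.e. just below and at each multiple of 100, plus the top of the range; those are exactly the places where an almost-right reciprocal constant slips. A sketch for the 32-bit parameters (the function name and sampling strategy are mine; this is a heuristic, not a proof of correctness):

```c
#include <assert.h>
#include <stdint.h>

// Returns 1 if (x * c) >> (32 + shift) == x div 100 for all sampled x.
// Samples the boundaries 100*k - 1 and 100*k, where an off-by-epsilon
// reciprocal constant is most likely to produce a wrong quotient.
static int check_params32(uint32_t c, unsigned shift) {
    for (uint64_t k = 1; k <= 42949672; k += 997) {   // sampled, not exhaustive
        uint64_t below = 100 * k - 1, at = 100 * k;
        if ((below * c) >> (32 + shift) != below / 100) return 0;
        if ((at    * c) >> (32 + shift) != at    / 100) return 0;
    }
    for (uint64_t x = 0xFFFFFF00ull; x <= 0xFFFFFFFFull; x++)  // top of range
        if ((x * c) >> (32 + shift) != x / 100) return 0;
    return 1;
}
```

With the question's 32-bit parameters ($1FFFFFFFFF div 100 + 1 = $51EB851F, post-shift 5) this passes; a wrong post-shift fails on the very first boundary pair.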

My current code:

const
{$IF DEFINED(WIN32)}
  // Checked
  Div100Const = UInt32(UInt64($1FFFFFFFFF) div 100 + 1);
  Div100PostShift = 5;
{$ELSEIF DEFINED(WIN64)}
  // Unchecked!!
  Div100Const = $A3D70A3D70A3D71;
  // i.e. UInt64(UInt128($3FFFFFFFFFFFFFFFF) div 100 + 1),
  // where UInt128 is a fictitious type.
  Div100PostShift = 2;
{$IFEND}

// Calculates X div 100 using multiplication by a constant, taking the
// high part of the 64 bit (or 128 bit) result and shifting
// right. The remainder is calculated as X - quotient * 100;
// This was tested to work safely and quickly for all values of UInt32.
function DivMod100(var X: NativeUInt): NativeUInt;
{$IFDEF WIN32}
asm
        // EAX = address of X, X is UInt32 here.
        PUSH    EBX
        MOV     EDX,Div100Const
        MOV     ECX,EAX
        MOV     EAX,[ECX]
        MOV     EBX,EAX
        MUL     EDX
        SHR     EDX,Div100PostShift
        MOV     [ECX],EDX       // Quotient

        // Slightly faster than MUL

        LEA     EDX,[EDX + 4*EDX] // EDX := EDX * 5;
        LEA     EDX,[EDX + 4*EDX] // EDX := EDX * 5;
        SHL     EDX,2             // EDX := EDX * 4; 5*5*4 = 100.

        MOV     EAX,EBX
        SUB     EAX,EDX         // Remainder
        POP     EBX
end;
{$ELSE WIN64}
asm
        .NOFRAME

        // RCX is address of X, X is UInt64 here.
        MOV     RAX,[RCX]
        MOV     R8,RAX
        XOR     RDX,RDX
        MOV     R9,Div100Const
        MUL     R9
        SHR     RDX,Div100PostShift
        MOV     [RCX],RDX      // Quotient

        // Faster than LEA and SHL

        MOV     RAX,RDX
        MOV     R9D,100
        MUL     R9
        SUB     R8,RAX
        MOV     RAX,R8         // Remainder
end;
{$ENDIF WIN32}
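As a side check (not part of the Delphi code), the commented-out UInt128 expression for the 64-bit constant can be evaluated with gcc/clang's __uint128_t:

```c
#include <assert.h>
#include <stdint.h>

// Evaluates UInt64(UInt128($3FFFFFFFFFFFFFFFF) div 100 + 1), i.e. the
// fictitious-UInt128 expression from the const block above.
uint64_t div100_const64(void) {
    __uint128_t n = ((__uint128_t)0x3 << 64) | 0xFFFFFFFFFFFFFFFFull;
    return (uint64_t)(n / 100 + 1);
}
```

This yields $A3D70A3D70A3D71, matching Div100Const in the WIN64 branch.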
Chowchow answered 30/1, 2016 at 10:35 Comment(5)
This is close to a dupe of stackoverflow.com/questions/20270596 but in any case you'll find the answer there by reading libdivide. – Laflamme
I used libdivide to generate a constant; it amounts to $1C0000000000000000 div 100 + 1 with a post-shift of 6, but the result in the high part is not n div 100. libdivide gives the results I expect for 32 bit, but perhaps I don't understand the way it is used for 64 bit. I'll experiment a little more. – Chowchow
Please provide an answer. – Laflamme
@DavidHeffernan: OK, I found how to do it properly, using libdivide.h. Apparently a shift/add step is required. Works fine now. Should I post the solution as an answer, or just edit the question? – Chowchow
OK, will post the answer. – Chowchow

As usual when writing optimized code, use compiler output for hints / starting points. It's safe to assume any optimization it makes is safe in the general case. Wrong-code compiler bugs are rare.

gcc implements unsigned 64-bit divmod with a constant of 0x28f5c28f5c28f5c3. I haven't looked in detail into generating constants for division, but there are algorithms for generating them that give known-good results (so exhaustive testing isn't needed).

The compiler's code differs from yours in a few important ways: it uses a different constant, and uses it differently.

See the comments for an analysis of what it is actually doing: it divides by 4 first, so it can use a constant that only works for dividing by 25 when the dividend is small enough. This also avoids needing an add at all, later on.

#include <stdint.h>

// rem, quot ordering takes one extra instruction
struct divmod { uint64_t remainder, quotient; }
 div_by_100(uint64_t x) {
    struct divmod retval = { x%100, x/100 };
    return retval;
}

compiles to (gcc 5.3 -O3 -mtune=haswell):

    movabs  rdx, 2951479051793528259
    mov     rax, rdi            ; Function arg starts in RDI (SysV ABI)
    shr     rax, 2
    mul     rdx
    shr     rdx, 2
    lea     rax, [rdx+rdx*4]    ; multiply by 5
    lea     rax, [rax+rax*4]    ; multiply by another 5
    sal     rax, 2              ; imul rax, rdx, 100 is better here (Intel SnB).
    sub     rdi, rax
    mov     rax, rdi
    ret
; return values in rdx:rax

Use Godbolt's "binary" option to see the constant in hex, since disassembler output shows it that way, unlike gcc's asm source output.
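For reference, gcc's divide-by-4-then-by-25 sequence can be transcribed into portable C using gcc/clang's __uint128_t (this transcription is mine, not compiler output):

```c
#include <assert.h>
#include <stdint.h>

// Mirrors the asm above: shift x right 2, multiply by ceil(2^66 / 25),
// take the high 64 bits of the product, shift right 2 more.
// The constant only works because x >> 2 < 2^62.
uint64_t div100(uint64_t x) {
    const uint64_t c = 0x28f5c28f5c28f5c3ull;   // 2951479051793528259
    uint64_t hi = (uint64_t)(((__uint128_t)(x >> 2) * c) >> 64);
    return hi >> 2;
}

uint64_t mod100(uint64_t x) {
    return x - 100 * div100(x);   // remainder, as in the struct version
}
```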


The multiply-by-100 part.

gcc uses the above sequence of lea/lea/shl, the same as in your question. Your answer uses a mov imm/mul sequence.

The code comments in each version claim that version is faster. If so, it's because of some subtle instruction alignment or other secondary effect: on Intel SnB-family, both are the same number of uops (3) with the same critical-path latency (the mov imm is off the critical path, and mul is 3 cycles).

clang uses what I think is the best bet (imul rax, rdx, 100). I thought of it before I saw that clang chose it, not that that matters. That's 1 fused-domain uop (which can execute on p1 only), still with 3c latency. So if you're latency-bound using this routine for multi-precision, it probably won't help, but it is the best choice. (If you're latency-bound, inlining your code into a loop instead of passing one of the parameters through memory could save a lot of cycles.)

imul works because you're only using the low 64b of the result. There's no 2 or 3 operand form of mul because the low half of the result is the same regardless of signed or unsigned interpretation of the inputs.
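That low-half property is easy to demonstrate in C, scaled down to 32-bit operands so the signed multiply can widen through int64_t without undefined behavior (the function name is mine):

```c
#include <assert.h>
#include <stdint.h>

// The low 32 bits of a product are identical whether the inputs are
// interpreted as signed or unsigned.
static int low_half_matches(uint32_t a, uint32_t b) {
    uint32_t lo_unsigned = (uint32_t)((uint64_t)a * b);   // product mod 2^32
    int64_t  sa = (int32_t)a, sb = (int32_t)b;            // signed views
    uint32_t lo_signed = (uint32_t)(sa * sb);             // low 32 bits
    return lo_unsigned == lo_signed;
}
```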

BTW, clang with -march=native uses mulx for the 64x64->128, instead of mul, but doesn't gain anything by it. According to Agner Fog's tables, it's one cycle worse latency than mul.


AMD has worse than 3c latency for imul r,r,i (esp. the 64b version), which is maybe why gcc avoids it. IDK how much work gcc maintainers put into tweaking costs so settings like -mtune=haswell work well, but a lot of code isn't compiled with any -mtune setting (even one implied by -march), so I'm not surprised when gcc makes choices that were optimal for older CPUs, or for AMD.

clang still uses imul r64, r64, imm with -mtune=bdver1 (Bulldozer), which saves m-ops but at a cost of 1c latency more than using lea/lea/shl. (lea with a scale>1 is 2c latency on Bulldozer).

Momus answered 31/1, 2016 at 23:25 Comment(5)
@user246408: If I'm following the comments on your answer correctly, clang and gcc probably do it this way so they don't need a 65-bit add-and-shift, right? This does look cheaper than Rudy's code. – Momus
Sorry, I've deleted my comments, because I did not look closely enough at the gcc code. Yes, it uses only a shift, not add-and-shift. The constant 0x28f5c28f5c28f5c3 has the same most significant bits as the reciprocal 1/25 (or 1/100, they are the same), only bit-shifted right by 2 bits. Given that the gcc code is correct, it uses an optimization based on the fact that the dividend, after pre-shifting right by 2 bits, is less than 0x4000000000000000. Summing up: though a general 64-bit constant for division by 25 does not exist, one exists for dividends less than 0x4000000000000000. – Sanative
@user246408: thanks for that analysis. I didn't take the time to grok the whole picture; I mostly just wanted to contribute a "look what the compiler does" answer, without taking the time to understand exactly why the compiler was able to do that. – Momus
It would be nice if you added the gcc code for div_by_25. If my analysis is correct, it should be different from div_by_100, because the add-and-shift step is needed there. – Sanative
@user246408: godbolt makes it super-easy for anyone to go and try code changes and see the same asm output that I copy/pasted. That's why I put godbolt links in all my asm-output answers. Anyway, here's a link with div_by_100 and div_by_25. You're right, it shifts by one and adds. The constant it uses is 0x47ae147ae147ae15, to save the trouble of flipping godbolt to "binary" mode to get hex. Interestingly, ICC13 uses that constant for both div_by_100 and div_by_25. Instead of multiplying by 25 or 100, it multiplies by -25 or -100. – Momus

I found the solution with libdivide.h. Here is the slightly more complicated part for Win64:

{$ELSE WIN64}
asm
        .NOFRAME

        MOV     RAX,[RCX]
        MOV     R8,RAX
        XOR     RDX,RDX
        MOV     R9,Div100Const       // New: $47AE147AE147AE15
        MUL     R9                   // Preliminary result Q in RDX

        // Additional part: add/shift

        ADD     RDX,R8               // Q := Q + X shr 1;
        RCR     RDX,1

        SHR     RDX,Div100PostShift  // Q := Q shr 6;
        MOV     [RCX],RDX            // X := Q;

        // Faster than LEA and SHL

        MOV     RAX,RDX
        MOV     R9D,100
        MUL     R9
        SUB     R8,RAX
        MOV     RAX,R8         // Remainder
end;
{$ENDIF WIN32}
Chowchow answered 30/1, 2016 at 15:42 Comment(9)
Why not (Q + X) shr 1 instead of Q + (X - Q) shr 1? – Sanative
Hmmm... I took this from libdivide.h. I guess the intermediate result might overflow for some values. I can try, though. If necessary, I can use RCR instead. Thanks for the hint. – Chowchow
@user246408: You are right. I used RCR to shift the carry back in, in case of an overflow, and it is slightly simpler and faster now. I edited the answer. – Chowchow
Interesting trick with rcr to get an extra bit of intermediate precision. It's a 3-uop instruction on Intel SnB-family (replacing three 1-uop instructions), so the change doesn't save any uops there. It's only 1 m-op on AMD, though, so it saves two macro-ops there. Note that rcr by an immediate count other than 1 is significantly slower, so it's not useful even if you could combine its right-shift with the following shr. rcr is 2c latency (Intel), and the mov wasn't on the critical path (and is zero latency anyway on IvB and later), so again it's a wash. (sub and shl r,1 are 1c.) – Momus
@PeterCordes: So MOV R9,R8; SUB R9,RDX; SHR R9,1; ADD RDX,R9 is equivalent (timing-wise, I mean; I know it gives the same result) to the above, while it avoids the "65th bit"? – Chowchow
@RudyVelthuis: on Intel SnB-family, yes, same uop count and latency. Maybe different port requirements (e.g. IvB and later don't need a port for the mov). On Pentium-M to Nehalem, it's 2 uops (still with 2c latency). On AMD (and PII/PIII), the add / rcr 1 is faster. On Silvermont, rcr 1 is 7 uops (while simple instructions are still 1). I've seen the Q + (X - Q)/2 idiom before, in C, for calculating an average while avoiding overflow/carry. Anyway, you've stumbled onto another one of those "faster on one CPU, slower on others" cases. – Momus
You don't need to zero rdx before mul. It's a write-only operand for mul, unlike div. And as I pointed out in my answer, imul rax, rdx, 100 is better than lea / lea / shl; clang even uses it. Clang's version is 9 uops (Intel Haswell) to your 12 (not counting your loads and stores or the wasted xor). – Momus
Yes. There are 2 improvements possible: (1) instead of zeroing RDX, move Div100Const into RDX and MUL RDX; (2) if RCR is slow, shift the dividend 1 (or 2) bits right and divide by 50 (or 25); there would be no overflow, so no need for RCR. – Sanative
@user and Peter: I keep forgetting that I don't have to zero out RDX before a multiplication, only before I start a division chain. – Chowchow
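The Q + (X - Q) shr 1 idiom discussed in the comments above is the general trick for halving a 65-bit sum without a carry; sketched in C (the names are mine):

```c
#include <assert.h>
#include <stdint.h>

// floor((lo + hi) / 2) for lo <= hi, without 64-bit overflow:
// lo + hi may need 65 bits, but hi - lo always fits in 64.
static uint64_t avg_floor(uint64_t lo, uint64_t hi) {
    return lo + ((hi - lo) >> 1);
}
```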

The code in @Rudy's answer results from the following steps:

  1. Write 1/100 in binary form: 0.000000(10100011110101110000);
  2. Count the leading zeroes after the binary point: S = 6;
  3. The first 72 significant bits are:

1010 0011 1101 0111 0000 1010 0011 1101 0111 0000 1010 0011 1101 0111 0000 1010 0011 1101

  4. Round to 65 bits; there is some kind of magic in how this rounding is performed; by reverse-engineering the constant from Rudy's answer, the correct rounding is:

1010 0011 1101 0111 0000 1010 0011 1101 0111 0000 1010 0011 1101 0111 0000 1010 1

  5. Remove the leading 1 bit:

0100 0111 1010 1110 0001 0100 0111 1010 1110 0001 0100 0111 1010 1110 0001 0101

  6. Write it in hexadecimal form (recovering the reverse-engineered constant):

A = 47 AE 14 7A E1 47 AE 15

  7. X div 100 = (((uint128(X) * uint128(A)) shr 64) + X) shr 7 (7 = 1 + S)
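The final formula can be checked directly in C with a 128-bit intermediate; the Q + (X - Q) shr 1 rewrite (from the comments on Rudy's answer) stands in for the asm's ADD/RCR pair, since C has no 65-bit carry (this transcription is mine):

```c
#include <assert.h>
#include <stdint.h>

// X div 100 = (((uint128(X) * A) shr 64) + X) shr 7, with the 65-bit
// sum Q + X computed as Q + ((X - Q) >> 1), followed by shr 6.
uint64_t div100_reciprocal(uint64_t x) {
    const uint64_t A = 0x47AE147AE147AE15ull;
    uint64_t q = (uint64_t)(((__uint128_t)x * A) >> 64);
    uint64_t t = q + ((x - q) >> 1);   // floor((q + x) / 2), no overflow
    return t >> 6;
}
```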

Sanative answered 30/1, 2016 at 18:15 Comment(10)
I actually did something similar for 32 bit, but simpler. I divided $100000000 by 100 (I actually started with div 25 and then shr 2), and tried this with all cardinals. When that did not succeed, I used $200000000 and one extra shift, and repeated this until I found what I needed. – Chowchow
FWIW, the rounding used seems to be quite simple: round up (away from zero). But why is the leading one bit removed? I still think that magic = $3FFFFFFFFFFFFFFFF div 100 + 1 (you have that too: $A3D70A3... etc.) and a shift of 6 should work too. I just don't know how to prove or test it reliably. – Chowchow
FWIW, their constant is $1C0000000000000000 div 100, mine is $400000000000000000 div 100. $1C = 28. So that is (28/64) * (64/100) = 28/100. Add to that 100/100 (X) and you have 128/100ths of X. Shift that right once, and you have 64/100ths. Shift that right 6 more times and you get 1/100th. ISTM that my constant ($A3D70...) should work without the add/shift and give the exact same result. Also note that 64 bit x 64 bit can never overflow into 129 bits or some such, so there is no need for trickery like (Q + (X - Q) shr 1). – Chowchow
@RudyVelthuis: the leading 1 bit is removed because it is accounted for by adding X in the final formula; the trick of adding X means that we use a 65-bit constant instead of the 64-bit one from your first attempt; a 64-bit constant is not enough to always produce the correct division result in the 64-bit case. I am also not sure that a 65-bit constant is enough in the 64-bit case. – Sanative
Hmmm... a 32-bit constant is good enough to reliably get the correct result in 32 bit. Perhaps not for all values, but apparently for div 100. I understood that X is added in. – Chowchow
FWIW, instead of dividing by 100, I could divide by 25 and then shift right by 2 afterwards. That would avoid the 65th bit and would allow an even more accurate constant. I'll have to try that. – Chowchow
Hmmm... libdivide gives the same constant value and code, and only a shift of 4 instead of 6, for division by 25. Nothing gained. – Chowchow
@RudyVelthuis - looks like 65 bits is always enough; actually an (N+1)-bit constant is always enough for N-bit division, while an N-bit constant is possible only for some divisors. – Sanative
I'm currently reading this; it seems to be the basis for the libdivide code: gmplib.org/~tege/divcnst-pldi94.pdf – Chowchow
@RudyVelthuis - Having read the article, some thoughts: sergworks.wordpress.com/2016/02/01/integer-division-by-constant – Sanative
