x86 assembly (MASM) - Square root of a 64-bit integer?

Asked 10/1, 2012 at 19:2 Answered 28/7, 2019 at 7:24

I'm coding a simple primality tester program for Windows in x86 assembly language (MASM32), which involves calculating a square root of a (64-bit) integer. My question is: Is there any simple way for obtaining the square root? Should I use some combination of ADD/SUB/DIV/MUL instructions?

I found some information on how this could be achieved in C language, but I'm just wondering if I'm missing something here?

Airscrew answered 10/1, 2012 at 19:2 Comment(6)

You may be able to factor away the SQRT requirement by exploiting the fact that "x < SQRT(y)" is equivalent to "x*x < y". – Renita 10/1, 2012 at 19:7

@500-InternalServerError If the square root is used for determining the search range of potential factors (of a prime candidate, finding one would make it not prime), it would be much faster to do a single square root than square all of the x values in the entire range, 2 <= x < n where n is the prime candidate. – Cephalalgia 12/1, 2012 at 18:47

@Cephalalgia not really. no. You're going to search sequentially (usually in ascending order). So you find the sqrt (call it x) of the first number (call it n) then calculate and store (x+1)^2. ... keep x as the sqrt while you increase n looking for primes until n > (x+1)^2 ... then simply set x=x+1 and repeat. You only need to recalculate the sqr once for every time the ceil(sqrt(n)) changes. You only need to calculate an SQRT once to initialize your program. – Visigoth 28/8, 2013 at 22:30

using SSE instructions will be much faster than x87 – Discontinuance 22/2, 2014 at 13:20

@LưuVĩnhPhúc: no it won't, not in 32-bit code where the only hardware-supported way to convert 64-bit integers to floating point is x87 FILD/FISTP. SQRTSD is only slightly faster than FSQRT. Maybe if you can vectorize it, then SQRTPD can win for throughput. But also keep in mind that 80-bit x87 can exactly represent every int64_t, but double can't. See also my comment on the accepted answer. Or does MASM32 also let you make 64-bit binaries? Since 32x32 => 64 bit multiply is probably sufficient for this, it's not clear whether the OP is using 64-bit code or not. – Sisley 24/11, 2016 at 1:9

@Cephalalgia you can also utilize (x+2)^2 = x^2 + 4x + 4 to get the next odd square just by adding 4*x + 4 instead of checking x*x again and again in each step – Discontinuance 24/11, 2016 at 2:28

I think the simplest way is to use the FPU instruction fsqrt like this:

.data?
int64 dq ?
squareRoot dd ?

.code
fild int64        ;load the integer to ST(0)
fsqrt             ;compute square root and store to ST(0)
fistp squareRoot  ;store the result in memory (as a 32-bit integer) and pop ST(0)

Viscometer answered 10/1, 2012 at 19:5 Comment(5)

This is probably also the fastest way on modern hardware, at least for latency. In 32-bit mode, the only good way to convert a 64-bit integer to double is with x87. gcc's default -mfpmath=sse ends up bouncing the double between x87 and XMM twice: godbolt.org/g/sHDg4v much worse latency, probably increased throughput since SQRTSD has slightly better throughput than FSQRT (agner.org/optimize). (Of course, if you have multiple things to do, and you want throughput with no regard for latency, vectorize it. Probably scalar int->double conversions, then SQRTPD.) – Sisley 24/11, 2016 at 0:58

@PeterCordes but if compiled in 64-bit mode GCC will emit cvtsi2sdq; sqrtsd; cvttsd2si anyway without the need to switch back to x87 – Discontinuance 24/11, 2016 at 5:34

@LưuVĩnhPhúc: I was thinking this was a 32-bit question. Does MASM32 not imply that, or does it just mean "not 16"? – Sisley 24/11, 2016 at 9:25

@LưuVĩnhPhúc: SSE2 will be somewhat faster, but double can't exactly represent integers greater than 2^53 or so (size of mantissa). I'm not sure what that will do to the square-roots. It's possible that round-to-nearest ( SQRTSD( nearest-representable-double(i) )) will still always give you the right integer answer, but I haven't checked. This answer will always give the correct result, rounded to the nearest integer. (SSE3 FISTTP to truncate like a C cast to integer, instead of rounding with the default rounding mode.) – Sisley 24/11, 2016 at 9:31

I spent soo much time trying to understand why it's not working until I realized that I needed to put this [org 0x7c00] at the start of the script 🤦‍♂️🤦‍♂️🤦‍♂️ – Evadnee 26/9, 2022 at 19:34

Here is my original 64-bit Square root routine:

// X:int64    
// sqrt64(X)
....
  asm      
   push ESI  {preserve ESI,EDI and EBX}
   push EDI
   push EBX

   mov ESI,dword ptr X      
   mov EDI,dword ptr X+4

   xor Eax,Eax
   xor Ebx,Ebx
   mov cx,32
@next:
  // Add ESI,ESI   //old RCL ESI,1  -  Peter Cordes suggestion 
  // ADC EDI,EDI   //old RCL EDI,1     ~1.38x faster! 
  // ADC EBX,EBX   //old RCL Ebx,1     

   //Add ESI,ESI   //old RCL ESI,1
   //ADC EDI,EDI   //old RCL EDI,1
   //ADC EBX,EBX   //old RCL Ebx,1

   shld ebx, edi, 2   //-  Peter Cordes 41% speed up!
   shld edi, esi, 2
   lea esi, [esi*4]

   //mov EDX,EAX
   //Add EDX,EDX   //old shl Edx,1
   //stc
   //ADC EDX,EDX   //old RCL Edx,1    {+01}
   lea edx, [eax*4 + 1]   //-  Peter Cordes   +20% speed up

   cmp EBX,EDX   {if BX>=DX -> BX-DX}
  JC @skip
   sub EBX,EDX
@skip:
   cmc           {invert C}
   ADC EAX,EAX   //old  RCL Eax,1
   dec cx
   jnz @next     // -  Peter Cordes  +40% speed up
  //LOOP @next

   pop EBX
   pop EDI
   pop ESI

   mov result,Eax
 end;
....

Mcclelland answered 28/7, 2019 at 7:17 Comment(18)

rcl reg,1 is slower than adc reg,reg on most CPUs, especially modern AMD, and Intel Broadwell and later where adc is only 1 uop with 1c latency. (agner.org/optimize) – Sisley 28/7, 2019 at 7:28

Thanks, but I don't think I want to take the time to fully understand this code. In 32-bit mode especially, it's probably faster to just use x87 fsqrt on modern x86. I notice that your change wasn't equivalent. You replaced some RCL with ADD instead of ADC. Was that previously a bug where you were shifting in CF when you should have been shifting in zero to the low element of a 96-bit integer? – Sisley 28/7, 2019 at 16:42

Both codes work corectly in this case. When shifting is limited to size of X (here 32 loops*2=64), flag C does no matter. But.. If you want to compute more digits {for example you want to round the result} and you need one more digit after 0-bit and you continue loops then if flag C is undefined then the result may be incorrect. About speed: On mine Intel i3 3225 x87 fsqrt is 16x faster. So sorry.. – Mcclelland 28/7, 2019 at 20:38

Ok, so add is correct if you're not trying to shift bits between integers. On Intel, the loop instruction is quite slow, like 1 per 7 cycle throughput.. Looping with dec ecx / jnz is faster. Since you don't need to preserve CF across loop iterations, there's no downside even on older P6-family where that would cause a partial-flag stall inside a pure adc loop (Problems with ADC/SBB and INC/DEC in tight loops on some CPUs). – Sisley 28/7, 2019 at 20:57

Anyway, you're probably still not going to beat the FPU, especially with 32-bit code where it takes 3 registers / add + 2x adc to work with a 96-bit integer. But you could get closer. Also, like I said, adc is faster on AMD and current Intel than on your IvyBridge CPU: adc reg,reg is 2 uops / 2c latency there, vs. 1 uop / 1c latency on Broadwell/Skylake. Those later CPUs support a single uop having 3 inputs (2 registers + flags). – Sisley 28/7, 2019 at 21:0

If you're trying to save code-size by using 4-byte mov cx,32 instead of 5-byte mov ecx,32, you should instead use 3-byte lea ecx, [eax+32] (since you just xor-zeroed EAX). You weren't actually saving any net bytes because the extra operand-size prefix for dec cx eats up the 1-byte savings from the shorter immediate on mov. And if you care about performance, no reason to introduce a false dependency on the old value of CX by only writing the low 16 bits, or an LCP decoding stall on older Intel CPUs from the 16-bit immediate. TL:DR: normally prefer 32-bit operand-size. (or 8-bit) – Sisley 29/7, 2019 at 9:9

WOW ! Looping with dec ecx / jnz speed up code with 40%! Now 2^23 cicles take 592ms vs 1014ms with old loop. – Mcclelland 29/7, 2019 at 9:9

Oh, I hadn't noticed you were just repeating the same left-shifting by 1 twice. Consider using shld ebx, edi, 2 / shld edi, esi, 2 / lea esi, [esi*4] for the first shift. (SnB/IvyBridge have excellent SHLD performance: 1c latency and 0.5c throughput. Haswell and later are slower: only 1c throughput and 3c latency. So add/adc is prob. better on Broadwell and later.) I used lea with a disp32=0 instead of shl for the final shift because it can run on port 1 on IvB, unlike shl and shld which is p0 / p5 only. Also notice that shld doesn't have a data dependency chain. – Sisley 29/7, 2019 at 9:19

Also, mov EDX,EAX / add edx,edx (shift in zero) / stc/adc (shift in 1) for a total effect of shifting in a 01 can be replaced with lea edx, [eax*4 + 1] (1 uop, 1c latency). You might also want to make the update of eax branchless, IDK. If you're testing with the same value repeatedly, branch prediction probably learns the pattern and avoids branch misses. – Sisley 29/7, 2019 at 9:21

thanks for testing my improvements and editing in numbers; neat to see how much difference it makes. (But probably worth checking with perf counters how well that branch predicts; that might be the next time sink.) – Sisley 29/7, 2019 at 11:40

Peter Cordes, you're much more into assembler and computers, and I do not understand 100%. I read and corrected what I understood. Lastly, I did cycle speed tests with a value that I constantly increase. I use Borland Delphi 2005. Results for 2^23 (8.3 million) cycles: Delphi Round(Sqrt(x)) ~ 172ms; this code: ~ 570 ms For one value the processor really cheats! I'm sorry for the overwhelming comments but I do not have the right to chat. What is "perf counters"? – Mcclelland 29/7, 2019 at 19:10

perf counters = hardware performance counters. Like on Linux, perf stat ./my_program. Can x86's MOV really be "free"? Why can't I reproduce this at all? shows some examples of perf output. On Windows there's Intel VTune, I think... BTW, in English I think you mean "2^23 iterations". Your times are in milliseconds, not CPU core clock cycles. – Sisley 29/7, 2019 at 20:39

re: Sqrt being faster: yes that's expected. This takes 32 iterations, and each iteration is going to take more than 1 clock cycle. But according to Agner Fog's instruction tables, IvyBridge runs fsqrt with a throughput of one per 8 to 17 cycles. (With faster results for "simpler" values, which I think means fewer non-zero significant bits in the mantissa of the input and/or output, I'm not sure which matters. e.g. sqrt(1.0) or 0.0 should be faster than sqrt(1.2345678).) So your results make sense even if nothing is optimizing away across iterations of your repeat loop. – Sisley 29/7, 2019 at 20:43

re: using SHLD / SHLD / SHL instead of (add/adc/adc) twice: godbolt.org/z/UJqttV shows a compiler-generated example of using it on 64-bit integers in 32-bit code. The first part of your loop body is two 1-bit shifts on your 96-bit integer, using add/adc/adc. You can do both shifts at once using SHLD for extended-precision shifting. Use it to shift in 2 bits from the top of the next-lowest element. Your SnB/IvB is the only mainstream CPU where it's a huge win, though. It's slower on AMD, and it's slower on earlier and later Intel. – Sisley 29/7, 2019 at 20:50

Yes, that's what I though. So you timed 2^23 iterations (aka repeats of your function), and got the time in milliseconds for whatever frequency your CPU was running at. The time in core clock cycles would be about the same on any IvyBridge CPU, regardless of frequency. – Sisley 29/7, 2019 at 21:2

SHLD — Double Precision Shift Left - sorry, this is new for me. I was wondering what 96-bit integer you are talking about :) – Mcclelland 29/7, 2019 at 21:10

using SHLD we are closer ! final speed test: 405ms vs 172ms – Mcclelland 29/7, 2019 at 22:16

You're effectively using EBX:EDI:ESI as a 96-bit integer, or shifting bits from the top of a 64-bit integer into a 32-bit integer. I guess you figured that out, though. Ok, so 405ms down from your first version of 1014ms. That's pretty good. Is this with the same input number every time so the branch predicts perfectly? That probably makes this method look better than it would be in practical use-cases. – Sisley 29/7, 2019 at 23:52

You mean other than by using the FSQRT instruction? Sure, it's floating-point, but it should be easy enough to do the conversion. Otherwise you have to use something like Newton's Method.

Buggery answered 10/1, 2012 at 19:6 Comment(0)

There are numerous algorithms and techniques for calculating a square root of a number. In fact calculating the square root using Newton-Raphson's method is a standard assignment for Numeric Analysis students, as an example of the general case of finding the root of an equation.

Without profiling and benchmarking the code, and knowing whether you need a single or multiple square roots (SIMD calculations such as via SSE/SSE2), I would suggest you start with @Smi's answer, which uses the x87 FPU FSQRT instruction, as your baseline implementation. This does incur a load-store hit (quick summary: moving between FPU and CPU's ALU breaks caching and pipelines) which may negate the advantage of using the built-in FPU instruction.

Since you mentioned prime testing, I'm guessing that the sqrt is only used once per candidate number to determine the search range (any non-trivial factors are between 2 <= f <= sqrt(n), where n is the number being tested for primeness). If you are only testing specific numbers for primality it's okay, but for search lots of numbers you do square root for each candidate. If you are doing a "classic" test (pre- elliptic curve) it may not be worth worrying about.

Cephalalgia answered 10/1, 2012 at 19:51 Comment(2)

load-hit-store is not a thing on x86. The FPU is part of the CPU core. AMD actually recommends store/reload to move data between integer and XMM registers when optimizing for their CPUs (instead of using MOVD xmm0, eax, because it's really slow on AMD, since a pair of integer cores shares one vector/FP unit). (Agner Fog's microarch guide says he still found that MOVD contrary to AMD's advice.) – Sisley 24/11, 2016 at 1:17

I just tested on my Core2, and got an average of 7.333 cycles per load/store pair in a loop that did x87 (FILD/FISTP) then integer (MOV), then XMM (MOVQ) every iter. fild qword [rsp] / fistp qword [rsp] / mov rax, [rsp] / mov [rsp], rax / movq xmm0, [rsp] / movq [rsp], xmm0 / dec ecx / jnz .loop for 1e9 iterations took 22e9 clock cycles. So there are no huge penalties on Intel's Merom microarchitecture anyway. – Sisley 24/11, 2016 at 1:19

Finally I wrote, in addition to last function for 80386+ microprocessor's, that processes a 64 Bit unsigned-integer number's square-root, a new optimized function that works for 8086+ microprocessor, in assembly.

I've tested all these routines successfully.

They works as the first routine written by me (see at bottom of this article), but are more optimized. Because is easy, I shows an example of the 16 Bit's algorithm in Pascal as follows (half-section algorithm):

Function NewSqrt(X:Word):Word;

Var L,H,M:LongInt;

Begin
 L:=1;
 H:=X;
 M:=(X+1) ShR 1;
 Repeat
  If Sqr(M)>X Then
   H:=M
  Else
   L:=M;
  M:=(L+H) ShR 1;
 Until H-L<2;
 NewSqrt:=M;
End;

This my routine works through half-section algorithm and supports the square-root of a 64 Bit unsigned-integer number. Follows the 8086+ CPU new real-mode code:

Function NewSqrt16(XLow,XHigh:LongInt):LongInt; Assembler;

Var X0,X1,X2,X3,
    H0,H1,H2,H3,
    L0,L1,L2,L3,
    M0,M1,M2,M3:Word;

Asm

     LEA   SI,XLow
     Mov   AX,[SS:SI]
     Mov   BX,[SS:SI+2]

     LEA   SI,XHigh
     Mov   CX,[SS:SI]
     Mov   DX,[SS:SI+2]
    {------------------------
     INPUT: DX:CX:BX:AX = X
     OUPUT: DX:AX= Sqrt(X).
     ------------------------
         X0    :    X1    :    X2    :    X3    -> X
         H0    :    H1    :    H2    :    H3    -> H
         L0    :    L1    :    L2    :    L3    -> L
         M0    :    M1    :    M2    :    M3    -> M
         AX    ,    BX    ,    CX    ,    DX    ,
         ES    ,    DI    ,    SI               -> Temp. reg.
     ------------------------}
     Mov   [X0],AX           {  ...}
     Mov   [X1],BX           {  ...}
     Mov   [X2],CX           {  ...}
     Mov   [X3],DX           {  Stack <- X}

     Mov   [H0],AX           {  ...}
     Mov   [H1],BX           {  ...}
     Mov   [H2],CX           {  ...}
     Mov   [H3],DX           {  Stack <- H= X}

     Mov   [L0],Word(1)      {  ...}
     Mov   [L1],Word(0)      {  ...}
     Mov   [L2],Word(0)      {  ...}
     Mov   [L3],Word(0)      {  Stack <- L= 1}

     Add   AX,1              {  ...}
     AdC   Bx,0              {  ...}
     AdC   Cx,0              {  ...}
     AdC   Dx,0              {  X= X+1}

     RCR   Dx,1              {  ...}
     RCR   Cx,1              {  ...}
     RCR   Bx,1              {  ...}
     RCR   Ax,1              {  X= (X+1)/2}

     Mov   [M0],AX           {  ...}
     Mov   [M1],BX           {  ...}
     Mov   [M2],CX           {  ...}
     Mov   [M3],DX           {  Stack <- M= (X+1)/2}
    {------------------------}
@@LoopBegin:                 {Loop restart label}

     Mov   AX,[M3]           {If M is more ...}
     Or    AX,[M2]           {... then 32 bit ...}
     JNE   @@LoadMid         {... then Square(M)>X, jump}
    {DX:AX:CX:SI= 64 Bit square(Low(M))}
     Mov   AX,[M0]           {AX <- A=Low(M)}
     Mov   CX,AX             {CX <- A=Low(M)}

     Mul   AX                {DX:AX <- A*A}

     Mov   SI,AX             {SI <- Low 16 bit of last mul.}
     Mov   BX,DX             {BX:SI <- A*A}

     Mov   AX,[M1]           {AX <- D=High(M)}
     XChg  CX,AX             {AX <- A=Low(M); CX <- D=High(M)}

     Mul   CX                {DX:AX <- A*D=Low(M)*High(M)}

     XOr   DI,DI             {...}
     ShL   AX,1              {...}
     RCL   DX,1              {...}
     RCL   DI,1              {DI:DX:AX <- A*D+D*A= 2*A*D (33 Bit}

     Add   AX,BX             {...}
     AdC   DX,0              {...}
     AdC   DI,0              {DI:DX:AX:SI <- A*(D:A)+(D*A) ShL 16 (49 Bit)}

     XChg  CX,AX             {AX <- D=High(M); CX <- Low 16 bit of last mul.}
     Mov   BX,DX             {DI:BX:CX:SI <- A*(D:A)+(D*A) ShL 16 (49 Bit)}

     Mul   AX                {DX:AX <- D*D}

     Add   AX,BX             {...}
     AdC   DX,DI             {DX:AX:CX:SI <- (D:A)*(D:A)}
    {------------------------}
     Cmp   DX,[X3]           {Compare High(Square(M)):High(X)}

@@LoadMid:                   {Load M in DX:BP:BX:DI}

     Mov   DI,[M0]           {...}
     Mov   BX,[M1]           {...}
     Mov   ES,[M2]           {...}
     Mov   DX,[M3]           {DX:ES:BX:DI <- M}

     JA    @@SqrMIsMoreThenX {If High(Square(M))>High(X) then Square(M)>X, jump}
     JB    @@SqrMIsLessThenX {If High(Square(M))<High(X) then Square(M)<X, jump}

     Cmp   AX,[X2]           {Compare High(Square(M)):High(X)}
     JA    @@SqrMIsMoreThenX {If High(Square(M))>High(X) then Square(M)>X, jump}
     JB    @@SqrMIsLessThenX {If High(Square(M))<High(X) then Square(M)<X, jump}

     Cmp   CX,[X1]           {Compare High(Square(M)):High(X)}
     JA    @@SqrMIsMoreThenX {If High(Square(M))>High(X) then Square(M)>X, jump}
     JB    @@SqrMIsLessThenX {If High(Square(M))<High(X) then Square(M)<X, jump}

     Cmp   SI,[X0]           {Compare Low(Square(M)):Low(X)}
     JA    @@SqrMIsMoreThenX {If Low(Square(M))>Low(X) then Square(M)>X, jump}
    {------------------------}
@@SqrMIsLessThenX:           {Square(M)<=X}

     Mov   [L0],DI           {...}
     Mov   [L1],BX           {...}
     Mov   [L2],ES           {...}
     Mov   [L3],DX           {L= M}

     Jmp   @@ProcessMid      {Go to process the mid value}
    {------------------------}
@@SqrMIsMoreThenX:           {Square(M)>X}

     Mov   [H0],DI           {...}
     Mov   [H1],BX           {...}
     Mov   [H2],ES           {...}
     Mov   [H3],DX           {H= M}
    {------------------------}
@@ProcessMid:                {Process the mid value}

     Mov   SI,[H0]           {...}
     Mov   CX,[H1]           {...}
     Mov   AX,[H2]           {...}
     Mov   DX,[H3]           {DX:AX:CX:SI <- H}

     Mov   DI,SI             {DI <- H0}
     Mov   BX,CX             {BX <- H1}

     Add   SI,[L0]           {...}
     AdC   CX,[L1]           {...}
     AdC   AX,[L2]           {...}
     AdC   DX,[L3]           {DX:AX:CX:SI <- H+L}

     RCR   DX,1              {...}
     RCR   AX,1              {...}
     RCR   CX,1              {...}
     RCR   SI,1              {DX:AX:CX:SI <- (H+L)/2}

     Mov   [M0],SI           {...}
     Mov   [M1],CX           {...}
     Mov   [M2],AX           {...}
     Mov   [M3],DX           {M <- DX:AX:CX:SI}
    {------------------------}
     Mov   AX,[H2]           {...}
     Mov   DX,[H3]           {DX:AX:BX:DI <- H}

     Sub   DI,[L0]           {...}
     SbB   BX,[L1]           {...}
     SbB   AX,[L2]           {...}
     SbB   DX,[L3]           {DX:AX:BX:DI <- H-L}

     Or    BX,AX             {If (H-L) >= 65536 ...}
     Or    BX,DX             {...}
     JNE   @@LoopBegin       {... Repeat @LoopBegin else goes forward}

     Cmp   DI,2              {If (H-L) >= 2 ...}
     JAE   @@LoopBegin       {... Repeat @LoopBegin else goes forward}
    {------------------------}    
     Mov   AX,[M0]           {...}
     Mov   DX,[M1]           {@Result <- Sqrt}

End;

This function in input receives a 64 Bit number (XHigh:XLow) and return the 32 Bit square-root of it. Uses four local variables:

X, the copy of input number, subdivided in four 16 Bit packages (X3:X2:X1:X0).
H, upper limit, subdivided in four 16 Bit packages (H3:H2:H1:H0).
L, lower limit, subdivided in four 16 Bit packages (L3:L2:L1:L0).
M, mid value, subdivided in four 16 Bit packages (M3:M2:M1:M0).

Initializes the lower limit L to 1; initializes the upper limit H to input number X; initializes the mid value M to (H+1)>>1. Test if the square of M is long more then 64 Bit by verifying if (M3 | M2)!=0; if it is true then square(M)>X, sets the upper limit H to M. If it isn't true then processes the square of the lower 32 Bits of M (M1:M0) as follows:

(M1:M0)*(M1:M0)=

M0*M0+((M0*M1)<<16)+((M1*M0)<<16)+((M1*M1)<<32)=
M0*M0+((M0*M1)<<17)+((M1*M1)<<32)

If the square of lower 32 Bits of M is greater then X, sets the upper limit H to M; if the square of lower 32 Bits of M is lower or equal then X value, sets the lower limit L to M. Processes the mid value M setting it to (L+H)>>1. If (H-L)<2 then M is the square root of X, else go to test if the square of M is long more then 64 Bit and runs the follows instructions.

This my routine works through half-section algorithm and supports the square-root of a 64 Bit unsigned-integer number. Follows the 80386+ CPU new protected-mode code:

Procedure NewSqrt32(X:Int64;Var Y:Int64); Assembler;

{INPUT: EDX:EAX = X
  TEMP: ECX
 OUPUT: Y = Sqrt(X)}

Asm

     Push  EBX               {Save EBX register into the stack}
     Push  ESI               {Save ESI register into the stack}
     Push  EDI               {Save EDI register into the stack}
    {------------------------}
     Mov   EAX,Y             {Load address of output var. into EAX reg.}

     Push  EAX               {Save address of output var. into the stack}
    {------------------------}
     LEA   EDX,X             {Load address of input var. into EDX reg.}
     Mov   EAX,[EDX]         {    EAX <- Low 32 Bit of input value (X)}
     Mov   EDX,[EDX+4]       {EDX:EAX <- Input value (X)}
    {------------------------
     [ESP+ 4]:[ESP   ]        -> X
      EBX    : ECX            -> M
      ESI    : EDI            -> L
     [ESP+12]:[ESP+ 8]        -> H
      EAX    , EDX            -> Temporary registers
     ------------------------}
     Mov   EDI,1             {    EDI <-  Low(L)= 1}
     XOr   ESI,ESI           {    ESI <- High(L)= 0}

     Mov   ECX,EAX           {    ECX <- Low 32 Bit of input value (X)}
     Mov   EBX,EDX           {EBX:ECX <-       M= Input value (X)}

     Add   ECX,EDI           {    ECX <- Low 32 Bit of (X+1)}
     AdC   EBX,ESI           {EBX:ECX <-       M= X+1}

     RCR   EBX,1             {    EBX <- High 32 Bit of (X+1)/2}
     RCR   ECX,1             {EBX:ECX <-       M= (X+1)/2}
    {------------------------}
     Push  EDX               {  Stack <- High(H)= High(X)}
     Push  EAX               {  Stack <-  Low(H)=  Low(X)}

     Push  EDX               {  Stack <- High(X)}
     Push  EAX               {  Stack <-  Low(X)}
    {------------------------}    
@@LoopBegin:                 {Loop restart label}

     Cmp   EBX,0             {If M is more then 32 bit ...}
     JNE   @@SqrMIsMoreThenX {... then Square(M)>X, jump}

     Mov   EAX,ECX           {EAX     <- A= Low(M)}

     Mul   ECX               {EDX:EAX <- 64 Bit square(Low(M))}

     Cmp   EDX,[ESP+4]       {Compare High(Square(M)):High(X)}
     JA    @@SqrMIsMoreThenX {If High(Square(M))>High(X) then Square(M)>X, jump}
     JB    @@SqrMIsLessThenX {If High(Square(M))<High(X) then Square(M)<X, jump}

     Cmp   EAX,[ESP]         {Compare Low(Square(M)):Low(X)}
     JA    @@SqrMIsMoreThenX {If Low(Square(M))>Low(X) then Square(M)>X, jump}
    {------------------------}
@@SqrMIsLessThenX:           {Square(M)<=X}

     Mov   EDI,ECX           {Set Low 32 Bit of L with Low 32 Bit of M}
     Mov   ESI,EBX           {ESI:EDI <- L= M}

     Jmp   @@ProcessMid      {Go to process the mid value}
    {------------------------}
@@SqrMIsMoreThenX:           {Square(M)>X}

     Mov   [ESP+8],ECX       {Set Low 32 Bit of H with Low 32 Bit of M}
     Mov   [ESP+12],EBX      {H= M}
    {------------------------}
@@ProcessMid:                {Process the mid value}

     Mov   EAX,[ESP+8]       {EAX     <- Low 32 Bit of H}
     Mov   EDX,[ESP+12]      {EDX:EAX <- H}

     Mov   ECX,EDI           {Set Low 32 Bit of M with Low 32 Bit of L}
     Mov   EBX,ESI           {EBX:ECX <- M= L}

     Add   ECX,EAX           {Set Low 32 Bit of M with Low 32 Bit of (L+H)}
     AdC   EBX,EDX           {EBX:ECX <- M= L+H}

     RCR   EBX,1             {Set High 32 Bit of M with High 32 Bit of (L+H)/2}
     RCR   ECX,1             {EBX:ECX <- M= (L+H)/2}
    {------------------------}
     Sub   EAX,EDI           {EAX     <- Low 32 Bit of (H-L)}
     SbB   EDX,ESI           {EDX:EAX <- H-L}

     Cmp   EDX,0             {If High 32 Bit of (H-L) aren't zero: (H-L) >= 2 ...}
     JNE   @@LoopBegin       {... Repeat @LoopBegin else goes forward}

     Cmp   EAX,2             {If (H-L) >= 2 ...}
     JAE   @@LoopBegin       {... Repeat @LoopBegin else goes forward}
    {------------------------}
     Add   ESP,16            {Remove 4 local 32 bit variables from stack}
    {------------------------}
     Pop   EDX               {Load from stack the output var. address into EDX reg.}

     Mov   [EDX],ECX         {Set Low 32 Bit of output var. with Low 32 Bit of Sqrt(X)}
     Mov   [EDX+4],EBX       {Y= Sqrt(X)}
    {------------------------}
     Pop   EDI               {Retrieve EDI register from the stack}
     Pop   ESI               {Retrieve ESI register from the stack}
     Pop   EBX               {Retrieve EBX register from the stack}

End;

This my 8086+ CPU real-mode routine works through half-section algorithm and supports the square-root of a 32 Bit unsigned-integer number; is not too fast, but may be optimized:

Procedure _PosLongIMul2; Assembler;

{INPUT:

 DX:AX-> First factor (destroyed).
 BX:CX-> Second factor (destroyed).

 OUTPUT:

 BX:CX:DX:AX-> Multiplication result.

 TEMP:

 BP, Di, Si}

Asm

     Jmp   @Go

 @VR:DD    0      {COPY of RESULT     (LOW)}
     DD    0      {COPY of RESULT    (HIGH)}

 @Go:Push  BP

     Mov   BP,20H {32 Bit Op.}

     XOr   DI,DI  {COPY of first op.  (LOW)}
     XOr   SI,SI  {COPY of first op. (HIGH)}

     Mov   [CS:OffSet @VR  ],Word(0)
     Mov   [CS:OffSet @VR+2],Word(0)
     Mov   [CS:OffSet @VR+4],Word(0)
     Mov   [CS:OffSet @VR+6],Word(0)

 @01:ShR   BX,1
     RCR   CX,1

     JAE   @00

     Add   [CS:OffSet @VR  ],AX
     AdC   [CS:OffSet @VR+2],DX
     AdC   [CS:OffSet @VR+4],DI
     AdC   [CS:OffSet @VR+6],SI

 @00:ShL   AX,1
     RCL   DX,1
     RCL   DI,1
     RCL   SI,1

     Dec   BP
     JNE   @01

     Mov   AX,[CS:OffSet @VR]
     Mov   DX,[CS:OffSet @VR+2]
     Mov   CX,[CS:OffSet @VR+4]
     Mov   BX,[CS:OffSet @VR+6]

     Pop   BP

End;

Function _Sqrt:LongInt; Assembler;

{INPUT:

 Di:Si-> Unsigned integer input number X.

 OUTPUT:

 DX:AX-> Square root Y=Sqrt(X).

 TEMP:

 BX, CX}

Asm

     Jmp   @Go

 @Vr:DD    0 {+0  LOW}
     DD    0 {+4  HIGH}
     DD    0 {+8  Mid}
     DB    0 {+12 COUNT}

 @Go:Mov   [CS:OffSet @Vr],Word(0)
     Mov   [CS:OffSet @Vr+2],Word(0)

     Mov   [CS:OffSet @Vr+4],SI
     Mov   [CS:OffSet @Vr+6],DI

     Mov   [CS:OffSet @Vr+8],Word(0)
     Mov   [CS:OffSet @Vr+10],Word(0)

     Mov   [CS:OffSet @Vr+12],Byte(32)

 @02:Mov   AX,[CS:OffSet @Vr]
     Mov   DX,[CS:OffSet @Vr+2]

     Mov   CX,[CS:OffSet @Vr+4]
     Mov   BX,[CS:OffSet @Vr+6]

     Add   AX,CX
     AdC   DX,BX

     ShR   DX,1
     RCR   AX,1

     Mov   [CS:OffSet @Vr+8],AX
     Mov   [CS:OffSet @Vr+10],DX

     Mov   CX,AX
     Mov   BX,DX

     Push  DI
     Push  SI

     Call  _PosLongIMul2

     Pop   SI
     Pop   DI

     Or    BX,CX
     JNE   @00

     Cmp   DX,DI
     JA    @00
     JB    @01

     Cmp   AX,SI
     JA    @00

 @01:Mov   AX,[CS:OffSet @Vr+8]
     Mov   [CS:OffSet @Vr],AX

     Mov   DX,[CS:OffSet @Vr+10]
     Mov   [CS:OffSet @Vr+2],DX

     Dec   Byte Ptr [CS:OffSet @Vr+12]
     JNE   @02
     JE    @03

 @00:Mov   AX,[CS:OffSet @Vr+8]
     Mov   [CS:OffSet @Vr+4],AX

     Mov   DX,[CS:OffSet @Vr+10]
     Mov   [CS:OffSet @Vr+6],DX

     Dec   Byte Ptr [CS:OffSet @Vr+12]
     JNE   @02

 @03:Mov   AX,[CS:OffSet @Vr+8]
     Mov   DX,[CS:OffSet @Vr+10]

End;

This my 8086+ CPU real-mode routine returns a 32 Bit fixed-point number that is the square-root of a 16 Bit unsigned-integer input number; is not too fast, but may be optimized:

Function _Sqrt2:LongInt; Assembler;

{INPUT:

 Si-> Unsigned integer number X (unaltered).

 OUTPUT:

 AX-> Integer part of Sqrt(X).
 DX-> Decimal part of Sqrt(X).

 TEMP:

 BX, CX, DX, Di}

Asm

     Jmp   @Go

 @Vr:DD    0 {+0  LOW}
     DD    0 {+4  HIGH}
     DD    0 {+8  Mid}
     DB    0 {+12 COUNT}

 @Go:Mov   [CS:OffSet @Vr],Word(0)
     Mov   [CS:OffSet @Vr+2],Word(0)

     Mov   [CS:OffSet @Vr+4],Word(0)
     Mov   [CS:OffSet @Vr+6],SI

     Mov   [CS:OffSet @Vr+8],Word(0)
     Mov   [CS:OffSet @Vr+10],Word(0)

     Mov   [CS:OffSet @Vr+12],Byte(32)

 @02:Mov   AX,[CS:OffSet @Vr]
     Mov   DX,[CS:OffSet @Vr+2]

     Mov   CX,[CS:OffSet @Vr+4]
     Mov   BX,[CS:OffSet @Vr+6]

     Add   AX,CX
     AdC   DX,BX

     ShR   DX,1
     RCR   AX,1

     Mov   [CS:OffSet @Vr+8],AX
     Mov   [CS:OffSet @Vr+10],DX

     Mov   CX,AX
     Mov   BX,DX

     Push  SI

     Call  _PosLongIMul2

     Pop   SI

     Or    BX,BX
     JNE   @00

     Cmp   CX,SI
     JB    @01
     JA    @00

     Or    DX,AX
     JNE   @00

 @01:Mov   AX,[CS:OffSet @Vr+8]
     Mov   [CS:OffSet @Vr],AX

     Mov   DX,[CS:OffSet @Vr+10]
     Mov   [CS:OffSet @Vr+2],DX

     Dec   Byte Ptr [CS:OffSet @Vr+12]
     JNE   @02
     JE    @03

 @00:Mov   AX,[CS:OffSet @Vr+8]
     Mov   [CS:OffSet @Vr+4],AX

     Mov   DX,[CS:OffSet @Vr+10]
     Mov   [CS:OffSet @Vr+6],DX

     Dec   Byte Ptr [CS:OffSet @Vr+12]
     JNE   @02

 @03:Mov   DX,[CS:OffSet @Vr+8]
     Mov   AX,[CS:OffSet @Vr+10]

End;

Hi!

Morehouse answered 1/10, 2017 at 15:12 Comment(3)

It's cool that you're having fun solving these problems, but this is not a useful answer for anything except 8086 with no FPU. Is there an 8086 / 16-bit question you could post this on? It would be better to use meaningful label names, not @02. And why are you using static storage in the code segment? Also, if you just want to check if the product of two 32-bit numbers is > 2^32, you should use a couple mul instructions instead of calling your very slow _PosLongIMul2 that uses rotates. As a quick early-out, look at the high-half * high-half product. – Sisley 1/10, 2017 at 20:35

And BTW, or bx,bx is worse than test bx,bx. Writing to bxmakes your code slower, because it lengthens the dependency chain involving bx, uses up register-renaming resources, and also test/jcc can macro-fuse into a single test-and-branch uop but or/jcc can't. – Sisley 1/10, 2017 at 20:38

My first version of the partial solution of the problem asked was inefficient and obviously it worked on 8086+ microprocessors with no FPU. I've rewritten it for 80386+. – Morehouse 3/10, 2017 at 18:28

For a naive prime-testing loop condition, you can use the quotient from trial division to find out when your divisor has passed the sqrt: if (x / d > d) break;

Checking if a number is prime in NASM Win64 Assembly/ has more detail, including an implementation of that in x86 asm. You're doing that division anyway to check for remainder == 0, so you get the quotient for free.

In 64-bit code, normal integer division can be 64-bit. (But 64-bit operand-size for div is a lot slower than 32-bit on Intel so use 32-bit when possible.)

Sisley answered 28/7, 2019 at 7:24 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags