Fastest Integer Square Root in the least amount of instructions

Asked 29/6, 2015 at 13:52 Answered 22/6, 2019 at 16:47

I am in a need of fast integer square root that does not involve any explicit division. The target RISC architecture can do operations like add, mul, sub, shift in one cycle (well - the operation's result is written in third cycle, really - but there's interleaving), so any Integer algorithm that uses these ops and is fast would be very appreciated.

This is what I have right now and I'm thinking that a binary search should be faster, since the following loop executes 16 times every single time (regardless of the value). I haven't debugged it extensively yet (but soon), so perhaps it's possible to have an early exit there:

unsigned short int int_sqrt32(unsigned int x)
{
    unsigned short int res=0;
    unsigned short int add= 0x8000;   
    int i;
    for(i=0;i<16;i++)
    {
        unsigned short int temp=res | add;
        unsigned int g2=temp*temp;      
        if (x>=g2)
        {
            res=temp;           
        }
        add>>=1;
    }
    return res;
}

Looks like the current performance cost of the above [in the context of the target RISC] is a loop of 5 instructions (bitset, mul, compare, store, shift). Probably no space in cache to unroll fully (but this will be the prime candidate for a partial unroll [e.g. A loop of 4 rather than 16], for sure). So, the cost is 16*5 = 80 instructions (plus loop overhead, if not unrolled). Which, if fully interleaved, would cost only 80 (+2 for last instruction) cycles.

Can I get some other sqrt implementation (using only add, mul, bitshift, store/cmp) under 82 cycles?

FAQ:

Why don't you rely on the compiler to produce a good fast code?

There is no working C → RISC compiler for the platform. I will be porting the current reference C code into hand-written RISC ASM.

Did you profile the code to see if sqrt is actually a bottleneck?

No, there is no need for that. The target RISC chip is about twenty MHz, so every single instruction counts. The core loop (calculating the energy transfer form factor between the shooter and receiver patch), where this sqrt is used, will be run ~1,000 times each rendering frame (assuming it will be fast enough, of course), up to 60,000 per second, and roughly 1,000,000 times for whole demo.

Have you tried to optimize the algorithm to perhaps remove the sqrt?

Yes, I did that already. In fact, I got rid of 2 sqrts already and lots of divisions (removed or replaced by shifting). I can see a huge performance boost (compared to the reference float version) even on my gigahertz notebook.

What is the application?

It's a real-time progressive-refinement radiosity renderer for the compo demo. The idea is to have one shooting cycle each frame, so it would visibly converge and look better with each rendered frame (e.g. Up 60-times per second, though the SW rasterizer won't probably be that fast [but at least it can run on the other chip in parallel with the RISC - so if it takes 2-3 frames to render the scene, the RISC will have worked through 2-3 frames of radiosity data, in parallel]).

Why don't you work directly in target ASM?

Because radiosity is a slightly involved algorithm and I need the instant edit & continue debugging capability of Visual Studio. What I've done over the weekend in VS (couple hundred code changes to convert the floating-point math to integer-only) would take me 6 months on the target platform with only printing debugging".

Why can't you use a division?

Because it's 16-times slower on the target RISC than any of the following: mul, add, sub, shift, compare, load/store (which take just 1 cycle). So, it's used only when absolutely required (a couple times already, unfortunately, when shifting could not be used).

Can you use look-up tables?

The engine needs other LUTs already and copying from main RAM to RISC's little cache is prohibitively expensive (and definitely not each and every frame). But, I could perhaps spare 128-256 Bytes if it gave me at least a 100-200% boost for sqrt.

What's the range of the values for sqrt?

I managed to reduce it to mere unsigned 32-bit int (4,294,967,295)

EDIT1: I have ported two versions into the target RISC ASM, so I now have an exact count of ASM instructions during the execution (for the test scene).
Number of sqrt calls: 2,800.
Method1: The same method in this post (loop executing 16 times)
Method2: fred_sqrt (3c from http://www.azillionmonkeys.com/qed/sqroot.html)

Method1: 152.98 instructions per sqrt
Method2: 39.48 instructions per sqrt (with Final Rounding and 2 Newton iterations)
Method2: 21.01 instructions per sqrt (without Final Rounding and 2 Newton iterations)

Method2 uses LUT with 256 values, but since the target RISC can only use 32-bit access within its 4 KB cache, it actually takes 256*4 = 1 KB. But given its performance, I guess I will have to spare that 1 KB (out of 4).

Also, I have found out that there is NO visible visual artifact when I disable the Final rounding and two Newton iterations at the end (of Method2). Meaning, the precision of that LUT is apparently good enough. Who knew...
The final cost is then 21.01 instructions per sqrt, which is almost ~order of magnitude faster than the very first solution. There's also possibility of reducing it further by sacrificing few of the 32 available registers for the constants used for the conditions and jump labels (each condition must fill 2 registers - one with the actual constant (only values less than 16 are allowed within CMPQ instruction, larger ones must be put into register) we are comparing against and second for the jump to the else label (the then jump is fall-through), as the direct relative jump is only possible within ~10 instructions (impossible with such large if-then-else chain, other than innermost 2 conditions).

EDIT2: ASM micro-optimizations
While benchmarking, I added counters for each of the 26 If.Then.Else codeblocks, to see if there aren't any blocks executed most often. Turns out, that Blocks 0/10/11 are executed in 99.57%/99.57%/92.57% of cases. This means I can justify sacrificing 3 registers (out of 32) for those comparison constants (in those 3 blocks), e.g. r26 = $1.0000 r25 = $100.0000 r24 = $10.0000
This brought down the total instruction cost from 58,812 (avg:21.01) to 50,448 (avg:18.01)

So, now the average sqrt cost is just 18.01 ASM instructions (and there is no division!), though it will have to be inlined.

EDIT3: ASM micro-optimizations
Since we know that those 3 blocks (0/10/11) are executed most often, we can use local short jumps (16 Bytes in both directions, which is usually just couple of instructions (hence mostly useless), especially when the 6-byte MOVEI #jump_label, register is used during conditions) in those particular conditions. Of course, the Else condition will then incur additional 2 ops (that it would not have otherwise), but that's worth it. The block 10 will have to be swapped (Then block with Else block), which will make it harder to read and debug, but I documented the reasons profusely.

Now the total instruction cost (in test scene) is just 42,500 with an average of 15.18 ASM instructions per sqrt.

EDIT4: ASM micro-optimizations
Block 11 condition splits into innermost Blocks 12&13. It just so happens that those blocks don't need additional +1 math op, hence the local short jump can actually reach the Else block, if I merge bitshift right with the necessary bitshift left #2 (as all offsets within cache must be 32-bit). This saves on filling the jump register though I do need to sacrifice one more register r23 for the comparison value of $40.000.

The final cost is then 34,724 instructions with an average of 12.40 ASM instructions per sqrt.

I am also realizing that I could reshuffle the order of conditions (which will make the other range few ops more expensive, but that's happening for ~7% of cases only), favoring this particular range ($10.000, $40.000) first, saving on at least 1 or maybe even 2 conditions.
In which case, it should fall down to ~8.40 per sqrt.
I am realizing that the range depends directly on intensity of the light and the distance to the wall. Meaning, I have direct control over the RGB value of the light and distance from the wall. And while I would like the solution to be as generic as possible, given this realization (~12 ops per sqrt is mind-blowing), I will gladly sacrifice some flexibility in light colors if I can get sqrt this fast. Besides, there's maybe 10-15 different lights in whole demo, so I can simply find color combinations that eventually result in same sqrt range, but will get insanely fast sqrt. For sure, that's worth it to. And I still have a generic fallback (spanning entire int range) working just fine. Best of both worlds, really.

Inclose answered 29/6, 2015 at 13:52 Comment(20)

If your target is ARM or ARM-like you might want to have a look at finesse.demon.co.uk/steven/sqrt.html for ideas - I'm pretty sure when I used to program ARM in hand written assembly (a long while ago) there was an example of how to do this (along with divide) in the Archimedes manual. I would doubt binary search is faster than Newton Raphson or similar. – Glanders 29/6, 2015 at 13:57

A long time ago in a galaxy far away when I was coding silly intros for 80386, I used the Newton algorithm. It still requires division but if you choose a smart starting value, not too many. – Coriss 29/6, 2015 at 14:7

if you have acess to floats.. how about a simple aproximation en.wikipedia.org/wiki/… – Oogenesis 29/6, 2015 at 14:20

I did some tests with the Newton-Raphson method and unfortunately even if you choose your starting value to be the square root of the highest order bit you'll have to do 4 iterations on average. Which is 64 cycles spent with division alone, so it's a non-starter. – Coriss 29/6, 2015 at 14:57

I'm going out on a limb there but what is the distribution of your numbers? If many of them are small, maybe it's worth counting how many bits the result will have first and only run the loop that many iterations. (Of course this only applies if you've got clz or something similar as a 1 cycle instruction.) – Coriss 29/6, 2015 at 15:4

Am I the only one to wonder why the answers come before the questions in the FAQ? – Glanders 29/6, 2015 at 16:19

@Oogenesis You missed the part where I explain I switched from floats to ints. – Inclose 29/6, 2015 at 17:52

@Coriss I had an algorithm that used divisions (first) but replaced it with my current one - for exactly same reason that you figured out. As for FAQ order - I find it better to prime the brain by reading response before the question :-) – Inclose 29/6, 2015 at 17:56

@Coriss You bring up a good point. Yesterday I got an idea that I can use the geometric information to help with reducing the number of loop iterations. While the form factors and square roots vary greatly across whole dataset, each wall is separate <1,024 to 32,768> form factors.I suspect that the number of loop executions for sqrt will be very similar and it might be perhaps hardcoded or specified per wall. Of course, on top of reference version that calculates every texel, I got an interpolated codepath that reduces it by 16:1, 256:1, 1024:1 (but that magnifies the integer inaccuracies). – Inclose 29/6, 2015 at 18:4

@Glanders That is an excellent link ! While my RISC is not an ARM, this will serve as a great comparison point. Even a rewrite should be easy - it's all just few basic instructions anyway. I merely glimpsed over it, but there was something that was 51 cycles, so I am definitely going to give this link a much deeper read-through. 51 cycles would be a fantastic improvement over my 82 (which is still just an estimate, based on my early uncompiled RISC drafts). – Inclose 29/6, 2015 at 18:13

You said divison is something you would like to avoid; I am unfamiliar how mod operator is implemented in assembly, but is it something you would like to avoid as well? – Featherston 30/6, 2015 at 0:8

Do you need the exact square root? It might be possible to knock off a few instructions if you are willing to settle for an approximate square root. As a crude example, finding the largest bit of the square root can be done very quickly and would give you an approximation within a factor of 2 of the correct answer. And this idea could be generalized, e.g., by finding only the 8 most significant bits of the square root, thereby cutting the computation time almost in half. – Halcomb 30/6, 2015 at 12:17

@AlexandruBarbarosie AFAIK mod is typically the same instruction (internally) as division. In x86 assembly there is no modulo instruction at all, only div, which places the result in one register and the modulo in another. – Coriss 30/6, 2015 at 15:21

@3DCoder Does this architecture have a CLZ (count leading zeros) instruction or something similar? – Wollis 30/6, 2015 at 15:43

@3DCoder Does the architecture offer a MULHI instruction (returns upper 32 bits of full 64-bit product of two 32-bit operands)? – Wollis 30/6, 2015 at 15:51

@Wollis The RISC is 32-bit - it works with 16-bit operands only for math operations (with results being 32-bit). But you can access any portion of the target register (be it upper/lower 8,16 bits). No instructions like you mention, and no modulo. Only basic add/sub/mult/shift and div (which costs 16 cycles). The closest I found is SAT24 saturate instruction - saturate the 32-bit signed integer to 24-bit unsigned. – Inclose 30/6, 2015 at 16:46

@Halcomb Great question. While it's immediately obvious that the vector calculations will become much less precise (as we are using imprecise sqrt), I suppose it's worth testing it out to see how the final computed radiosity will actually look like. Because while it might not be physically super-accurate anymore, for the purposes of the intro/demo, it should not matter (if it still looks good enough - who cares, right ?).If I remember correctly, there was one algorithm that was working, like, 3-4 bits of precision, at a time, so it should be easily tweakable to get only few iterations out of it. – Inclose 30/6, 2015 at 16:52

I gave this idea (of adjustable sqrt precision) a bit thought and I'm pretty sure it would be an improvement over my original 80-op sqrt. But now, that I got the ~18-op sqrt (average op cost over 5,120 calls to sqrt), I think it's going to be impossible to beat that (I'd get - what - about 5-6 bits of precision for 18 ops ?), plus it would be lower precision (and the 18-op solution is a full precision - I placed a breakpoint in a condition comparing 2 sqrt methods-it never got there). It's still great value for other applications, where I might not be able to spare 256 Bytes of LUT(for sqrt). – Inclose 30/6, 2015 at 17:50

instruction count isn't a performance measure. Some instructions can be tens or hundreds of times slower than others, like div or fsincos... You need to measure the whole snippet instead of checking which one has the least number of instructions – Minimal 10/4, 2019 at 16:15

@phuclv: Incorrect for the target platform. As I said in past, only DIV takes 16 cycles. All other used ops take exactly 3 cycles each (but if you inline and rearrange them carefully, you can get a throughput of 1 op per cycle). fsincos ? Seriously ? My RISC barely has LOAD/ADD/MUL/Bitshift :) – Inclose 29/6, 2020 at 8:27

Have a look here.

For instance, at 3(a) there is this method, which is trivially adaptable to do a 64->32 bit square root, and also trivially transcribable to assembler:

/* by Jim Ulery */
static unsigned julery_isqrt(unsigned long val) {
    unsigned long temp, g=0, b = 0x8000, bshft = 15;
    do {
        if (val >= (temp = (((g << 1) + b)<<bshft--))) {
           g += b;
           val -= temp;
        }
    } while (b >>= 1);
    return g;
}

No divisions, no multiplications, bit shifts only. However, the time taken will be somewhat unpredictable particularly if you use a branch (on ARM RISC conditional instructions would work).

In general, this page lists ways to calculate square roots. If you happen to want to produce a fast inverse square root (i.e. x**(-0.5) ), or are just interested in amazing ways to optimise code, take a look at this, this and this.

Glanders answered 29/6, 2015 at 16:12 Comment(17)

I'm still trying to find something I am trying to remember which was a fantastic estimation of a square root in a few instructions that was used on very early games. – Glanders 29/6, 2015 at 16:12

Found it, but it's inverse square roots. Link added for amusement value. – Glanders 29/6, 2015 at 16:18

Your code seems almost identical to mine. Except it's reorganized a bit (a while loop instead of for) and it looks like it has at one additional operation per loop. But there's too much on one line. Remember - I am going to convert to ASM, so no compiler magic (or avoiding the temporaries) will help. As for the inverse sq.root - yes - I'm aware of the trick popularized by Quake3, but it's of no use for integer math. – Inclose 29/6, 2015 at 17:41

However, your first link provides a method (section 3c) with a small LUT which has the potential to spend much less total time across all square roots (due to early exits). That is very intriguing ! I can modify it to keep increasing some global variable named TotalOps (after each cmp/add/bitshift operation), and see the grand total after I process the test dataset. True, it could, on average spend more time than 80 ops, but we won't know until I actually benchmark it. I just might get lucky ! Remember - only the total number of ops (which equal to RISC ASM instructions) is what matters. – Inclose 29/6, 2015 at 17:47

@3DCoder right; I couldn't resist posting the Quake3 thing - without knowledge of your platform, I was wondering whether a switch back to floating point might be faster if you can do the entire operation in a couple of instructions (assuming you want the inverse). – Glanders 29/6, 2015 at 18:44

So, I used the method in the section 3c with the small LUT (256 B) and did the "profiling" estimate using global variable TotalOps that tries to estimate the total number of RISC instructions - e.g. It doesn't include only the math instructions, but also assignments (1 op), conditions (2 ops for if, 1 op for else), and loops (2 ops). For test data I used a cube with 5 walls, each 32x32 = 1,024 texels, so 5,120 sqrts in total. My current method used 607,248 ops (118 on average), while the LUT method used just 91,704 (18 on average)! That's a factor of 6.62x ! Early exits are truly amazing ! – Inclose 29/6, 2015 at 21:53

This is big. The rammifications of the square root in 18 instructions are massive. Now, all kinds of effects, that I was avoiding due to the need of normalizing 3 vectors (the visuals of which actually depend on sqrt, so you can't just remove it), have become possible. Most of these effects have very little math (5-10 ops), besides the 3 vector normalizations (per pixel) - so the sqrt was always the primary reason for low performance. Not anymore :-) – Inclose 29/6, 2015 at 22:17

@3DCoder: If the context is vector normalization, wouldn't you want to compute the reciprocal square root instead (in some suitable fixed-point format)? Otherwise vector normalization would involve expensive division, would it not? – Wollis 30/6, 2015 at 18:13

@Wollis Yes,for this app, the context is vector normalization (though,I will now use it elsewhere) and I am using 3 expensive divisions (right after I get the sqrt). But I cannot really use fixed.point math on the RISC, only integer math, in which case the reciprocal square root seems pointless, as by definition you cannot have an unsigned integer number smaller than 1 (as you can't do div by 0). Perhaps there's a normalization equation for integer vectors without sqrt/div? I can try to look into it more. Of course,there's 50 links for floating point,I just could not find anything for integer. – Inclose 30/6, 2015 at 18:44

Now, I am actually doing a bit of fixed point math, as I am multiplying by 1,024 (by left-shifting) and dividing by 1,024 (by right-shifting), but that's about the extent of the fixed-point math I am willing to do on the RISC - as it's incredibly cheap (just 1 op). Once you want more precision and do actual fixed point math, then many more instructions are involved. Some time ago I did a full fixed-point library in C, but using it on the RISC looks problematic to me. Do you think it could result in less number of ops than the integer math ? – Inclose 30/6, 2015 at 19:0

A hundred years ago I did a lot of ARM32 fixed point arithmetic, mostly 16.16 (in one 32 bit register) and a little 32.32 (with two registers). Single register fixed point arithmetic is very little difference speed-wise to integer arithmetic (e.g. add is identical, mul requires an extra barrel shift or two). Your main challenge is keeping precision whilst avoiding overflow - that should be far easier with 64 bit registers using (e.g.) 32.32. – Glanders 30/6, 2015 at 19:32

I too have used 16.16 fixed-point computation on ARM32 (CPUs running about 200 MHz at the time), in the context of 3D graphics, and for vector normalization it is worth looking into, computing via rsqrt() instead of sqrt(). – Wollis 30/6, 2015 at 21:6

@Glanders Great point. When deciding on implementation there were 3 approaches - integer, fixed-point and using 2 registers to double the precision. I chose the first one, but am getting a feeling that fixed point might have resulted in less ops. Though, that could truly only be measured all 3 codepaths are written in actual ASM. I had my fair share of handling the two-register math couple decades ago and was hoping to avoid that, but it just might be the fastest solution (no re-normalization ops, no precision loss, but convoluted code). That's probably worth a separate question in itself. – Inclose 1/7, 2015 at 11:5

@Wollis Yeah, if I chose the fixed-point path, then rsqrt would have been an option. I suspect I should go and write another codepath (or just dig up the fixed-point class I wrote some time ago). As an aside, yesterday I managed to approximate the normalization, which got rid of 40% of ops of inner loop (no sqrt, no 3 divs, no renormalizations, no adds, no muls). Downside is that it has a direct visual impact, so it's not really a full solution. – Inclose 1/7, 2015 at 11:10

@3DCoder Out of interest, which RISC architecture is this that has no floating point (or no fast floating point)? I'd assumed my expertise here were long dead and buried. – Glanders 1/7, 2015 at 14:44

@Glanders Oh, this failed console is long dead and buried. no question about that :-) I do, however, like to take a challenge with a fixed HW and it's my childhood favourite brand, so I'm curious how far I can push it in 3D. – Inclose 1/7, 2015 at 16:54

Is the CPU a Hitachi SH-1 by any chance? Being specific about the CPU may lead to better answers by letting answers take advantage of specific features of the target architecture. I worked with an SH-4 in the early 2000s. – Wollis 7/7, 2015 at 22:0

This is the same as yours, but with fewer ops. (I count 9 ops in the loop in your code, including test and increment i in the for loop and 3 assignments, but perhaps some of those disappear when coded in ASM? There are 6 ops in the code below, if you count g*g>n as two (no assignment)).

int isqrt(int n) {
  int g = 0x8000;
  int c = 0x8000;
  for (;;) {
    if (g*g > n) {
      g ^= c;
    }
    c >>= 1;
    if (c == 0) {
      return g;
    }
    g |= c;
  }
}

I got it here. You can maybe eliminate a comparison if you unroll the loop and jump to the appropriate spot based on the highest non-zero bit in the input.

Update

I've been thinking more about using Newton's method. In theory, the number of bits of accuracy should double for each iteration. That means Newton's method is much worse than any of the other suggestions when there are few correct bits in the answer; however, the situation changes where there are a lot of correct bits in the answer. Considering that most suggestions seem to take 4 cycles per bit, that means that one iteration of Newton's method (16 cycles for division + 1 for addition + 1 for shift = 18 cycles) is not worthwhile unless it gives over 4 bits.

So, my suggestion is to build up 8 bits of the answer by one of the suggested methods (8*4 = 32 cycles) then perform one iteration of Newton's method (18 cycles) to double the number of bits to 16. That's a total of 50 cycles (plus maybe an extra 4 cycles to get 9 bits before applying Newton's method, plus maybe 2 cycles to overcome the overshoot occasionally experienced by Newton's method). That's a maximum of 56 cycles which as far as I can see rivals any of the other suggestions.

Second Update

I coded the hybrid algorithm idea. Newton's method itself has no overhead; you just apply and double the number of significant digits. The issue is to have a predictable number of significant digits before you apply Newton's method. For that, we need to figure out where the most significant bit of the answer will appear. Using a modification of the fast DeBruijn sequence method given by another poster, we can perform that calculation in about 12 cycles in my estimation. On the other hand, knowing the position of the msb of the answer speeds up all methods (average, not worst case), so it seems worthwhile no matter what.

After calculating the msb of the answer, I run a number of rounds of the algorithm suggested above, then finish it off with one or two rounds of Newton's method. We decide when to run Newton's method by the following calculation: one bit of the answer takes about 8 cycles according to calculation in the comments; one round of Newton's method takes about 18 cycles (division, addition, and shift, and maybe assignment), so we should only run Newton's method if we're going to get at least three bits out of it. So for 6 bit answers, we can run the linear method 3 times to get 3 significant bits, then run Newton's method 1 time to get another 3. For 15 bit answers, we run the linear method 4 times to get 4 bits, then Newton's method twice to get another 4 then another 7. And so on.

Those calculations depend on knowing exactly how many cycles are required to get a bit by the linear method vs. how many are required by Newton's method. If the "economics" change, e.g., by discovering a faster way to build up bits in a linear fashion, the decision of when to invoke Newton's method will change.

I unrolled the loops and implemented the decisions as switches, which I hope will translate into fast table lookups in assembly. I'm not absolutely sure that I've got the minimum number of cycles in each case, so maybe further tuning is possible. E.g., for s=10, you can try to get 5 bits then apply Newton's method once instead of twice.

I've tested the algorithm thoroughly for correctness. Some additional minor speedups are possible if you're willing to accept slightly incorrect answers in some cases. At least two cycles are used after applying Newton's method to correct an off-by-one error that occurs with numbers of the form m^2-1. And a cycle is used testing for input 0 at the beginning, as the algorithm can't handle that input. If you know you're never going to take the square root of zero you can eliminate that test. Finally, if you only need 8 significant bits in the answer, you can drop one of the Newton's method calculations.

#include <inttypes.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

uint32_t isqrt1(uint32_t n);

int main() {
  uint32_t n;
  bool it_works = true;
  for (n = 0; n < UINT32_MAX; ++n) {
    uint32_t sr = isqrt1(n);
    if ( sr*sr > n || ( sr < 65535 && (sr+1)*(sr+1) <= n )) {
      it_works = false;
      printf("isqrt(%" PRIu32 ") = %" PRIu32 "\n", n, sr);
    }
  }
  if (it_works) {
    printf("it works\n");
  }
  return 0;
}

/* table modified to return shift s to move 1 to msb of square root of x */
/*
static const uint8_t debruijn32[32] = {
    0, 31, 9, 30, 3,  8, 13, 29,  2,  5,  7, 21, 12, 24, 28, 19,
    1, 10, 4, 14, 6, 22, 25, 20, 11, 15, 23, 26, 16, 27, 17, 18
};
*/

static const uint8_t debruijn32[32] = {
  15,  0, 11, 0, 14, 11, 9, 1, 14, 13, 12, 5, 9, 3, 1, 6,
  15, 10, 13, 8, 12,  4, 3, 5, 10,  8,  4, 2, 7, 2, 7, 6
};

/* based on CLZ emulation for non-zero arguments, from
 * https://mcmap.net/q/452054/-how-to-count-leading-zeros-in-a-32-bit-unsigned-integer-closed
 */
uint8_t shift_for_msb_of_sqrt(uint32_t x) {
  x |= x >>  1;
  x |= x >>  2;
  x |= x >>  4;
  x |= x >>  8;
  x |= x >> 16;
  x++;
  return debruijn32 [x * 0x076be629 >> 27];
}

uint32_t isqrt1(uint32_t n) {
  if (n==0) return 0;

  uint32_t s = shift_for_msb_of_sqrt(n);
  uint32_t c = 1 << s;
  uint32_t g = c;

  switch (s) {
  case 9:
  case 5:
    if (g*g > n) {
      g ^= c;
    }
    c >>= 1;
    g |= c;
  case 15:
  case 14:
  case 13:
  case 8:
  case 7:
  case 4:
    if (g*g > n) {
      g ^= c;
    }
    c >>= 1;
    g |= c;
  case 12:
  case 11:
  case 10:
  case 6:
  case 3:
    if (g*g > n) {
      g ^= c;
    }
    c >>= 1;
    g |= c;
  case 2:
    if (g*g > n) {
      g ^= c;
    }
    c >>= 1;
    g |= c;
  case 1:
    if (g*g > n) {
      g ^= c;
    }
    c >>= 1;
    g |= c;
  case 0:
    if (g*g > n) {
      g ^= c;
    }
  }

  /* now apply one or two rounds of Newton's method */
  switch (s) {
  case 15:
  case 14:
  case 13:
  case 12:
  case 11:
  case 10:
    g = (g + n/g) >> 1;
  case 9:
  case 8:
  case 7:
  case 6:
    g = (g + n/g) >> 1;
  }

  /* correct potential error at m^2-1 for Newton's method */
  return (g==65536 || g*g>n) ? g-1 : g;
}

In light testing on my machine (which admittedly is nothing like yours), the new isqrt1 routine runs about 40% faster on average than the previous isqrt routine I gave.

Mishmash answered 29/6, 2015 at 20:47 Comment(4)

Interesting idea. I'm just unsure what would be the overhead of mixing two different methods. You seem to assume there would be zero overhead to mix them ? Not a single operation ? – Inclose 1/7, 2015 at 12:10

I'm going over your code right now. You did not explicitly write the assignments in C (the language allows you to chain multiple temporaries), but you can't avoid them in ASM. Each ASM instruction takes 2 operands-Some can take the 3rd-a small constant (range 0-16). So, your first line if (gg > n)* is actually 3 ops : assign,mul,compare. Next line is 1 op, then 1 op Then condition, which could actually be removed from ASM, as bitshift will set the Z-flag, so I'd just do a single-op jump (to a label with rts), then 1 op for logical or and finally 1 op for loop (jump to @LoopStart). That's 8 – Inclose 1/7, 2015 at 12:28

Now if I apply the same rules to my initial method, then 1st line takes 2 ops, 2nd line 2 ops, 3rd line 1 op, 4th line 1 op, 5th line 1 op, and loop overhead 2 ops (decrement loop register, jump on Z). Total:9 ops. So, it looks like you shaved off 1 op, which is cool ! Thanks ! Of course, if we're down to that level, we must count each comparison as 2 ops (compare, jump), so that raises the total instruction count for both solutions. It looks like you combined loop counter with the sqrt logic, which is neat in ASM as we can avoid the condition if (c==0) altogether by jumping on Z-flag. – Inclose 1/7, 2015 at 12:39

If you want isqrt to be exact, Newton's method itself has an overhead of two ops at the end to correct a potential off-by-one error. However, you have to start non-Newton part at the right place, otherwise for small numbers it will generate leading zeros instead of significant bits. That means you need a CLZ or similar instruction. The bonus is that once you know how many bits are in the answer, you can use that information to adjust the number of ops, halving the number of steps on average. I'm working on an implementation ... everything done except the CLZ simulation. – Mishmash 2/7, 2015 at 1:5

If multiplication is the same speed (or faster than!) addition and shifting, or if you lack a fast shift-by-amount-contained-in-a-register instruction, then the following will not be helpful. Otherwise:

You're computing temp*temp afresh on each loop cycle, but temp = res | add, which is the same as res + add since their bits don't overlap, and (a) you have already computed res*res on a previous loop cycle, and (b) add has special structure (it's always just a single bit). So, using the fact that (a+b)^2 = a^2 + 2ab + b^2, and you already have a^2, and b^2 is just a single bit shifted twice as far to the left as the single-bit b, and 2ab is just a left-shifted by 1 more position than the location of the single bit in b, you can get rid of the multiplication:

unsigned short int int_sqrt32(unsigned int x)
{
    unsigned short int res = 0;
    unsigned int res2 = 0;
    unsigned short int add = 0x8000;   
    unsigned int add2 = 0x80000000;   
    int i;
    for(i = 0; i < 16; i++)
    {
        unsigned int g2 = res2 + (res << i) + add2;
        if (x >= g2)
        {
            res |= add;
            res2 = g2;
        }
        add >>= 1;
        add2 >>= 2;
    }
    return res;
}

Also I would guess that it's a better idea to use the same type (unsigned int) for all variables, since according to the C standard, all arithmetic requires promotion (conversion) of narrower integer types to the widest type involved before the arithmetic operation is performed, followed by subsequent back-conversion if necessary. (This may of course be optimised away by a sufficiently intelligent compiler, but why take the risk?)

Ladyfinger answered 29/6, 2015 at 14:49 Comment(5)

Since OP will manually port the code to ASM, the last paragraph is probably not critical. – Coriss 29/6, 2015 at 15:7

@Ladyfinger Yes, the multiplication on the RISC is also a single-cycle instruction (see the first sentence in my post), but thanks for the idea, since it sparked a different one that I am going to try. Also, some time later down the road, I am planning to have one additional implementation running just on the Motorola 68000 CPU (without any RISCs), . where mult. is expensive and division super-expensive. So, having the version that does not involve either of them is going to be very important ! For the RISC, though, this version has 16*8 operations, which is more than I have right now. – Inclose 29/6, 2015 at 16:20

@3DCoder For an implementation of integer square root on the M68k, you may want to check out this paper: K. C. Johnson, "ALGORITHM 650: Efficient Square Root Implementation on the 68000", ACM TOMS 13 (1987) 138-151. The program code for the paper is available from netlib – Wollis 1/7, 2015 at 2:52

I just checked this link and it's quite amazing, I must say ! I am soon going to try to compile the C code against the 68000, but will be replacing the most expensive routines (e.g. sqrt) with some hand-optimized ASM. So this link will come in handy very soon. Thanks a lot ! – Inclose 7/7, 2015 at 12:16

Heh looks like this is the same as mine How to get a square root for 32 bit input in one clock cycle only? – Pulse 23/6, 2019 at 8:56

From the comment trail, it seems that the RISC processor only provides 32x32->32 bit multiplication and 16x16->32 bit multiplication. A 32x-32->64 bit widening multiply, or a MULHI instruction returning the upper 32 bits of a 64-bit product is not provided.

This would seem to exclude approaches based on Newton-Raphson iteration, which would likely be inefficient, as they typically require either MULHI instruction or widening multiply for the intermediate fixed-point arithmetic.

The C99 code below uses a different iterative approach that requires only 16x16->32 bit multiplies, but converges somewhat linearly, requiring up to six iterations. This approach requires CLZ functionality to quickly determine a starting guess for the iterations. Asker stated in the comments that the RISC processor used does not provide CLZ functionality. So emulation of CLZ is required, and since the emulation adds to both storage and instruction count, this may make this approach uncompetitive. I performed a brute-force search to determine the deBruijn lookup table with the smallest multiplier.

This iterative algorithm delivers raw results quite close to the desired results, i.e. (int)sqrt(x), but always somewhat on the high side due to the truncating nature of integer arithmetic. To arrive at the final result, the result is conditionally decremented until the square of the result is less than or equal to the original argument.

The use of the volatile qualifier in the code only serves to establish that all named variables can in fact be allocated as 16-bit data without impacting the functionality. I do not know whether this provides any advantage, but noticed that the OP specifically used 16-bit variables in their code. For production code, volatile should be removed.

Note that for most processors, the correction steps at the end should not involve any branching. The product y*y can be subtracted from x with carry-out (or borrow-out), then y is corrected by a subtract with carry-in (or borrow-in). So each step should be a sequence MUL, SUBcc, SUBC.

Because implementation of the iteration by a loop incurs substantial overhead, I have elected to completely unroll the loop, but provide two early-out checks. Tallying the operations manually I count 46 operations for the fastest case, 54 operations for the average case, and 60 operations for the worst case.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

static const uint8_t clz_tab[32] = {
    31, 22, 30, 21, 18, 10, 29,  2, 20, 17, 15, 13, 9,  6, 28, 1,
    23, 19, 11,  3, 16, 14,  7, 24, 12,  4,  8, 25, 5, 26, 27, 0};

uint8_t clz (uint32_t a)
{
    a |= a >> 16;
    a |= a >> 8;
    a |= a >> 4;
    a |= a >> 2;
    a |= a >> 1;
    return clz_tab [0x07c4acdd * a >> 27];
}
  
/* 16 x 16 -> 32 bit unsigned multiplication; should be single instruction */
uint32_t umul16w (uint16_t a, uint16_t b)
{
    return (uint32_t)a * b;
}

/* Reza Hashemian, "Square Rooting Algorithms for Integer and Floating-Point
   Numbers", IEEE Transactions on Computers, Vol. 39, No. 8, Aug. 1990, p. 1025
*/
uint16_t isqrt (uint32_t x)
{
    volatile uint16_t y, z, lsb, mpo, mmo, lz, t;

    if (x == 0) return x; // early out, code below can't handle zero

    lz = clz (x);         // #leading zeros, 32-lz = #bits of argument
    lsb = lz & 1;
    mpo = 17 - (lz >> 1); // m+1, result has roughly half the #bits of argument
    mmo = mpo - 2;        // m-1
    t = 1 << mmo;         // power of two for two's complement of initial guess
    y = t - (x >> (mpo - lsb)); // initial guess for sqrt
    t = t + t;            // power of two for two's complement of result
    z = y;

    y = (umul16w (y, y) >> mpo) + z;
    y = (umul16w (y, y) >> mpo) + z;
    if (x >= 0x40400) {
        y = (umul16w (y, y) >> mpo) + z;
        y = (umul16w (y, y) >> mpo) + z;
        if (x >= 0x1002000) {
            y = (umul16w (y, y) >> mpo) + z;
            y = (umul16w (y, y) >> mpo) + z;
        }
    }

    y = t - y; // raw result is 2's complement of iterated solution
    y = y - umul16w (lsb, (umul16w (y, 19195) >> 16)); // mult. by sqrt(0.5) 

    if ((int32_t)(x - umul16w (y, y)) < 0) y--; // iteration may overestimate 
    if ((int32_t)(x - umul16w (y, y)) < 0) y--; // result, adjust downward if 
    if ((int32_t)(x - umul16w (y, y)) < 0) y--; // necessary 

    return y; // (int)sqrt(x)
}

int main (void)
{
    uint32_t x = 0;
    uint16_t res, ref;

    do {
        ref = (uint16_t)sqrt((double)x);
        res = isqrt (x);
        if (res != ref) {
            printf ("!!!! x=%08x  res=%08x  ref=%08x\n", x, res, ref);
            return EXIT_FAILURE;
        }
        x++;
    } while (x);
    return EXIT_SUCCESS;
}

Another possibility is to use the Newton iteration for the square root, despite the high cost of division. For small inputs only one iteration will be required. Although the asker did not state this, based on the execution time of 16 cycles for the DIV operation I strongly suspect that this is actually a 32/16->16 bit division which requires additional guard code to avoid overflow, defined as a quotient that does not fit into 16 bits. I have added appropriate safeguards to my code based on this assumption.

Since the Newton iteration doubles the number of good bits each time it is applied, we only need a low-precision initial guess which can easily be retrieved from a table based on the five leading bits of the argument. In order to grab these, we first normalize the argument into 2.30 fixed-point format with an additional implicit scale factor of 2^{32-(lz & ~1)} where lz are the number of leading zeros in the argument. As in the previous approach the iteration doesn't always deliver an accurate result, so a correction must be applied should the preliminary result be too big. I count 49 cycles for the fast path, 70 cycles for the slow path (average 60 cycles).

static const uint16_t sqrt_tab[32] = 
{ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
  0x85ff, 0x8cff, 0x94ff, 0x9aff, 0xa1ff, 0xa7ff, 0xadff, 0xb3ff,
  0xb9ff, 0xbeff, 0xc4ff, 0xc9ff, 0xceff, 0xd3ff, 0xd8ff, 0xdcff, 
  0xe1ff, 0xe6ff, 0xeaff, 0xeeff, 0xf3ff, 0xf7ff, 0xfbff, 0xffff
};

/* 32/16->16 bit division. Note: Will overflow if x[31:16] >= y */
uint16_t udiv_32_16 (uint32_t x, uint16_t y)
{
    uint16_t r = x / y;
    return r;
}

/* table lookup for initial guess followed by division-based Newton iteration*/ 
uint16_t isqrt (uint32_t x)
{
    volatile uint16_t q, lz, y, i, xh;

    if (x == 0) return x; // early out, code below can't handle zero

    // initial guess based on leading 5 bits of argument normalized to 2.30
    lz = clz (x);
    i = ((x << (lz & ~1)) >> 27);
    y = sqrt_tab[i] >> (lz >> 1);
    xh = (x >> 16); // needed for overflow check on division

    // first Newton iteration, guard against overflow in division
    q = 0xffff;
    if (xh < y) q = udiv_32_16 (x, y);
    y = (q + y) >> 1;
    if (lz < 10) {
        // second Newton iteration, guard against overflow in division
        q = 0xffff;
        if (xh < y) q = udiv_32_16 (x, y);
        y = (q + y) >> 1;
    }

    if (umul16w (y, y) > x) y--; // adjust quotient if too large

    return y; // (int)sqrt(x)
}

Wollis answered 30/6, 2015 at 21:51 Comment(25)

I think the CLZ emulation is a good idea. I've used it in my code as well. However, note that you can save a few cycles by modifing the table instead of performing arithmetic on clz(n) (see my answer). Since your initial guess is complicated, you might want to calculate x * 0x076be629 >> 27 and then look up values for both m and initial guess with two different tables replacing the clz table. – Mishmash 2/7, 2015 at 2:33

@EdwardDoolittle I realized of course that a couple instructions worth of computation could be pulled back into the table, but wanted to keep my answer sufficiently generic, since this approach should be useful for all platforms with very fast multiplies but in particular those that offer a CLZ-type instruction. The initial guess is not actually complicated, at least not conceptually: it is basically a 1-bit prepended to the right-shifted argument, at the appropriate bit position. There may be a more elegant way of accomplishing the same functionality, but I haven't found it yet. – Wollis 2/7, 2015 at 6:14

@Wollis Thanks a lot for the approach. I admit, I didn't quite get it the first time when you posted that and need some more time to digest it. But the CLZ emulation in itself is worth it, as I needed that quite a few times already, but opted for much cheaper bitshifts (each time the results approached the integer limit). But having a proper, full-precision CLZ is definitely an important feature ! – Inclose 2/7, 2015 at 12:41

Well, on the other hand, the overhead of CLZ emulation is high enough that it might be better to just switch to fixed-point instead of full-integer calculations, since then I can have inverse sqrt (and a lot of other calculations become easier - namely additions). I guess I just wouldn't appreciate the fixed-point properly if I didn't go through the integer approach first :-) – Inclose 2/7, 2015 at 12:44

Would this solution have least amount of ops if I provided a very precise estimate? Last few days I interpolated everything else in the Radiosity computation, and sqrt is the last thing not interpolated yet. I process the square texels in a grid order (e.g. 32 rows x 32 columns). I could calculate the first sqrt of each row in full and use it as estimate for next one (and keep reusing the previous sqrt as an estimate).I'm hoping 1-2 loops should be enough to get to the precise sqrt.It does look like last 5 lines(after loop) have about ~15 ops (vs the 18 of LUT solution).Any way to remove them? – Inclose 6/7, 2015 at 14:51

Of course, I was initially thinking of the Newton solution, as that's the most intuitive one. However it involves division (on top of addition and bitshift), which costs 16 cycles, and should I need 2 iterations of Newton, that would be 2* (16+3+3) = 44 cycles. vs 18*3=54 that I have now.So,I'm now looking for a solution that would provide least amount of ops once I supply it with a very precise estimate (e.g.the estimate will be of the neighbouring texel's center on the regular grid), which will be very close.I just realized I could do a test with Newton to see if 1 iteration would be enough. – Inclose 6/7, 2015 at 15:0

The idea behind using Newton-Raphson (NR) for sqrt is to iterate for rsqrt, then backmultiply: x * rsqrt(x) = sqrt(x) [with exception of zero and infinity]. The NR iteration for rsqrt only requires multiplication, no divisions. Furthermore, in addition to the NR iteration with quadratic convergence there is a Halley iteration with cubic convergence. However, the way I have used the rsqrt iteration requires normalization of the fixed-point number which again requires CLZ. I coded Hashemian's algorithm straight from the paper, don't know whether or how one can modify the starting value – Wollis 6/7, 2015 at 17:7

Back to the drawing board. Yesterday I noticed that the LUT solution (that I praised so much) actually has 2 divisions - it has 2 iterations of Newton at the end. I have no idea how I could have missed that when I was adding the 'benchmarking' code around it, it must have been zoomed out too much and it must have looked like it's a logical OR. double facepalm.. So, instead of 18 ops, it's 16 ops plus 2*16=32 cycles for division. Now I gotta go over all replies here and recheck if one of them isn't an improvement over this, as with 32 additional cycles, it's indeed very probable.My apologies. – Inclose 7/7, 2015 at 12:8

Not knowing your actual processor, and assuming I manually counted the ops correctly, my proposed solution should come in at 37+(1..9)*6 = 43 .. 82 operations, with an average of 67 operations. The LUT approach may still be the fastest way, even with slow divisions. Note that if you want the exact result trunc(sqrt(x)) for all inputs x, the LUT-based code may need some additional fiddling at the end. I would recommend exhaustive testing of a 2**32 inputs. – Wollis 7/7, 2015 at 22:42

@Wollis I didn't really do a test over the full range of ints(though, it's a simple loop),just the range of values that the Radiosity solution needs. Which brings me to a next improvement over the LUT solution. The light source needs to be in a close distance to the wall to have a visible effect, which effectively reduces the range of values for the sqrt (which is used for the distance vector normalization). So, why spend the cycles (in the LUT method) testing for the highest bits ? I should be able to jump straight to the condition that reads from the LUT, saving multiple conditions per sqrt. – Inclose 8/7, 2015 at 15:24

@3DCoder It would have been advantageous to mention restrictions on input range in the original question. – Wollis 8/7, 2015 at 16:49

@Wollis Well, there still needs to be a reference codepath that is as fast as possible. As of now I have about 5 additional codepaths that interpolate various elements of form-factor calculation, each with its own visual errors (some more visible, some less). But the most generic solution must be able to process any input mesh at any distance, even if the light source is too far (as some energy always reaches the distant patches). So, all optimizations related to the input range will only be used in relevant codepaths (and most probably in the demo, where I got full control over the range). – Inclose 8/7, 2015 at 20:10

I haven't yet got to reexamining the solutions, as I'm in the middle of an algorithmic change that could reduce the number of sqrt calls to mere 3%. I spent last two days with form-factor drawings and analyzing multiple numerical distributions of the various radiosity elements and believe I found something new that can be linearly approximated - from early calculations, the error appears to be under 5%, which I'm hoping will not have a major visual impact. My previous sqrt estimate had an error of up to 30%, which had a negative visual impact, removing the characteristic radiosity look&feel. – Inclose 10/7, 2015 at 13:28

I have implemented the above-mentioned sqrt approximation and did indeed reduce the need to compute full-precision sqrt to 3%. 97% of data gets by with a 6-op approximation. While in 80% of cases the error is under 5% (e.g. not really noticeable, visually), in 20% of cases it goes above 100%. I know when that happens, so I would just change the sqrt algorithm on a wall-by-wall basis. I'll see if I can come up with a cleaner, non-hybrid solution. – Inclose 12/7, 2015 at 16:38

I spent some time in the debugger with the current LUT solution and noticed few important features of the algorithm.It doesn't always execute two Newton iterations. Depending on value, it executes it twice, once, or not at all (for very small values). I added some code to analyze the difference between the two and noticed that the difference in the sqrt value is 0 in 60+% of the cases (or diff. of 1), which is 0.016% of the sqrt value. As such, it's effect is visually non-existent, so by commenting it out I have effectively reduced the total complexity by 16 cycles of the division and 3 ops. – Inclose 12/7, 2015 at 16:50

Having spent more time with the current LUT solution and analyzing the typical numerical distribution of the input numbers, it should be easy to tweak it for the most common range - e.g. reorder the binary search from 4 conditions (8 ops = 4x (cmp, jz)) to mere 2 (4 ops). Removing final rounding (error under 0.01%) reduces the ops even further. Also, there is a substantial range, where not even a one iteration of Newton is required, which reduces the total number of ops to 4: (2 entry conditions) + 3 (table look-up) + 1 (jmp) = 8 assembler instructions. That's quite unexpected... – Inclose 13/7, 2015 at 11:36

I just re-read my initial post and noticed my initial sqrt solution had an estimate of 80 instructions (probably more when actually written) and my latest one has 8. That is a 10:1 ratio! I learnt quite a bit about sqrt, but more importantly, learnt the importance of tweaking your data range into the best possible given the algorithm specifics. Since I am premultiplying the data before sqrt, I might as well transform them into range which is best for sqrt (and which gives least amount of error), knowing there is such a range – Inclose 13/7, 2015 at 11:46

I can't seem to make sense of I performed a brute-force search to determine the deBruijn lookup table with the smallest multiplier in conjunction with the multiplier changing from 0x076be629 to 0x07c4acdd. – Epigone 19/7, 2015 at 7:19

Not just the multiplier changed. The previous CLZ emulation had an increment just prior to multiplication to ensure that one factor in the multiplication was a power of two. I first searched for the smallest multiplier for that variant, which is 0x04653adf. I then figured I should check what happens if I remove the increment, which saves one integer ADD. For that variant, the smallest multiplier is 0x07c4acdd. I tried to find magic multipliers with only few 1-bits, but all multipliers that work have a population count of 16, so that turned out to be pretty much "mission impossible". – Wollis 19/7, 2015 at 10:43

How did one go about calculating the early out points? Am attempting to expand this to support 64 bit integers. – Spontaneous 2/1, 2023 at 22:16

@Spontaneous Same approach extended to 128 bits in this answer. I think I determined the cut-over points by experiment. – Wollis 2/1, 2023 at 22:38

@Wollis Sorry, I should've been more specific. Was referring to the implementation that references Reza Hashemian's paper; your two cutoff points don't appear to be arbitrary at all (see oeis.org/A290075/b290075.txt). – Spontaneous 2/1, 2023 at 23:4

Oh, and how is y = y - umul16w (lsb, (umul16w (y, 19195) >> 16)) equivalent to multiplying by the square root of 0.5? That's some wizardry that I've been unable to decipher. I think I need to figure out how to generalize that as well since my function works on N bits (8, 16, 32, 64, 128) instead of 32. – Spontaneous 2/1, 2023 at 23:10

@Spontaneous 1 - (19195 / 65536) = 0.7071075 ~= sqrt (0.5). This is fixed-point arithmetic. As for cutover points, I am pretty sure I simply iterated over all 2**32 arguments to find out where it would fail with one step. Put the cutover for two steps there, then find first failure, put the second cutover point (for three steps) there. – Wollis 2/1, 2023 at 23:21

@Wollis Thanks mate! I was able to sort out a generic implementation here; couldn't have done it without your guidance/code. – Spontaneous 3/1, 2023 at 20:18

I don't know how to turn it into an efficient algorithm but when I investigated this in the '80s an interesting pattern emerged. When rounding square roots, there are two more integers with that square root than the preceding one (after zero).

So, one number (zero) has a square root of zero, two have a square root of 1 (1 and 2), 4 have a square root of two (3, 4, 5 and 6) and so on. Probably not a useful answer but interesting nonetheless.

Snakeroot answered 23/5, 2016 at 19:36 Comment(4)

If you round down, this is another way to view the fact that the sum of the first n odd numbers is n^2. If you just round, I'm not convinced it's true. – Accad 30/5, 2016 at 20:0

Damn!, he's right in the first paragraph but minor correction to the second paragraph: this pattern doesn't work for the square roots of 0, 1, and 2 (it only starts at 3 and onwards.) I seriously doubted this but now I doubt myself after confirming this to be true at least for the square roots of the first 5 billion numbers. Very interesting and thank you for sharing. – Nemato 13/7, 2022 at 22:51

What makes this really interesting is that it's obvious from the standpoint of derivatives that d/dx[x^2] is 2x, which means that the strides between successive square number will be 2 more than the previous stride if we were rounding down. However, what makes this bizzare (for me at least) is that applying normal rounding always increments by 2 without exception. In further tests rounding from decimals other than 0.0 (floor/ceil) or 0.5 (normal rounding), I discovered no other rounding point that causes this behavior. – Nemato 13/7, 2022 at 23:13

I only checked for numbers up to 16 bits (3D graphics in 6502 on the C64) but rounding to the nearest integer was spot on every time. Weird, that. – Snakeroot 19/7, 2022 at 10:1

Here is a less incremental version of the technique @j_random_hacker described. On at least one processor it was just a bit faster when I fiddled with this a couple of years ago. I have no idea why.

// assumes unsigned is 32 bits
unsigned isqrt1(unsigned x) {
  unsigned r = 0, r2 = 0; 
  for (int p = 15; p >= 0; --p) {
    unsigned tr2 = r2 + (r << (p + 1)) + (1u << (p + p));
    if (tr2 <= x) {
      r2 = tr2;
      r |= (1u << p);
    }
  }
  return r;
}

/*
gcc 6.3 -O2
isqrt(unsigned int):
        mov     esi, 15
        xor     r9d, r9d
        xor     eax, eax
        mov     r8d, 1
.L3:
        lea     ecx, [rsi+1]
        mov     edx, eax
        mov     r10d, r8d
        sal     edx, cl
        lea     ecx, [rsi+rsi]
        sal     r10d, cl
        add     edx, r10d
        add     edx, r9d
        cmp     edx, edi
        ja      .L2
        mov     r11d, r8d
        mov     ecx, esi
        mov     r9d, edx
        sal     r11d, cl
        or      eax, r11d
.L2:
        sub     esi, 1
        cmp     esi, -1
        jne     .L3
        rep ret
*/

If you turn up gcc 9 x86 optimization, it completely unrolls the loop and folds constants. The result is still only about 100 instructions.

Unarmed answered 22/6, 2019 at 16:47 Comment(3)

Thanks, but this is way too many ops compared to the baseline algorithm in my post, which only has 4 ops (OR, MUL, SHIFT, CMP) within the loop. Yours has 8 (plus 4 inside the condition) and there would be at least 4-5 more for intermediate results (target ASM doesn't have 3-operand instructions, hence lots of MOVE ops) that would be needed for computing tr2. At the very least, the total Instruction count would be: 16 * (2(loop) + 2(condition) + 12(tr2))=16*16 =256, plus however many times the condition (tr2<=x) would get invoked. So, around ~300, compared to ~17 in my LUT approach(Method2). – Inclose 23/2, 2022 at 20:17

This algorithm was designed (by Knuth I think) when multiply was extremely expensive. On small microcontrollers, it still is. I posted because it's interesting and in case avoiding multiplication is a good thing. I made no claims about its speed in your application. – Unarmed 24/2, 2022 at 2:45

Yeah, it might be worth trying on something like 6502, though you would have to handle lo/hi byte somehow. Because my RISC does MUL in same 3 cycles , it makes no sense for me. – Inclose 24/2, 2022 at 6:6

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags