How to get a square root for 32 bit input in one clock cycle only?

Asked 7/1, 2016 at 9:50 Answered 19/7, 2022 at 0:0

I want to design a synthesizable module in Verilog which will take only one cycle in calculating square root of given input of 32 bit.

Odelle answered 7/1, 2016 at 9:50 Comment(8)

If you really want just 1T of clock then the only way I can think of is 2^32 or 2^31 or 2^30 x 2 Byte precomputed LUT table in some ROM size 2-8GB depend on signed/unsigned type and if you want to ignore the LSB. Any approach to compute this I know of needs few iterations (approximation) or up to 16T for binary search. If different 1T approach exist then is probably based on PCA and need different LUT tables. (sorry for the multiple edits of this comment ...) – Ellie 7/1, 2016 at 10:4

i have to use this module many times so that will increase the overhead if i used LUT table – Odelle 7/1, 2016 at 13:26

You need to compromise between speed and used gate numbers ... your best bet is approximation polynomial or binary search ... but both need to use multiplication as core function so you need fast multiplication 16b x 16b = 32b too, but that can be done in just few T too (2-3T if I imagine it correctly). – Ellie 7/1, 2016 at 13:30

I'm voting to close this question as off-topic because it's not related to Verilog – Graf 7/1, 2016 at 14:40

one clock cycle for the whole calculation or via a pipeline being able to average one result per cycle? – Waterfowl 7/1, 2016 at 14:53

@Henry I think that is because here are most people programmers not chip developers. For FPGA / chip design questions I think there would be more luck in electrical engineering, DSP and MCU oriented sites instead of here. – Ellie 7/1, 2016 at 18:0

Maybe you can think of a parallelized/pipelined implementation of en.wikipedia.org/wiki/… – Comparable 13/2, 2016 at 16:57

If you don't want to sidestep the cycle issue using a humongous self-timing or combinatorial circuit, table lookup seems the way to go. input of 32 bit doesn't much to specify number representation (Spektre seems to assume integer). Using linear approximation, the table(s) needed shouldn't be quite as large. – Classify 1/5, 2016 at 17:23

[Edit1] repaired code

Recently found the results where off even if tests determine all was OK so I dig deeper and found out that I had a silly bug in my equation and due to name conflicts with my pgm environment the tests got false positives so I overlooked it before. Now it work in all cases as it should.

The best thing I can think of (except approximation or large LUT) is binary search without multiplication, here C++ code:

//---------------------------------------------------------------------------
WORD u32_sqrt(DWORD xx) // 16 T
    {
    DWORD x,m,a0,a1,i;
    const DWORD lut[16]=
        {
        //     m*m
        0x40000000,
        0x10000000,
        0x04000000,
        0x01000000,
        0x00400000,
        0x00100000,
        0x00040000,
        0x00010000,
        0x00004000,
        0x00001000,
        0x00000400,
        0x00000100,
        0x00000040,
        0x00000010,
        0x00000004,
        0x00000001,
        };
    for (x=0,a0=0,m=0x8000,i=0;m;m>>=1,i++)
        {
        a1=a0+lut[i]+(x<<(16-i));
        if (a1<=xx) { a0=a1; x|=m; }
        }
    return x;
    }
//---------------------------------------------------------------------------

Standard binary search sqrt(xx) is setting bits of x from MSB to LSB so that result of x*x <= xx. Luckily we can avoid the multiplication by simply rewrite the thing as incrementing multiplicant... in each iteration the older x*x result can be used like this:

x1 = x0+m
x1*x1 = (x0+m)*(x0+m) = (x0*x0) + (2*m*x0) + (m*m)

Where x0 is value of x from last iteration and x1 is actual value. The m is weight of actual processed bit. The (2*m) and (m*m) are constant and can be used as LUT and bit-shift so no need to multiply. Only addition is needed. Sadly the iteration is bound to sequential computation forbid paralelisation so the result is 16T at best.

In the code a0 represents last x*x and a1 represents actual iterated x*x

As you can see the sqrt is done in 16 x (BitShiftLeft,BitShiftRight,OR,Plus,Compare) where the bit shift and LUT can be hardwired.

If you got super fast gates for this in comparison to the rest you can multiply the input clock by 16 and use that as internal timing for SQRT module. Something similar to the old days when there was MC clock as Division of source CPU clock in old Intel CPU/MCUs ... This way you can get 1T timing (or multiple of it depends on the multiplication ratio).

Ellie answered 7/1, 2016 at 14:44 Comment(6)

So... you changed the tags to create a legitimate question, and then answered the question in C++, when the OP wanted Verilog? Is that kosher? Perhaps you should now remove the Verilog tag? And what about the OP's request for a single-cycle solution - you're saying that 16 cycles can be one cycle if they're fast enough? – Graf 7/1, 2016 at 17:10

@Graf As mentioned this does not really answer in 1T anyway the language tag means that Answer is preferred in Verilog (VHDL) language but that does not mean this is strictly bound to that language. I could also draw the circuit instead C++ code but too lazy for that. I changed the tags because those I choose I feel are more important then the language itself for this task and should attract the right people for this. The circuit part is done only when you know what and how exactly it is done. But of coarse if you rewrite this to VHDL it will be more suited and probably get more votes. – Ellie 7/1, 2016 at 17:54

Thanks for your answers – Odelle 7/1, 2016 at 19:13

Verilog and VHDL are different languages, both HDLs but VHDL does not stand for Verilog HDL. – Tate 7/1, 2016 at 23:6

Morgan is right, it stands for VHSIC HDL (an acronym inside an acronym, fantastic!), which stands for Very High Speed Integrated Circuit Hardware Description Language... which may very well be the worst acronym ever. – Distant 8/1, 2016 at 13:30

My version with variable bit count in. 2 to 32 bits even numbers – Haag 18/7, 2022 at 23:59

In 2018, T. Bagala, A. Fibich, M. Hagara, P. Kubinec, O. Ondráček, V. Štofanik and R. Stojanović authored Single Clock Square Root Algorithm Based on Binomial Series and its FPGA Implementation.

Local oscillator runs at 50MHz [… For 16 bit input mantissa,] Values from [the hardware] experiment were the same as values from simulation […] Obtained delay averages were 892ps and 906ps respectively.

(No explanation about the discrepancy between 50MHz and .9ns or the quoted ps resolution and the use of a 10Gsps scope. If it was about 18 cycles (due to pipelining rather than looping?)/~900*ns*, interpretation of Single Clock Square Root… remains open - may be one result per cycle.)
The paper discloses next no details about the evaluation of the binomial series.
While the equations are presented in a general form, too, my guess is that the amount of hardware needed for a greater number of bits gets prohibitive quickly.

Classify answered 24/11, 2019 at 10:39 Comment(3)

Just found your new post +1 maybe the 0.9ps is just the propagation delay between input and output ... and 50MHz is just the clock they used even if the limit is much higher (or limited by noise and HF problems instead of gate speed) do you have any link to this ? maybe even in original language from the names I assume Czech ... would love to read it ... – Ellie 30/9, 2020 at 7:37

I find it at semanticscholar; no idea currently where I saw the contents - I may have paid a visit to my alma mater. – Classify 30/9, 2020 at 11:24

I found out this earlier: ieeexplore.ieee.org/document/8406022 but have no access however my coleage has and he sended the paper back to me just now it looks much better than yours (from understandability point) looks like they are exploiting: sqrt(1+x) = 1 + (x^1)/2 - (x^2)/8 + (x^3)/16 - 5*(x^4)/128 + .... and the fact that already computed subresults digits are not changing ... – Ellie 30/9, 2020 at 11:24

There is conversion to a logarithm, halving, and converting back.
For an idea how to implement "combinatorial" log and antilog, see Michael Dunn's EDN article showing priority encoder, barrel shifter & lookup table, with three log variants in System Verilog for down-load.
(Priority encoder, barrel shifter & lookup table look promising for "one-step-Babylonian/Heron/Newton/-Raphson. But that would probably still need a 128K by 9 bits lookup table.)

While not featuring "verilog",
Tole Sutikno: "An Optimized Square Root Algorithm for Implementation in FPGA Hardware" shows a combinatorial implementation of a modified (binary) digit-by-digit algorithm.

Classify answered 13/2, 2018 at 16:53 Comment(1)

log approach sounds promising (+1) but hard to say if it is faster than mine or not as there are still iterations + LUTs would be nice to try it out on HW and compare... – Ellie 24/11, 2019 at 8:42

I got the code here it is

    module sqrt(
input[31:0]a,
output[15:0]out
    );
reg [31:0]temp;
reg[14:0]x;

always@(a)
begin
if(a<257)x=4;
if(a>256 && a<65537)x=80;
if(a>65536 && a<16777217)x=1000;
if(a>16777216 && a<=4294967295)x=20000;
temp=(x+(a/x))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
end

assign out=temp;
endmodule

Odelle answered 7/1, 2016 at 19:12 Comment(5)

task sqrt; input[31:0]a; output[15:0]out; reg [31:0]temp,temp1,temp2,temp3,temp4,temp5,temp6; reg [14:0]x; begin if(a<257)x=4; if(a>256 && a<65537)x=80; if(a>65536 && a<16777217)x=1000; if(a>16777216 && a<=4294967295)x=20000; temp=(x+(a/x))/2; temp1=(temp+(a/temp))/2; temp2=(temp1+(a/temp1))/2; temp3=(temp2+(a/temp2))/2; temp4=(temp3+(a/temp3))/2; temp5=(temp4+(a/temp4))/2; temp6=(temp5+(a/temp5))/2; out=temp6; – Odelle 13/2, 2016 at 16:39

This comment is not an explanation. If you want to post an updated solution, just click on edit below the this answer. – Trompe 13/2, 2016 at 17:50

Doesn't use of divide operation seven times (a/temp) make this implementation virtually unusable? My understanding is that true hardware implementations for integer sqrt and integer div are implemented using similar algorithms (testing bits in a result from highest one to lowest one) and should have similar latency. – Wimsatt 1/5, 2016 at 8:0

Basically Newtons method: en.wikipedia.org/wiki/Methods_of_computing_square_roots – Stubbed 1/5, 2016 at 15:25

indent your code properly, and edit it to add explanation. Code in comments should be put between backticks `likethis` to make it readable – Charlottecharlottenburg 1/7, 2016 at 6:13

The usual means of doing this in hardware is using a CORDIC. A general implementation allows the calculation of a variety of transcendental functions (cos/sin/tan) and... square roots depending on how you initialize and operate the CORDIC.

It's an iterative algorithm so to do it in a single cycle you'd unroll the loop into as many iterations as you require for your desired precision and chain the instances together.

Specifically if you operated the CORDIC in vectoring mode, initialize it with [x, 0] and rotate to 45 degrees the [x', y'] final output will be a multiplicative constant away. i.e. sqrt(x) = x' * sqrt(2) * K

Condemnatory answered 13/2, 2018 at 14:11 Comment(4)

What about one clock cycle only? – Classify 13/2, 2018 at 14:53

If you unroll the loop it is one cycle as I stated. So if you had a block that did one iteration of the algorithm you'd simple chain inputs to outputs (stage->stage->stage->....). Each stage gets closer to the answer in a predictable way so you simple have as many stages as your desired precision requires. – Condemnatory 13/2, 2018 at 14:57

Making for a horribly long delay and slow clock, as every other iterative algorithm converted to a combinatorial circuit. (With kind of an exception for exactly one iteration.) – Classify 13/2, 2018 at 15:0

Uh, ok. The question was how to calculate sqrt in a single cycle. All the answers here are either a) iterative or b) Use dividers. Square root is a non-trivial math operation in hardware. You get to pick small (options presented above) or fast (big ass precomputed table). You can't have both. – Condemnatory 13/2, 2018 at 15:8

My version of Spektre with variable bits count in so it can be faster on short squares.

    const unsigned int isqrt_lut[16] =
{
    //     m*m
    0x40000000,
    0x10000000,
    0x04000000,
    0x01000000,
    0x00400000,
    0x00100000,
    0x00040000,
    0x00010000,
    0x00004000,
    0x00001000,
    0x00000400,
    0x00000100,
    0x00000040,
    0x00000010,
    0x00000004,
    0x00000001,
};
/// Our largest golf ball image is about 74 pixels, so lets round up to power of 2 and we get 128.
/// 128 squared is 16384 so out largest sqrt has to handle 16383 or 14 bits. Only positive values.
/// ** maxBits in is 2 to 32 always an even number **
/// Input value mist always be less than (2^maxBits) - 1
unsigned int isqrt(unsigned int xx, int maxBitsIn) {
    DWORD x, m, a0, a1, i;

    for (x = 0, a0 = 0, m = 0x01 << (maxBitsIn / 2 - 1), i = 16 - maxBitsIn / 2; m; m >>= 1, i++)
    {
        a1 = a0 + isqrt_lut[i] + (x << (16 - i));
        if (a1 <= xx) {
            a0 = a1;
            x |= m;
        }
    }
    return x;
}

Haag answered 19/7, 2022 at 0:0 Comment(2)

Did you yourself write the code presented? What about one clock cycle only? – Classify 19/7, 2022 at 4:42

It is Spekres' code. I adapted it to take a bit precision input. On can not get the sqrt in a single clock cycle at 3GHz. One "single clock" way is to just have a huge lookup table but that has memory access issues that slow it down. If the ALU in the CPU has a sqrt instruction it can be very fast, but still not one clock cycle. – Haag 21/7, 2022 at 16:1

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags