How to get a square root for 32 bit input in one clock cycle only?
Asked Answered
O

6

2

I want to design a synthesizable module in Verilog which will take only one cycle in calculating square root of given input of 32 bit.

Odelle answered 7/1, 2016 at 9:50 Comment(8)
If you really want just 1T of clock then the only way I can think of is 2^32 or 2^31 or 2^30 x 2 Byte precomputed LUT table in some ROM size 2-8GB depend on signed/unsigned type and if you want to ignore the LSB. Any approach to compute this I know of needs few iterations (approximation) or up to 16T for binary search. If different 1T approach exist then is probably based on PCA and need different LUT tables. (sorry for the multiple edits of this comment ...)Ellie
i have to use this module many times so that will increase the overhead if i used LUT tableOdelle
You need to compromise between speed and used gate numbers ... your best bet is approximation polynomial or binary search ... but both need to use multiplication as core function so you need fast multiplication 16b x 16b = 32b too, but that can be done in just few T too (2-3T if I imagine it correctly).Ellie
I'm voting to close this question as off-topic because it's not related to VerilogGraf
one clock cycle for the whole calculation or via a pipeline being able to average one result per cycle?Waterfowl
@Henry I think that is because here are most people programmers not chip developers. For FPGA / chip design questions I think there would be more luck in electrical engineering, DSP and MCU oriented sites instead of here.Ellie
Maybe you can think of a parallelized/pipelined implementation of en.wikipedia.org/wiki/…Comparable
If you don't want to sidestep the cycle issue using a humongous self-timing or combinatorial circuit, table lookup seems the way to go. input of 32 bit doesn't much to specify number representation (Spektre seems to assume integer). Using linear approximation, the table(s) needed shouldn't be quite as large.Classify
E
3

[Edit1] repaired code

Recently found the results where off even if tests determine all was OK so I dig deeper and found out that I had a silly bug in my equation and due to name conflicts with my pgm environment the tests got false positives so I overlooked it before. Now it work in all cases as it should.

The best thing I can think of (except approximation or large LUT) is binary search without multiplication, here C++ code:

//---------------------------------------------------------------------------
WORD u32_sqrt(DWORD xx) // 16 T
    {
    DWORD x,m,a0,a1,i;
    const DWORD lut[16]=
        {
        //     m*m
        0x40000000,
        0x10000000,
        0x04000000,
        0x01000000,
        0x00400000,
        0x00100000,
        0x00040000,
        0x00010000,
        0x00004000,
        0x00001000,
        0x00000400,
        0x00000100,
        0x00000040,
        0x00000010,
        0x00000004,
        0x00000001,
        };
    for (x=0,a0=0,m=0x8000,i=0;m;m>>=1,i++)
        {
        a1=a0+lut[i]+(x<<(16-i));
        if (a1<=xx) { a0=a1; x|=m; }
        }
    return x;
    }
//---------------------------------------------------------------------------

Standard binary search sqrt(xx) is setting bits of x from MSB to LSB so that result of x*x <= xx. Luckily we can avoid the multiplication by simply rewrite the thing as incrementing multiplicant... in each iteration the older x*x result can be used like this:

x1 = x0+m
x1*x1 = (x0+m)*(x0+m) = (x0*x0) + (2*m*x0) + (m*m)

Where x0 is value of x from last iteration and x1 is actual value. The m is weight of actual processed bit. The (2*m) and (m*m) are constant and can be used as LUT and bit-shift so no need to multiply. Only addition is needed. Sadly the iteration is bound to sequential computation forbid paralelisation so the result is 16T at best.

In the code a0 represents last x*x and a1 represents actual iterated x*x

As you can see the sqrt is done in 16 x (BitShiftLeft,BitShiftRight,OR,Plus,Compare) where the bit shift and LUT can be hardwired.

If you got super fast gates for this in comparison to the rest you can multiply the input clock by 16 and use that as internal timing for SQRT module. Something similar to the old days when there was MC clock as Division of source CPU clock in old Intel CPU/MCUs ... This way you can get 1T timing (or multiple of it depends on the multiplication ratio).

Ellie answered 7/1, 2016 at 14:44 Comment(6)
So... you changed the tags to create a legitimate question, and then answered the question in C++, when the OP wanted Verilog? Is that kosher? Perhaps you should now remove the Verilog tag? And what about the OP's request for a single-cycle solution - you're saying that 16 cycles can be one cycle if they're fast enough?Graf
@Graf As mentioned this does not really answer in 1T anyway the language tag means that Answer is preferred in Verilog (VHDL) language but that does not mean this is strictly bound to that language. I could also draw the circuit instead C++ code but too lazy for that. I changed the tags because those I choose I feel are more important then the language itself for this task and should attract the right people for this. The circuit part is done only when you know what and how exactly it is done. But of coarse if you rewrite this to VHDL it will be more suited and probably get more votes.Ellie
Thanks for your answersOdelle
Verilog and VHDL are different languages, both HDLs but VHDL does not stand for Verilog HDL.Tate
Morgan is right, it stands for VHSIC HDL (an acronym inside an acronym, fantastic!), which stands for Very High Speed Integrated Circuit Hardware Description Language... which may very well be the worst acronym ever.Distant
My version with variable bit count in. 2 to 32 bits even numbersHaag
C
3

In 2018, T. Bagala, A. Fibich, M. Hagara, P. Kubinec, O. Ondráček, V. Štofanik and R. Stojanović authored Single Clock Square Root Algorithm Based on Binomial Series and its FPGA Implementation.

Local oscillator runs at 50MHz [… For 16 bit input mantissa,] Values from [the hardware] experiment were the same as values from simulation […] Obtained delay averages were 892ps and 906ps respectively.

(No explanation about the discrepancy between 50MHz and .9ns or the quoted ps resolution and the use of a 10Gsps scope. If it was about 18 cycles (due to pipelining rather than looping?)/~900*ns*, interpretation of Single Clock Square Root… remains open - may be one result per cycle.)
The paper discloses next no details about the evaluation of the binomial series.
While the equations are presented in a general form, too, my guess is that the amount of hardware needed for a greater number of bits gets prohibitive quickly.

Classify answered 24/11, 2019 at 10:39 Comment(3)
Just found your new post +1 maybe the 0.9ps is just the propagation delay between input and output ... and 50MHz is just the clock they used even if the limit is much higher (or limited by noise and HF problems instead of gate speed) do you have any link to this ? maybe even in original language from the names I assume Czech ... would love to read it ...Ellie
I find it at semanticscholar; no idea currently where I saw the contents - I may have paid a visit to my alma mater.Classify
I found out this earlier: ieeexplore.ieee.org/document/8406022 but have no access however my coleage has and he sended the paper back to me just now it looks much better than yours (from understandability point) looks like they are exploiting: sqrt(1+x) = 1 + (x^1)/2 - (x^2)/8 + (x^3)/16 - 5*(x^4)/128 + .... and the fact that already computed subresults digits are not changing ...Ellie
C
1

There is conversion to a logarithm, halving, and converting back.
For an idea how to implement "combinatorial" log and antilog, see Michael Dunn's EDN article showing priority encoder, barrel shifter & lookup table, with three log variants in System Verilog for down-load.
(Priority encoder, barrel shifter & lookup table look promising for "one-step-Babylonian/Heron/Newton/-Raphson. But that would probably still need a 128K by 9 bits lookup table.)

While not featuring "verilog",
Tole Sutikno: "An Optimized Square Root Algorithm for Implementation in FPGA Hardware" shows a combinatorial implementation of a modified (binary) digit-by-digit algorithm.

Classify answered 13/2, 2018 at 16:53 Comment(1)
log approach sounds promising (+1) but hard to say if it is faster than mine or not as there are still iterations + LUTs would be nice to try it out on HW and compare...Ellie
O
0

I got the code here it is

    module sqrt(
input[31:0]a,
output[15:0]out
    );
reg [31:0]temp;
reg[14:0]x;

always@(a)
begin
if(a<257)x=4;
if(a>256 && a<65537)x=80;
if(a>65536 && a<16777217)x=1000;
if(a>16777216 && a<=4294967295)x=20000;
temp=(x+(a/x))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
end

assign out=temp;
endmodule
Odelle answered 7/1, 2016 at 19:12 Comment(5)
task sqrt; input[31:0]a; output[15:0]out; reg [31:0]temp,temp1,temp2,temp3,temp4,temp5,temp6; reg [14:0]x; begin if(a<257)x=4; if(a>256 && a<65537)x=80; if(a>65536 && a<16777217)x=1000; if(a>16777216 && a<=4294967295)x=20000; temp=(x+(a/x))/2; temp1=(temp+(a/temp))/2; temp2=(temp1+(a/temp1))/2; temp3=(temp2+(a/temp2))/2; temp4=(temp3+(a/temp3))/2; temp5=(temp4+(a/temp4))/2; temp6=(temp5+(a/temp5))/2; out=temp6;Odelle
This comment is not an explanation. If you want to post an updated solution, just click on edit below the this answer.Trompe
Doesn't use of divide operation seven times (a/temp) make this implementation virtually unusable? My understanding is that true hardware implementations for integer sqrt and integer div are implemented using similar algorithms (testing bits in a result from highest one to lowest one) and should have similar latency.Wimsatt
Basically Newtons method: en.wikipedia.org/wiki/Methods_of_computing_square_rootsStubbed
indent your code properly, and edit it to add explanation. Code in comments should be put between backticks `likethis` to make it readableCharlottecharlottenburg
C
0

The usual means of doing this in hardware is using a CORDIC. A general implementation allows the calculation of a variety of transcendental functions (cos/sin/tan) and... square roots depending on how you initialize and operate the CORDIC.

It's an iterative algorithm so to do it in a single cycle you'd unroll the loop into as many iterations as you require for your desired precision and chain the instances together.

Specifically if you operated the CORDIC in vectoring mode, initialize it with [x, 0] and rotate to 45 degrees the [x', y'] final output will be a multiplicative constant away. i.e. sqrt(x) = x' * sqrt(2) * K

Condemnatory answered 13/2, 2018 at 14:11 Comment(4)
What about one clock cycle only?Classify
If you unroll the loop it is one cycle as I stated. So if you had a block that did one iteration of the algorithm you'd simple chain inputs to outputs (stage->stage->stage->....). Each stage gets closer to the answer in a predictable way so you simple have as many stages as your desired precision requires.Condemnatory
Making for a horribly long delay and slow clock, as every other iterative algorithm converted to a combinatorial circuit. (With kind of an exception for exactly one iteration.)Classify
Uh, ok. The question was how to calculate sqrt in a single cycle. All the answers here are either a) iterative or b) Use dividers. Square root is a non-trivial math operation in hardware. You get to pick small (options presented above) or fast (big ass precomputed table). You can't have both.Condemnatory
H
0

My version of Spektre with variable bits count in so it can be faster on short squares.

    const unsigned int isqrt_lut[16] =
{
    //     m*m
    0x40000000,
    0x10000000,
    0x04000000,
    0x01000000,
    0x00400000,
    0x00100000,
    0x00040000,
    0x00010000,
    0x00004000,
    0x00001000,
    0x00000400,
    0x00000100,
    0x00000040,
    0x00000010,
    0x00000004,
    0x00000001,
};
/// Our largest golf ball image is about 74 pixels, so lets round up to power of 2 and we get 128.
/// 128 squared is 16384 so out largest sqrt has to handle 16383 or 14 bits. Only positive values.
/// ** maxBits in is 2 to 32 always an even number **
/// Input value mist always be less than (2^maxBits) - 1
unsigned int isqrt(unsigned int xx, int maxBitsIn) {
    DWORD x, m, a0, a1, i;

    for (x = 0, a0 = 0, m = 0x01 << (maxBitsIn / 2 - 1), i = 16 - maxBitsIn / 2; m; m >>= 1, i++)
    {
        a1 = a0 + isqrt_lut[i] + (x << (16 - i));
        if (a1 <= xx) {
            a0 = a1;
            x |= m;
        }
    }
    return x;
}
Haag answered 19/7, 2022 at 0:0 Comment(2)
Did you yourself write the code presented? What about one clock cycle only?Classify
It is Spekres' code. I adapted it to take a bit precision input. On can not get the sqrt in a single clock cycle at 3GHz. One "single clock" way is to just have a huge lookup table but that has memory access issues that slow it down. If the ALU in the CPU has a sqrt instruction it can be very fast, but still not one clock cycle.Haag

© 2022 - 2024 — McMap. All rights reserved.