I want to design a synthesizable module in Verilog which will take only one cycle in calculating square root of given input of 32 bit.
[Edit1] repaired code
Recently found the results where off even if tests determine all was OK so I dig deeper and found out that I had a silly bug in my equation and due to name conflicts with my pgm environment the tests got false positives so I overlooked it before. Now it work in all cases as it should.
The best thing I can think of (except approximation or large LUT) is binary search without multiplication, here C++ code:
//---------------------------------------------------------------------------
WORD u32_sqrt(DWORD xx) // 16 T
{
DWORD x,m,a0,a1,i;
const DWORD lut[16]=
{
// m*m
0x40000000,
0x10000000,
0x04000000,
0x01000000,
0x00400000,
0x00100000,
0x00040000,
0x00010000,
0x00004000,
0x00001000,
0x00000400,
0x00000100,
0x00000040,
0x00000010,
0x00000004,
0x00000001,
};
for (x=0,a0=0,m=0x8000,i=0;m;m>>=1,i++)
{
a1=a0+lut[i]+(x<<(16-i));
if (a1<=xx) { a0=a1; x|=m; }
}
return x;
}
//---------------------------------------------------------------------------
Standard binary search sqrt(xx)
is setting bits of x
from MSB to LSB so that result of x*x <= xx
. Luckily we can avoid the multiplication by simply rewrite the thing as incrementing multiplicant... in each iteration the older x*x
result can be used like this:
x1 = x0+m
x1*x1 = (x0+m)*(x0+m) = (x0*x0) + (2*m*x0) + (m*m)
Where x0
is value of x
from last iteration and x1
is actual value. The m
is weight of actual processed bit. The (2*m)
and (m*m)
are constant and can be used as LUT and bit-shift so no need to multiply. Only addition is needed. Sadly the iteration is bound to sequential computation forbid paralelisation so the result is 16T
at best.
In the code a0
represents last x*x
and a1
represents actual iterated x*x
As you can see the sqrt
is done in 16 x (BitShiftLeft,BitShiftRight,OR,Plus,Compare)
where the bit shift and LUT can be hardwired.
If you got super fast gates for this in comparison to the rest you can multiply the input clock by 16
and use that as internal timing for SQRT module. Something similar to the old days when there was MC clock as Division of source CPU clock in old Intel CPU/MCUs ... This way you can get 1T
timing (or multiple of it depends on the multiplication ratio).
1T
anyway the language tag means that Answer is preferred in Verilog (VHDL) language but that does not mean this is strictly bound to that language. I could also draw the circuit instead C++ code but too lazy for that. I changed the tags because those I choose I feel are more important then the language itself for this task and should attract the right people for this. The circuit part is done only when you know what and how exactly it is done. But of coarse if you rewrite this to VHDL it will be more suited and probably get more votes. –
Ellie In 2018, T. Bagala, A. Fibich, M. Hagara, P. Kubinec, O. Ondráček, V. Štofanik and R. Stojanović authored Single Clock Square Root Algorithm Based on Binomial Series and its FPGA Implementation.
Local oscillator runs at 50MHz [… For 16 bit input mantissa,] Values from [the hardware] experiment were the same as values from simulation […] Obtained delay averages were 892ps and 906ps respectively.
(No explanation about the discrepancy between 50MHz and .9ns or the quoted ps resolution and the use of a 10Gsps scope. If it was about 18 cycles (due to pipelining rather than looping?)/~900*ns*, interpretation of Single Clock Square Root… remains open - may be one result per cycle.)
The paper discloses next no details about the evaluation of the binomial series.
While the equations are presented in a general form, too, my guess is that the amount of hardware needed for a greater number of bits gets prohibitive quickly.
0.9ps
is just the propagation delay between input and output ... and 50MHz is just the clock they used even if the limit is much higher (or limited by noise and HF problems instead of gate speed) do you have any link to this ? maybe even in original language from the names I assume Czech ... would love to read it ... –
Ellie sqrt(1+x) = 1 + (x^1)/2 - (x^2)/8 + (x^3)/16 - 5*(x^4)/128 + ....
and the fact that already computed subresults digits are not changing ... –
Ellie There is conversion to a logarithm, halving, and converting back.
For an idea how to implement "combinatorial" log and antilog, see Michael Dunn's EDN article showing priority encoder, barrel shifter & lookup table, with three log variants in System Verilog for down-load.
(Priority encoder, barrel shifter & lookup table look promising for "one-step-Babylonian/Heron/Newton/-Raphson. But that would probably still need a 128K by 9 bits lookup table.)
While not featuring "verilog",
Tole Sutikno: "An Optimized Square Root Algorithm for Implementation in FPGA Hardware" shows a combinatorial implementation of a modified (binary) digit-by-digit algorithm.
log
approach sounds promising (+1) but hard to say if it is faster than mine or not as there are still iterations + LUTs would be nice to try it out on HW and compare... –
Ellie I got the code here it is
module sqrt(
input[31:0]a,
output[15:0]out
);
reg [31:0]temp;
reg[14:0]x;
always@(a)
begin
if(a<257)x=4;
if(a>256 && a<65537)x=80;
if(a>65536 && a<16777217)x=1000;
if(a>16777216 && a<=4294967295)x=20000;
temp=(x+(a/x))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
end
assign out=temp;
endmodule
a/temp
) make this implementation virtually unusable? My understanding is that true hardware implementations for integer sqrt and integer div are implemented using similar algorithms (testing bits in a result from highest one to lowest one) and should have similar latency. –
Wimsatt `likethis`
to make it readable –
Charlottecharlottenburg The usual means of doing this in hardware is using a CORDIC. A general implementation allows the calculation of a variety of transcendental functions (cos/sin/tan) and... square roots depending on how you initialize and operate the CORDIC.
It's an iterative algorithm so to do it in a single cycle you'd unroll the loop into as many iterations as you require for your desired precision and chain the instances together.
Specifically if you operated the CORDIC in vectoring mode, initialize it with [x, 0] and rotate to 45 degrees the [x', y'] final output will be a multiplicative constant away. i.e. sqrt(x) = x' * sqrt(2) * K
one clock cycle only?
–
Classify My version of Spektre with variable bits count in so it can be faster on short squares.
const unsigned int isqrt_lut[16] =
{
// m*m
0x40000000,
0x10000000,
0x04000000,
0x01000000,
0x00400000,
0x00100000,
0x00040000,
0x00010000,
0x00004000,
0x00001000,
0x00000400,
0x00000100,
0x00000040,
0x00000010,
0x00000004,
0x00000001,
};
/// Our largest golf ball image is about 74 pixels, so lets round up to power of 2 and we get 128.
/// 128 squared is 16384 so out largest sqrt has to handle 16383 or 14 bits. Only positive values.
/// ** maxBits in is 2 to 32 always an even number **
/// Input value mist always be less than (2^maxBits) - 1
unsigned int isqrt(unsigned int xx, int maxBitsIn) {
DWORD x, m, a0, a1, i;
for (x = 0, a0 = 0, m = 0x01 << (maxBitsIn / 2 - 1), i = 16 - maxBitsIn / 2; m; m >>= 1, i++)
{
a1 = a0 + isqrt_lut[i] + (x << (16 - i));
if (a1 <= xx) {
a0 = a1;
x |= m;
}
}
return x;
}
one clock cycle only
? –
Classify © 2022 - 2024 — McMap. All rights reserved.
1T
of clock then the only way I can think of is2^32 or 2^31 or 2^30 x 2 Byte
precomputed LUT table in some ROM size2-8GB
depend on signed/unsigned type and if you want to ignore the LSB. Any approach to compute this I know of needs few iterations (approximation) or up to16T
for binary search. If different1T
approach exist then is probably based on PCA and need different LUT tables. (sorry for the multiple edits of this comment ...) – Ellie16b x 16b = 32b
too, but that can be done in just fewT
too (2-3T if I imagine it correctly). – Ellieinput of 32 bit
doesn't much to specify number representation (Spektre seems to assume integer). Using linear approximation, the table(s) needed shouldn't be quite as large. – Classify