A possible alternative is to compute sqrt((x*x+y*y)/2
instead, which scales all the possible vector magnitudes to range 0..255.
Two (fast) algorithms seem to give near perfect results, one with Cordic, another with maximum of dot products.
void cordic_it(uint16 &x, uint16 &y, int n) {
auto X = x + y >> n; // vsraq_n_u16(x, y, n) in arm neon
y = abs(y - x >> n); // vabdq_u16(y, x >> n) in arm neon
}
uint16_t scaled_magnitude_cordic(uint8_t x, uint8_t y) {
const int kRound = 1;
if (x < y) std::swap(x,y);
// multiply by factor of 256/sqrt(2) == 181.02
// then reduce by the gain of the cordic iterations of 1.16
// - with prescaling we also ensure, that the cordic iterations
// do not lose too much significant bits when shifting right
uint16_t X = x * 156, Y = y * 156;
// exactly 4 iterations. 3 is too little, 5 causes too much noise
for (int j = 1; j <= 4; j++) cordic_it(X,Y,j);
return (X+kRound) >> 8;
}
By altering kRound, one can tune up the results:
Histogram of real - approx: -1 0 1
kRound == 0 -> smaller code 1 46617 18918
kRound == 1 -> approx >= real 0 46378 19158
kRound == -73 -> balanced error 3695 58301 3540
When selecting kRound == 1
, one can fix all the results by
uint8_t fix_if_larger_by_one(uint8_t sqrt, uint8_t x, uint8_t y) {
auto P = (x*x + y*y) / 2;
auto Q = sqrt*sqrt;
return sqrt - (P < Q);
}
One can also calculate the square root by approximating the dot product of xa + yb for several angles, where the traditional approach is to use a single angle a = 1, b = 1/2
.
With 5 unique angles, for approximately the angles of [0 10 20 30 40]
or [5 15 25 35 45]
, one comes up with either set of coefficients, both of which produce near perfect result that are at most off by 1 unit.
1) [181 0], [178 31], [170 62], [157 91], [139 116]
2) [180 18], [175 46], [164 76], [148 104], [128 128]
Option 1 has 9 non-trivial coefficients, (although 62 == 31*2).
Option 2 has 8 non-trivial coefficients and which lends to the following implementation:
int approx(uint8_t x, uint8_t y) {
if (x < y) std::swap(x,y); // sort so that x >= y
auto a4 = (x + y) / 2; // vhaddq_u8(x,y) on Arm Neon
auto a0 = (x * 180 + y * 18) >> 8;
auto a1 = (x * 175 + y * 46) >> 8;
auto a2 = (x * 164 + y * 76) >> 8;
auto a3 = (x * 148 + y * 104) >> 8;
return max_of_five_elements(a0,a1,a2,a3,a4);
}
This set of mostly even coefficients converts quite nicely to SSSE3 instruction set with _mm_maddubs_epi16
and _mm_max_epu16
instrinsics: each dot product but a1
can be calculated readily with one instruction from interleaved x,y and interleaved coefficients. Naturally it makes more sense to calculate 16 adjacent approximations at the same time to combat the latencies and to not to waste any computations from _mm_packus_epi16
, sorting or averaging the uint8_t inputs.
auto a0 = _mm_maddubs_epi16(xy, coeffs0); // coeffs0 = 90 9 90 9 ...
auto a1 = _mm_maddubs_epi16(xy, coeffs1); // coeffs1 = 87 23 87 23 ...
auto a2 = _mm_maddubs_epi16(xy, coeffs2); // coeffs2 = 82 38 82 38 ...
auto a3 = _mm_maddubs_epi16(xy, coeffs3); // coeffs3 = 74 52 74 52 ...
auto a4 = _mm_maddubs_epi16(xy, coeffs4); // coeffs4 = 64 64 64 64 ...
a1 = _mm_add_epi16(a1, x_per_2); // LSB of the coefficient 87.5
// take the maximum, shift right by 7 and pack to uint8_t
a0 = _mm_max_epu16(a0, a1);
a0 = _mm_max_epu16(a0, a2);
a0 = _mm_max_epu16(a0, a3);
a0 = _mm_max_epu16(a0, a4);
a0 = _mm_srli_epi16(a0, 7);
a0 = _mm_packus_epi16(a0, a0);
Using just 8 coefficients is also suitable for ARM Neon implementation, which can now use 16-bit by 16-bit scalar multiplication, storing all the coefficients in just a single full width registers.
For perfect results the dot-product algorithm must be compensated to the other direction, as it may give values, that are just one element below the reference implementation of floor(sqrt((x*x+y*y)/2)
:
uint8_t fix_if_smaller_by_one(uint8_t sqrt, uint8_t x, uint8_t y) {
auto P = (x*x + y*y) / 2;
auto Q = (sqrt+1)*(sqrt+1);
return sqrt + (Q <= P);
}
Other approximation algorithms typically use either division, or scaling, which is difficult to vectorise in Intel before AVX2, due to lack of variable per-lane shift.
<x,y>
to some new vector<x1,0>
(or equivalently<0,y1>
.x1
(ory1
) gives the magnitude of the original vector, and CORDIC can be implemented without multiplies. I've never done it myself though, and have no idea how hard it is. – Wartowsqrt
- see answer below. – Crotchet