Approximating the square root of sum of two squares on a microcontroller

I'm working on implementing an FFT algorithm in assembly on an 8-bit microcontroller (HCS08) for fun. Once the algorithm is completed, I'll have an array of 8-bit real/imaginary pairs, and I'll want to find the magnitude of each of these values. That is, if x is complex, I want to find

|x| = sqrt(Re{x}^2 + Im{x}^2)  

Now I have available to me a 16-bit register and an 8-bit register. I thought about just squaring them, adding them, and taking the square root of the result, but that poses a problem: the maximum possible value of the sum of the squares of two 8-bit numbers is ~130k, which is greater than the maximum value a 16-bit register can hold (65.5k).

I came up with a subroutine which computes the integer square root of a 16-bit number, which seems to work well, but obviously I'm not guaranteed to be working with values that will fit into 16 bits. My thinking right now is that there is an algorithm which will approximate what I need directly, but I can't seem to find anything. Any ideas would be greatly appreciated.

To summarize: Say I have a vector with two 8-bit components, and I want to find the length of the vector. How can I approximate this without actually computing squares and square roots?

Thanks!

Counterirritant answered 3/4, 2011 at 5:30 Comment(5)
The CORDIC algorithm (en.wikipedia.org/wiki/CORDIC) can be used to rotate a vector <x,y> to a new vector <x1,0> (or equivalently <0,y1>). x1 (or y1) gives the magnitude of the original vector, and CORDIC can be implemented without multiplies. I've never done it myself though, and have no idea how hard it is.Wartow
Is this for audio, by any chance ? Are you going to be calculating log10 afterwards, in order to get a dB value ?Crotchet
Depends on the purpose: if you need the length, there is no way around computing it, but if you really just need a norm (which is usually the length), you could use another norm instead of the default L2 norm, e.g. the Manhattan distance (= |real| + |imag|).Mcilwain
@Paul R: Yes, this is for an audio project I'm working on. The hardware I'm interfacing with is expecting a linear voltage, and will convert it to a log scale.Counterirritant
@user599599: OK in that case you can probably get rid of the sqrt - see answer below.Crotchet

If the sum is greater than 65535, divide it by 4 (shift right 2 bits), take the square root, and multiply that by 2. You'll lose one bit of accuracy, and naturally the result is not guaranteed to fit into 8 bits.

Lilililia answered 3/4, 2011 at 5:36 Comment(2)
Thanks for the response. My only concern is if the sum is greater than 65535, it will overflow and I'll have no way of knowing. (I only have a 16-bit register, so adding two 16-bit numbers may yield unpredictable results.) I think I can accomplish the same thing by initially dividing Re{x} and Im{x} by 2, then multiplying the final answer by 2; does this sound equivalent to what you suggested?Counterirritant
You've already accepted this answer so I guess you figured out: divide the inputs by 4, and multiply the output by 2.Lilililia
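
A minimal sketch of this scaling workaround (my code, not the answerer's): halve each component so the sum of squares always fits in 16 bits, take the integer square root, and double the result, losing one bit of accuracy. The bit-by-bit isqrt16 below merely stands in for the asker's existing 16-bit integer square root subroutine.

#include <cstdint>

// Bit-by-bit integer square root of a 16-bit value; a stand-in for the
// asker's existing subroutine.
uint8_t isqrt16(uint16_t v) {
    uint8_t r = 0;
    for (int bit = 7; bit >= 0; --bit) {
        uint8_t t = r | uint8_t(1u << bit);
        if (uint16_t(t) * t <= v) r = t;   // keep the bit if the square still fits
    }
    return r;
}

// Halve each component so x*x + y*y <= 2 * 127^2 = 32258, which fits in
// 16 bits; doubling the root afterwards restores the scale (the result can
// reach ~360, hence the 16-bit return type).
uint16_t magnitude_halved(uint8_t re, uint8_t im) {
    uint8_t rh = re >> 1, ih = im >> 1;
    uint16_t sum = uint16_t(rh) * rh + uint16_t(ih) * ih;
    return uint16_t(isqrt16(sum)) << 1;
}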

There's a web page describing a Fast Magnitude Estimator. The basic idea is to get a least square (or other high quality) fit to the equation:

Mag ~= Alpha * max(|I|, |Q|) + Beta * min(|I|, |Q|)

for the coefficients Alpha and Beta. Several coefficient pairs are listed with mean square errors, max errors, etc., including coefficients suitable for integer ALUs.
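
As a rough sketch of the idea (mine, not taken from the linked page), using one widely quoted shift-friendly pair, Alpha = 15/16 and Beta = 15/32; the page tabulates other pairs, including the 61/64 family mentioned in the comment below, together with their error statistics.

#include <cstdint>
#include <cstdlib>

// Alpha*max + Beta*min with Alpha = 15/16 and Beta = 15/32, implemented
// with shifts, adds and subtracts only.
uint16_t mag_estimate(int8_t re, int8_t im) {
    uint16_t a = uint16_t(std::abs(int(re)));
    uint16_t b = uint16_t(std::abs(int(im)));
    uint16_t mx = a > b ? a : b;
    uint16_t mn = a > b ? b : a;
    return uint16_t(mx - (mx >> 4) + (mn >> 1) - (mn >> 5));  // 15/16*mx + 15/32*mn
}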

Jubbulpore answered 3/4, 2011 at 7:53 Comment(1)
Looks like one of the 61/64 choices would be good for this application.Lilililia

A possible alternative is to compute sqrt((x*x + y*y)/2) instead, which scales all the possible vector magnitudes into the range 0..255.

Two fast algorithms seem to give near-perfect results: one based on CORDIC, the other on taking the maximum of several dot products.

void cordic_it(uint16_t &x, uint16_t &y, int n) {
    uint16_t X = x + (y >> n);           // vsraq_n_u16(x, y, n)  in Arm Neon
    y = std::abs(int(y) - int(x >> n));  // vabdq_u16(y, x >> n)  in Arm Neon
    x = X;                               // write the rotated x back
}

uint16_t scaled_magnitude_cordic(uint8_t x, uint8_t y) {
     const int kRound = 1;
     if (x < y) std::swap(x,y);
     // multiply by factor of 256/sqrt(2) == 181.02
     // then reduce by the gain of the cordic iterations of 1.16
     // - with prescaling we also ensure, that the cordic iterations
     //   do not lose too much significant bits when shifting right
     uint16_t X = x * 156, Y = y * 156;
     // exactly 4 iterations. 3 is too little, 5 causes too much noise
     for (int j = 1; j <= 4; j++)  cordic_it(X,Y,j);
     return (X+kRound) >> 8;
}

By altering kRound, one can tune up the results:

     Histogram of (real - approx):    -1      0      1
kRound == 0   -> smaller code          1   46617  18918
kRound == 1   -> approx >= real        0   46378  19158
kRound == -73 -> balanced error     3695   58301   3540
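
For what it's worth, a tally like the one above can be reproduced with a small exhaustive check, sketched below under the assumption that scaled_magnitude_cordic from above is in scope and that the reference is floor(sqrt((x*x + y*y)/2)) with integer division of the sum:

#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    int hist[3] = {0, 0, 0};   // counts of (real - approx) in {-1, 0, +1}
    for (int x = 0; x < 256; ++x)
        for (int y = 0; y < 256; ++y) {
            int ref = int(std::sqrt(double((x * x + y * y) / 2)));
            int diff = ref - int(scaled_magnitude_cordic(uint8_t(x), uint8_t(y)));
            if (diff >= -1 && diff <= 1) ++hist[diff + 1];
        }
    std::printf("-1: %d   0: %d   +1: %d\n", hist[0], hist[1], hist[2]);
}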

When selecting kRound == 1, one can then fix up all the results with:

uint8_t fix_if_larger_by_one(uint8_t sqrt, uint8_t x, uint8_t y) {
    auto P = (x*x + y*y) / 2;
    auto Q = sqrt*sqrt;
    return sqrt - (P < Q);
}

One can also approximate the square root as the maximum of dot products x*a + y*b evaluated for several angles, where each coefficient pair (a, b) corresponds to a different angle; the traditional approach uses a single angle with a = 1, b = 1/2.

With 5 unique angles, approximately [0 10 20 30 40] or [5 15 25 35 45] degrees, one arrives at either of the following coefficient sets, both of which produce near-perfect results that are off by at most one unit.

1) [181 0],  [178 31], [170 62], [157 91], [139 116]
2) [180 18], [175 46], [164 76], [148 104], [128 128]

Option 1 has 9 non-trivial coefficients (although 62 == 31*2). Option 2 has 8 non-trivial coefficients and lends itself to the following implementation:

int approx(uint8_t x, uint8_t y) {
     if (x < y) std::swap(x,y);  // sort so that x >= y
     auto a4 = (x + y) / 2;      // vhaddq_u8(x,y) on Arm Neon
     auto a0 = (x * 180 + y * 18) >> 8;
     auto a1 = (x * 175 + y * 46) >> 8;
     auto a2 = (x * 164 + y * 76) >> 8;
     auto a3 = (x * 148 + y * 104) >> 8;
     return std::max({a0, a1, a2, a3, a4});   // i.e. "max_of_five_elements"; needs <algorithm>
}

This set of mostly even coefficients converts quite nicely to SSE with the _mm_maddubs_epi16 (SSSE3) and _mm_max_epu16 (SSE4.1) intrinsics: every dot product except a1 can be calculated with a single instruction from interleaved x,y and interleaved coefficients. Naturally it makes more sense to calculate 16 adjacent approximations at a time, to hide latencies and to avoid wasting any of the work done by _mm_packus_epi16, or by sorting and averaging the uint8_t inputs.

auto a0 = _mm_maddubs_epi16(xy, coeffs0); // coeffs0 = 90 9 90 9 ...
auto a1 = _mm_maddubs_epi16(xy, coeffs1); // coeffs1 = 87 23 87 23 ...
auto a2 = _mm_maddubs_epi16(xy, coeffs2); // coeffs2 = 82 38 82 38 ...
auto a3 = _mm_maddubs_epi16(xy, coeffs3); // coeffs3 = 74 52 74 52 ...
auto a4 = _mm_maddubs_epi16(xy, coeffs4); // coeffs4 = 64 64 64 64 ...
a1 = _mm_add_epi16(a1, x_per_2); // LSB of the coefficient 87.5

// take the maximum, shift right by 7 and pack to uint8_t
a0 = _mm_max_epu16(a0, a1);
a0 = _mm_max_epu16(a0, a2);
a0 = _mm_max_epu16(a0, a3);
a0 = _mm_max_epu16(a0, a4);
a0 = _mm_srli_epi16(a0, 7);
a0 = _mm_packus_epi16(a0, a0);

Using just 8 coefficients is also suitable for an Arm Neon implementation, which can then use 16-bit by 16-bit multiplication by a scalar, storing all the coefficients in a single full-width register.

For perfect results the dot-product algorithm must be compensated in the other direction, as it may give values that are one unit below the reference implementation floor(sqrt((x*x + y*y)/2)):

uint8_t fix_if_smaller_by_one(uint8_t sqrt, uint8_t x, uint8_t y) {
    auto P = (x*x + y*y) / 2;
    auto Q = (sqrt+1)*(sqrt+1);
    return sqrt + (Q <= P);
}

Other approximation algorithms typically use either division or scaling, which is difficult to vectorise on Intel before AVX2 due to the lack of variable per-lane shifts.

Skein answered 10/10, 2021 at 8:12 Comment(0)

Well, you can write x in polar form:

x = r[cos(w) + i sin(w)]

where w = arctan(Im(x)/Re(x)), so

|x| = r = Re(x)/cos(w)

No big numbers here but maybe you'll lose precision on trigonometric functions (that is, if you have access to trigonometric functions :-/ )
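
As a quick worked example (mine, not part of the answer): for x = 3 + 4i, w = arctan(4/3) ≈ 53.13°, cos(w) = 0.6, so |x| = 3/0.6 = 5, as expected.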

Decrescent answered 3/4, 2011 at 6:14 Comment(1)
Hmm, interesting idea. Unfortunately I don't have access to trig functions, and there isn't floating point support for the microcontroller anyway so I'm pretty much limited to basic integer operations. I am planning on having a trig lookup table though, so I'll keep that in mind.Counterirritant

A cheap and dirty method that may or may not be suitable is to use

|x| ~ max(|Re{x}|,|Im{x}|) + min(|Re{x}|,|Im{x}|)/2;

This will tend to overestimate |x| by somewhere between 0 and 12%.
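
A minimal sketch of this estimate (mine, not the answerer's), taking the components as signed 8-bit values:

#include <cstdint>
#include <cstdlib>

// max(|re|, |im|) + min(|re|, |im|)/2 -- overestimates |x| by at most ~12%.
uint16_t mag_rough(int8_t re, int8_t im) {
    uint16_t a = uint16_t(std::abs(int(re)));
    uint16_t b = uint16_t(std::abs(int(im)));
    return a > b ? uint16_t(a + (b >> 1)) : uint16_t(b + (a >> 1));
}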

Cranston answered 3/4, 2011 at 11:34 Comment(0)

If you're subsequently going to be converting the magnitude to dB, then you can dispense with the sqrt operation altogether. That is, if your calculation is:

magnitude = sqrt(re*re+im*im); // calculate magnitude of complex FFT output value
magnitude_dB = 20*log10(magnitude); // convert magnitude to dB

you can rewrite this as:

magnitude_sq = re*re+im*im; // calculate squared magnitude of complex FFT output value
magnitude_dB = 10*log10(magnitude_sq);  // convert squared magnitude to dB
Crotchet answered 4/4, 2011 at 6:41 Comment(2)
Good point, but my problem is that log10 is also a computationally expensive operation. I'd still have the problem of finding the closest integer or using a lookup table.Counterirritant
@user599599: yes, you still have the log, but previously you had sqrt + log and now you just have log.Crotchet

You might be limited to just two registers, but you could look at the code at http://www.realitypixels.com/turk/opensource/index.html (Fixed-Point Square Root and Fixed-Point Trigonometric Functions using CORDIC).

Coach answered 8/5, 2016 at 3:13 Comment(0)
