Efficient floating point comparison (Cortex-A8)
Asked Answered
G

4

6

There is a big (~100 000) array of floating point variables, and there is a threshold (also floating point).

The problem is that I have to compare each one variable from the array with a threshold, but NEON flags transfer takes a really long time (~20 cycles in accordance to a profiler).

Is there any efficient way to compare these values?

NOTE: As rounding error doesn't matter, I tried the following:

float arr[10000];
float threshold; 
....

int a = arr[20]; // e.g.
int t = threshold;
if (t > a) {....}

But in this case I getting the following processor command sequence:

vldr.32        s0, [r0]
vcvt.s32.f32   s0, s0
vmov           r0, s0    <--- takes 20 cycles as `vmrs APSR_nzcv, fpscr` in case of 
cmp            r0, r1         floating point comparison

As conversion happens at NEON, there is no matter if I compare integers, by described way or floats.

Gwenngwenneth answered 30/4, 2012 at 10:12 Comment(7)
People over at codereview.stackexchange.com might have to say something to this too.Sensuality
Your code is inconsistent with your problem statement - the data is float but you show threshold as int - you also cast each float data value to int - why ? If you data is float then the threshold should be float and you should do a float comparison (i.e. no int-float conversions). Also, what do you plan to do with values that are greater than (or less than) threshold (this will determine whether NEON is appropriate or not) ?Linneman
Many people ditch NEON for being slower than ARM without knowing what to avoid and how to program SIMD properly. Depending on what you exactly want, it's either not SIMD feasible to start with, or you don't know how to handle if-else with NEON.Reduction
if-else on NEON : 1. create a mask value with a VCnn instruction. 2. calculate results for both cases (if, else) in two different registers. 3. combine both results with a VBnn instruction using the mask value from 1.Reduction
Since you always have to calculate both cases on NEON regardless of the condition met, it might be slower than on ARM, if the calculations are very complex. If there are multiple associated if-elses, forget NEON.Reduction
@Jake'Alquimista'LEE unfortunately, I can't forget about NEON as I have to use it for all floating point calculations.Gwenngwenneth
NEON->ARM register transfers always suffer 11~14 cycles. This should be avoided at all costs within loops. And why those type castings when you can compare floats very fast even without rounding errors? I'll show you how in a separate answer.Reduction
W
5

If floats are 32-bit IEEE-754 and ints are 32-bit too and if there are no +infinity, -infinity and NaN values, we can compare floats as ints with a little trick:

#include <stdio.h>
#include <limits.h>
#include <assert.h>

#define C_ASSERT(expr) extern char CAssertExtern[(expr)?1:-1]
C_ASSERT(sizeof(int) == sizeof(float));
C_ASSERT(sizeof(int) * CHAR_BIT == 32);

int isGreater(float* f1, float* f2)
{
  int i1, i2, t1, t2;

  i1 = *(int*)f1;
  i2 = *(int*)f2;

  t1 = i1 >> 31;
  i1 = (i1 ^ t1) + (t1 & 0x80000001);

  t2 = i2 >> 31;
  i2 = (i2 ^ t2) + (t2 & 0x80000001);

  return i1 > i2;
}

int main(void)
{
  float arr[9] = { -3, -2, -1.5, -1, 0, 1, 1.5, 2, 3 };
  float thr;
  int i;

  // Make sure floats are 32-bit IEE754 and
  // reinterpreted as integers as we want/expect
  {
    static const float testf = 8873283.0f;
    unsigned testi = *(unsigned*)&testf;
    assert(testi == 0x4B076543);
  }

  thr = -1.5;
  for (i = 0; i < 9; i++)
  {
    printf("%f %s %f\n", arr[i], "<=\0> " + 3*isGreater(&arr[i], &thr), thr);
  }

  thr = 1.5;
  for (i = 0; i < 9; i++)
  {
    printf("%f %s %f\n", arr[i], "<=\0> " + 3*isGreater(&arr[i], &thr), thr);
  }

  return 0;
}

Output:

-3.000000 <= -1.500000
-2.000000 <= -1.500000
-1.500000 <= -1.500000
-1.000000 >  -1.500000
0.000000 >  -1.500000
1.000000 >  -1.500000
1.500000 >  -1.500000
2.000000 >  -1.500000
3.000000 >  -1.500000
-3.000000 <= 1.500000
-2.000000 <= 1.500000
-1.500000 <= 1.500000
-1.000000 <= 1.500000
0.000000 <= 1.500000
1.000000 <= 1.500000
1.500000 <= 1.500000
2.000000 >  1.500000
3.000000 >  1.500000

Of course, it makes sense to precalculate that final integer value in isGreater() that's used in the comparison operator if your threshold doesn't change.

If you are afraid of undefined behavior in C/C++ in the above code, you can rewrite the code in assembly.

Why answered 30/4, 2012 at 11:24 Comment(3)
Seems to be great idea. I still facing the issue with vmov.32 but mainly it's a good idea. Thanks.Gwenngwenneth
@vasile: What bug are you talking about? What is complex? If it is, how do you make it simpler?Why
I was referring to @Paul-R answer.Amberly
L
2

If your data is float then you should do your comparisons with floats, e.g.

float arr[10000];
float threshold;
....

float a = arr[20]; // e.g.
if (threshold > a) {....}

otherwise you will have expensive float-int conversions.

Linneman answered 30/4, 2012 at 10:31 Comment(6)
If I make a comparison between 2 floats it causes expensive flag-register transfer. That's why I tried to make a comparison between 2 integers.Gwenngwenneth
What operations are you subsequently performing when the threshold test is true/false ?Linneman
vcmpe.f32 s17, s16 vmrs APSR_nzcv, fpscr If I got your question.Gwenngwenneth
Alex: what I meant was: what happens in the { ... } after your threshold test ? This will determine whether using NEON for the test is appropriate or not.Linneman
There is a lot of code after test, that's why I marked it as ....Gwenngwenneth
Well if there's a lot of code then the cost of the test should be negligible (unless the overwhelming majority of data points do not need processing ?).Linneman
H
2

Your example shows how bad compiler-generated codes can be :

It loads a value with NEON just to convert it to int, then does a NEON->ARM transfer that causes a pipeline flush resulting in 11~14 cycles wasted.

The best solution would be writing the function completely in hand assembly.

However, there is a simple trick that allows fast float comparisons without typecasting AND truncation:

Threshold positive (exactly as fast as int comparison) :

void example(float * pSrc, float threshold, unsigned int count)
{
  typedef union {
    int ival,
    unsigned int uval,
    float fval
  } unitype;

  unitype v, t;
  if (count==0) return;
  t.fval = threshold;
  do {
    v.fval = *pSrc++;
    if (v.ival < t.ival) {
      // your code here
    }
    else {
      // your code here (optional)
    }
  } while (--count);
}

Threshold negative (1 cycle more per value than int comparison):

void example(float * pSrc, float threshold, unsigned int count)
{
  typedef union {
    int ival,
    unsigned int uval,
    float fval
  } unitype;

  unitype v, t, temp;
  if (count==0) return;
  t.fval = threshold;
  t.uval &= 0x7fffffff;
  do {
    v.fval = *pSrc++;
    temp.uval = v.uval ^ 0x80000000;
    if (temp.ival >= t.ival) {
      // your code here
    }
    else {
      // your code here (optional)
    }
  } while (--count);
}

I think it to be quite a lot faster than the accepted one above. Again, I'm a bit too late.

Harbard answered 2/5, 2012 at 7:49 Comment(0)
M
0

If the rounding errors do not matter, then you should use std::lrint.

The Faster Floating Point to Integer Conversions recommends to use it for float to int conversion.

Meliorate answered 30/4, 2012 at 10:30 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.