Picking good first estimates for Goldschmidt division

Asked 18/4, 2010 at 8:1 Answered 23/4, 2010 at 15:38

I'm calculating fixedpoint reciprocals in Q22.10 with Goldschmidt division for use in my software rasterizer on ARM.

This is done by just setting the numerator to 1, i.e the numerator becomes the scalar on the first iteration. To be honest, I'm kind of following the wikipedia algorithm blindly here. The article says that if the denominator is scaled in the half-open range (0.5, 1.0], a good first estimate can be based on the denominator alone: Let F be the estimated scalar and D be the denominator, then F = 2 - D.

But when doing this, I lose a lot of precision. Say if I want to find the reciprocal of 512.00002f. In order to scale the number down, I lose 10 bits of precision in the fraction part, which is shifted out. So, my questions are:

Is there a way to pick a better estimate which does not require normalization? Why? Why not? A mathematical proof of why this is or is not possible would be great.
Also, is it possible to pre-calculate the first estimates so the series converges faster? Right now, it converges after the 4th iteration on average. On ARM this is about ~50 cycles worst case, and that's not taking emulation of clz/bsr into account, nor memory lookups. If it's possible, I'd like to know if doing so increases the error, and by how much.

Here is my testcase. Note: The software implementation of clz on line 13 is from my post here. You can replace it with an intrinsic if you want. clz should return the number of leading zeros, and 32 for the value 0.

#include <stdio.h>
#include <stdint.h>

const unsigned int BASE = 22ULL;

static unsigned int divfp(unsigned int val, int* iter)
{
  /* Numerator, denominator, estimate scalar and previous denominator */
  unsigned long long N,D,F, DPREV;
  int bitpos;

  *iter = 1;
  D = val;
  /* Get the shift amount + is right-shift, - is left-shift. */
  bitpos = 31 - clz(val) - BASE;
  /* Normalize into the half-range (0.5, 1.0] */
  if(0 < bitpos)
    D >>= bitpos;
  else
    D <<= (-bitpos);

  /* (FNi / FDi) == (FN(i+1) / FD(i+1)) */
  /* F = 2 - D */
  F = (2ULL<<BASE) - D;
  /* N = F for the first iteration, because the numerator is simply 1.
     So don't waste a 64-bit UMULL on a multiply with 1 */
  N = F;
  D = ((unsigned long long)D*F)>>BASE;

  while(1){
    DPREV = D;
    F = (2<<(BASE)) - D;
    D = ((unsigned long long)D*F)>>BASE;
    /* Bail when we get the same value for two denominators in a row.
      This means that the error is too small to make any further progress. */
    if(D == DPREV)
      break;
    N = ((unsigned long long)N*F)>>BASE;
    *iter = *iter + 1;
  }
  if(0 < bitpos)
    N >>= bitpos;
  else
    N <<= (-bitpos);
  return N;
}


int main(int argc, char* argv[])
{
  double fv, fa;
  int iter;
  unsigned int D, result;

  sscanf(argv[1], "%lf", &fv);

  D = fv*(double)(1<<BASE);
  result = divfp(D, &iter); 

  fa = (double)result / (double)(1UL << BASE);
  printf("Value: %8.8lf 1/value: %8.8lf FP value: 0x%.8X\n", fv, fa, result);
  printf("iteration: %d\n",iter);

  return 0;
}

Stakeout answered 18/4, 2010 at 8:1 Comment(5)

Nominator? You mean numerator? :) – Ado 19/4, 2010 at 10:41

Whoops, I'll fix that right away, thanks. What a stupid error. – Stakeout 19/4, 2010 at 10:55

Is it beneficial to fix the number of iterations and unroll the while(1) loop? Just curious. – Yogh 19/4, 2010 at 17:8

phkahler: This C code was only made as a reference to understand the question. My implementation is written in ARM assembly, and yes, the loop is unrolled with just four iterations used. – Stakeout 19/4, 2010 at 18:21

I wouldn't mind seeing the ARM assembler as well. I see a lot of 64-bit math going on and that will eat registers and slow things down a bit in ways that C has a hard time expressing. – Alyshaalysia 22/4, 2010 at 18:23

I could not resist spending an hour on your problem...

This algorithm is described in section 5.5.2 of "Arithmetique des ordinateurs" by Jean-Michel Muller (in french). It is actually a special case of Newton iterations with 1 as starting point. The book gives a simple formulation of the algorithm to compute N/D, with D normalized in range [1/2,1[:

e = 1 - D
Q = N
repeat K times:
  Q = Q * (1+e)
  e = e*e

The number of correct bits doubles at each iteration. In the case of 32 bits, 4 iterations will be enough. You can also iterate until e becomes too small to modify Q.

Normalization is used because it provides the max number of significant bits in the result. It is also easier to compute the error and number of iterations needed when the inputs are in a known range.

Once your input value is normalized, you don't need to bother with the value of BASE until you have the inverse. You simply have a 32-bit number X normalized in range 0x80000000 to 0xFFFFFFFF, and compute an approximation of Y=2^64/X (Y is at most 2^33).

This simplified algorithm may be implemented for your Q22.10 representation as follows:

// Fixed point inversion
// EB Apr 2010

#include <math.h>
#include <stdio.h>

// Number X is represented by integer I: X = I/2^BASE.
// We have (32-BASE) bits in integral part, and BASE bits in fractional part
#define BASE 22
typedef unsigned int uint32;
typedef unsigned long long int uint64;

// Convert FP to/from double (debug)
double toDouble(uint32 fp) { return fp/(double)(1<<BASE); }
uint32 toFP(double x) { return (int)floor(0.5+x*(1<<BASE)); }

// Return inverse of FP
uint32 inverse(uint32 fp)
{
  if (fp == 0) return (uint32)-1; // invalid

  // Shift FP to have the most significant bit set
  int shl = 0; // normalization shift
  uint32 nfp = fp; // normalized FP
  while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead

  uint64 q = 0x100000000ULL; // 2^32
  uint64 e = 0x100000000ULL - (uint64)nfp; // 2^32-NFP
  int i;
  for (i=0;i<4;i++) // iterate
    {
      // Both multiplications are actually
      // 32x32 bits truncated to the 32 high bits
      q += (q*e)>>(uint64)32;
      e = (e*e)>>(uint64)32;
      printf("Q=0x%llx E=0x%llx\n",q,e);
    }
  // Here, (Q/2^32) is the inverse of (NFP/2^32).
  // We have 2^31<=NFP<2^32 and 2^32<Q<=2^33
  return (uint32)(q>>(64-2*BASE-shl));
}

int main()
{
  double x = 1.234567;
  uint32 xx = toFP(x);
  uint32 yy = inverse(xx);
  double y = toDouble(yy);

  printf("X=%f Y=%f X*Y=%f\n",x,y,x*y);
  printf("XX=0x%08x YY=0x%08x XX*YY=0x%016llx\n",xx,yy,(uint64)xx*(uint64)yy);
}

As noted in the code, the multiplications are not full 32x32->64 bits. E will become smaller and smaller and fits initially on 32 bits. Q will always be on 34 bits. We take only the high 32 bits of the products.

The derivation of 64-2*BASE-shl is left as an exercise for the reader :-). If it becomes 0 or negative, the result is not representable (the input value is too small).

EDIT. As a follow-up to my comment, here is a second version with an implicit 32-th bit on Q. Both E and Q are now stored on 32 bits:

uint32 inverse2(uint32 fp)
{
  if (fp == 0) return (uint32)-1; // invalid

  // Shift FP to have the most significant bit set
  int shl = 0; // normalization shift for FP
  uint32 nfp = fp; // normalized FP
  while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead
  int shr = 64-2*BASE-shl; // normalization shift for Q
  if (shr <= 0) return (uint32)-1; // overflow

  uint64 e = 1 + (0xFFFFFFFF ^ nfp); // 2^32-NFP, max value is 2^31
  uint64 q = e; // 2^32 implicit bit, and implicit first iteration
  int i;
  for (i=0;i<3;i++) // iterate
    {
      e = (e*e)>>(uint64)32;
      q += e + ((q*e)>>(uint64)32);
    }
  return (uint32)(q>>shr) + (1<<(32-shr)); // insert implicit bit
}

Lycanthrope answered 23/4, 2010 at 15:38 Comment(3)

But in any event, because of the estimate 2 - D one have to reserve two bits, right? And there is no way to get a smaller estimate than ~1.111111111 ? – Stakeout 23/4, 2010 at 15:55

Here's your well-deserved bounty by the way. – Stakeout 23/4, 2010 at 15:55

Thanks. All values of E fit on 32 bits. You don't need extra bits to store E. Q starts at 2^32, and may actually reach 2^33 in the case where NFP is 0x80000000. Since this case can be handled separately, Q will always fit on 33 bits, with bit 32 always set. So you can actually represent Q on 32 bits with an implicit bit. I forgot to add that 64-2*BASE-shl can be tested before iterating. Another point: when the initial FP uses a small number of bits, you probably don't need 32 significant bits (but with no meaning) in Q. – Lycanthrope 23/4, 2010 at 16:8

A couple of ideas for you, though none that solve your problem directly as stated.

Why this algo for division? Most divides I've seen in ARM use some varient of


      adcs hi, den, hi, lsl #1
      subcc hi, hi, den
      adcs lo, lo, lo

repeated n bits times with a binary search off of the clz to determine where to start. That's pretty dang fast.

If precision is a big problem, you are not limited to 32/64 bits for your fixed point representation. It'll be a bit slower, but you can do add/adc or sub/sbc to move values across registers. mul/mla are also designed for this kind of work.

Again, not direct answers for you, but possibly a few ideas to go forward this. Seeing the actual ARM code would probably help me a bit as well.

Alyshaalysia answered 22/4, 2010 at 18:37 Comment(3)

Hm, binary search might be worth pursuing, as 32x32=64 multiplications on ARM really kills performance. On ARM9/ARMv4T they take 3+M cycles, which is 7 cycles worst case. – Stakeout 22/4, 2010 at 19:22

What is the reason for this particular method of division? Just curious because it's kinda awkward to do on the code side. – Alyshaalysia 22/4, 2010 at 21:26

FYI here is a link to a fast ARM divider that I've used in past projects: peter-teichmann.de/adiv1e.html – Alyshaalysia 22/4, 2010 at 21:29

Mads, you are not losing any precision at all. When you divide 512.00002f by 2^10, you merely decrease the exponent of your floating point number by 10. Mantissa remains the same. Of course unless the exponent hits its minimum value but that shouldn't happen since you're scaling to (0.5, 1].

EDIT: Ok so you're using a fixed decimal point. In that case you should allow a different representation of the denominator in your algorithm. The value of D is from (0.5, 1] not only at the beginning but throughout the whole calculation (it's easy to prove that x * (2-x) < 1 for x < 1). So you should represent the denominator with decimal point at base = 32. This way you will have 32 bits of precision all the time.

EDIT: To implement this you'll have to change the following lines of your code:

  //bitpos = 31 - clz(val) - BASE;
  bitpos = 31 - clz(val) - 31;
...
  //F = (2ULL<<BASE) - D;
  //N = F;
  //D = ((unsigned long long)D*F)>>BASE;
  F = -D;
  N = F >> (31 - BASE);
  D = ((unsigned long long)D*F)>>31;
...
    //F = (2<<(BASE)) - D;
    //D = ((unsigned long long)D*F)>>BASE;
    F = -D;
    D = ((unsigned long long)D*F)>>31;
...
    //N = ((unsigned long long)N*F)>>BASE;
    N = ((unsigned long long)N*F)>>31;

Also in the end you'll have to shift N not by bitpos but some different value which I'm too lazy to figure out right now :).

Normally answered 22/4, 2010 at 10:16 Comment(6)

I think you misunderstood something here. 512.00002 was just an example to visualize the precision. I'm not using floats at all. And as you see from my code the normalization is done by shifting so that the highest set bit equals the what's "1" in the base. Please read my question (and code) again :) – Stakeout 22/4, 2010 at 12:10

Removed the downvote, but doesn't really answer my question. Yes, you are right about using a high base, but you forget that one needs just as much precision for the integer part as the real part. Or else taking the reciprocal of big or small numbers won't work. – Stakeout 22/4, 2010 at 16:41

Furthermore, picking (1<<32) as the base doesn't work, because multiplying (1<<33) with (1<<33) would overflow a 64-bit multiply. You get such values when doing (2<<32) - D. You also need x bits for the integer part, or else it gets shifted out. Since I want a sign and a range of 8192, my final representation is Q13.16. The three "missing bits is due to the sign, and the representation of "2" when the value is normalized. So my question still stands. I want to make 2 smaller, so I don't lose bits, or get rid of this normalization. It's the same thing, depending on your point of view. – Stakeout 22/4, 2010 at 16:53

What integer part are you talking about? Variable D will never ever have any integer part, it's always lower than one. So you can use all 32 bits just for the real part. It might be a better idea to only use 31 bits so that you can do 2 - D without any additional shifting. Remember that you only want to use this representation for D and F, NOT for N (which will still be Q22.10 or whatever else you like). – Normally 22/4, 2010 at 19:12

I mean the integer part before the normalization happens, as well as the final value when you shift the value back to its original base. – Stakeout 22/4, 2010 at 19:20

As I've already said, you would only use this representation for the intermediate results (variables D and F). Perhaps I should post the complete modified code? – Normally 22/4, 2010 at 20:48

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags