How to convert an unsigned int to a float?

Asked 22/10, 2013 at 22:14 Answered 23/10, 2013 at 2:41

c binary type-conversion bit unsigned-integer

I need to build a function that returns the bit-level equivalent of (float)x without using any floating data types, operations or constants. I think I have it, but when I run the test file, it returns that there's an infinite loop. Any debugging help would be appreciated.

I'm allowed to use any integer/unsigned operations including ||, &&, if, while. Also, I can only use 30 operations.

unsigned float_i2f(int x) {
    printf("\n%i", x);
    if (!x) {return x;}
    int mask1 = (x >> 31);
    int mask2 = (1 << 31);
    int sign = x & mask2;
    int complement = ~x + 1;
    //int abs = (~mask1 & x) + (mask1 & complement);
    int abs = x;
    int i = 0, temp = 0;
    while (!(temp & mask2)){
        temp = (abs <<i);
        i = i + 1;
    }
    int E = 32 - i;
    int exp = 127 + E;
    abs = abs & (-1 ^ (1 << E));
    int frac;
    if ((23 - E)>0)
        frac = (abs << (23 - E));
    else
        frac = (abs >> (E - 23));
    int rep = sign + (exp << 23) + frac;
    return rep;
}

In response to the very helpful comments and answers, here is the updated code, now only failing for 0x80000000:

unsigned float_i2f(int x) {
    int sign;
    int absX;
    int E = -1;
    int shift;
    int exp;
    int frac;
    // zero is the same in int and float:
    if (!x) {return x;}

    // sign is bit 31: that bit should just be transferred to the float:
    sign = x & 0x80000000;

    // if number is < 0, take two's complement:
    if (sign != 0) {
        absX = ~x + 1;
    }
    else
        absX = x;

    shift = absX;
    while ((!!shift) && (shift != -1)) {
        //std::cout << std::bitset<32>(shift) << "\n";
        E++;
        shift = (shift >> 1);
    }
    if (E == 30) { E++;}
    exp = E + 127+24;
    exp = (exp << 23);
    frac = (absX << (23 - E)) & 0x007FFFFF;
    return sign + exp + frac;
}

Anyone have any idea where the bug is in the revised code? Thank you all again!

Anticipative answered 22/10, 2013 at 22:14 Comment(4)

What do you mean by the "bit level equivalent". Could you give a couple of examples - "if input is this, I expect output to be that". Also - what is the evidence you have an infinite loop, and did you try printing out values inside that loop to figure out what is going on? – Silencer 22/10, 2013 at 22:36

I tried to clarify the question for the original poster (pending edit approval). – Recital 22/10, 2013 at 22:49

@Recital - thanks for the clarification. I didn't approve the edit since I'm not sure this is indeed what the OP wanted, but I based my answer on the assumption you are right... – Silencer 22/10, 2013 at 23:8

duplicates: How to manually (bitwise) perform (float)x?, Converting Int to Float or Float to Int using Bitwise operations (software floating point), Casting float to int (bitwise) in C – Bathurst 15/5, 2019 at 4:15
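To make "bit-level equivalent" concrete, here is a minimal reference sketch. It assumes a 32-bit int and that float is IEEE-754 binary32 with the default round-to-nearest mode. It uses the very cast the assignment forbids, so it is only useful as a test oracle to compare a hand-written conversion against. The name reference_bits is made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical test oracle: returns the raw bits of (float)x.
   Assumes float is IEEE-754 binary32 (not guaranteed by the C standard). */
uint32_t reference_bits(int32_t x) {
    float f = (float)x;          /* the conversion the assignment asks us to emulate */
    uint32_t u;
    memcpy(&u, &f, sizeof u);    /* copy the object representation without aliasing issues */
    return u;
}
```

Under those assumptions, 1 maps to 0x3F800000, -1 to 0xBF800000 and 2 to 0x40000000: sign in bit 31, biased exponent in bits 30..23, fraction in bits 22..0.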

There is quite a lot you can do to improve your code and clean it up. For starters, add comments! Secondly (and to reduce the number of operations), you can combine certain things. Thirdly - differentiate between "integers that can be represented exactly" and "those that cannot".

Here is some sample code to put some of these things into practice; I could not actually compile and test this, so it's possible there are some bugs - I am trying to show an approach, not do your assignment for you...

unsigned float_i2f(int x) {
// convert integer to its bit-equivalent floating point representation
// but return it as an unsigned integer
// format: 
// 1 sign bit
// 8 exponent bits
// 23 mantissa bits (plus the implicit 'most significant bit', which is always 1)
printf("\n%i", x);

// zero is the same in int and float:
if (x == 0) {return x;}

// sign is bit 31: that bit should just be transferred to the float:
sign = x & 0x8000;

// if number is < 0, take two's complement:
int absX;
if(sign != 0) { 
  absX = ~x + 1;
}
else {
  absX = x;
}

// Take at most 24 bits:
unsigned int bits23 = 0xFF800000;
unsigned int bits24 = 0xFF000000;
unsigned E = 127-24;  // could be off by 1

// shift right if there are bits above bit 24:
while(absX & bits24) {
  E++;   // check that you add and don't subtract...
  absX >>= 1;
}
// shift left if there are no bits above bit 23:
// check that it terminates at the right point.
while (!(absX & bits23)) {
  E--;   // check direction
  absX <<= 1;
}

// now put the numbers we have together in the return value:
// check that they are truncated correctly
return sign | (E << 23) | (absX & ~bits23);

}

Silencer answered 22/10, 2013 at 23:6 Comment(7)

(The following presumes 32-bit int.) sign = x & 0x8000 should be sign = x & 0x80000000. absX = ~x + 1 overflows int when x is -2147483648. Even if no trap occurs, the later shifts of absX are troublesome, since the sign bit remains set. The shifts to limit the significand truncate but rounding is usually preferred. I have not checked for other bugs. – Numeration 23/10, 2013 at 1:22
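One way to sidestep the overflow mentioned in this comment is to do the negation in unsigned arithmetic, where wraparound is well defined. This is only a sketch, assuming a 32-bit two's complement int; the helper name magnitude is made up for the illustration.

```c
#include <stdint.h>

/* Hypothetical helper: absolute value of x as an unsigned number.
   Negating in unsigned arithmetic wraps modulo 2^32, so the most
   negative input does not overflow the way ~x + 1 on an int would. */
uint32_t magnitude(int32_t x) {
    uint32_t ux = (uint32_t)x;       /* conversion to unsigned is well defined */
    return (x < 0) ? 0u - ux : ux;   /* 0 - ux is the two's complement negation */
}
```

With an unsigned result, later right shifts are logical shifts, so the set top bit for -2147483648 no longer drags in copies of the sign bit.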

Suggest E = 127+24-1, unsigned sign = x & 0x80000000; – Influential 23/10, 2013 at 1:34

@Floris, I reworked all of my code roughly following your guidelines. I really appreciate your help. It's different now, but is just failing for the value 0x80800000. How do I post my new code? Should I post it as a new question? Or can I edit my old one? – Anticipative 23/10, 2013 at 1:49

I see you figured out the answer to your last comment by yourself... This deserves more of my attention but I cannot give it tonight. Will take a look in the morning if you are still struggling. In the meantime - print lots of debug statements (hexadecimal suggested) in your loops to see if things are doing what you expect. You are clearly getting close... – Silencer 23/10, 2013 at 2:51

@Silencer I may need to ask a new question, but I thought I'd ask here first since you wrote the solution. What was the point of using the mask in the last line: (absX & ~bits23)? And does the 127-24 have to do with the bits range within the byte? Thanks. – Icaria 15/7, 2016 at 10:4

The mask is there because we only want to use the bottom 23 bits of the number - avoid "spill over" into the exponent field. – Silencer 15/7, 2016 at 11:16

The 127-24 is trying to get the value of the exponent right but as I said I may be off by one. If you look at how floating point is represented you should be able to figure it out. Sorry this is an old answer... – Silencer 15/7, 2016 at 11:18
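A small worked sketch may help with both points (the mask and the exponent constant). It is restricted to magnitudes below 2^24, which fit the significand exactly, so no rounding is involved. The function name and the precondition are assumptions made for this illustration. When the leading 1 sits at bit 23 the value is 1.f x 2^23, so the biased exponent starts at 127 + 23 (the "127+24-1" suggested in an earlier comment) and drops by one for every left shift.

```c
#include <stdint.h>

/* Hypothetical helper: bits of (float)mag for 0 < mag < 2^24 (positive only).
   Precondition is NOT checked; mag == 0 would loop forever. */
uint32_t small_int_to_float_bits(uint32_t mag) {
    uint32_t exp = 127 + 23;              /* exponent if the leading 1 is already at bit 23 */
    while (!(mag & 0x00800000u)) {        /* normalise: move the leading 1 up to bit 23 */
        mag <<= 1;
        exp--;
    }
    /* The leading 1 is implicit in the format, so mask it off; otherwise it
       would spill into the lowest bit of the exponent field. */
    return (exp << 23) | (mag & 0x007FFFFFu);
}
```

For example, 5 is 101 in binary: 21 left shifts give 0xA00000, the exponent becomes 150 - 21 = 129, and the result is (129 << 23) | 0x200000 = 0x40A00000.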

Tried a solution that works for any size int.
Does not depend on 2's complement.
Works with INT_MIN.
Learned much from @Floris

[Edit] Adjusted to do rounding and other improvements

#include <stdint.h>
#include <stdio.h>

int Round(uint32_t Odd, unsigned RoundBit, unsigned StickyBit, uint32_t Result);
int Inexact;

// Select your signed integer type: works with any one
//typedef int8_t integer;
//typedef int16_t integer;
//typedef int32_t integer;
typedef int64_t integer;
//typedef intmax_t integer;

uint32_t int_to_IEEEfloat(integer x) {
  uint32_t Result;
  if (x < 0) {  // Note 1
    Result = 0x80000000;
  } else {
    Result = 0;
    x = -x;  // Use negative absolute value. Note 2
  }
  if (x) {
    uint32_t Expo = 127 + 24 - 1;
    static const int32_t m2Power23 = -0x00800000;
    static const int32_t m2Power24 = -0x01000000;
    unsigned RoundBit = 0;
    unsigned StickyBit = 0;
    while (x <= m2Power24) {  // Note 3
      StickyBit |= RoundBit;
      RoundBit = x&1;
      x /= 2;
      Expo++;
    }
    // Round. Note 4
    if (Round(x&1, RoundBit, StickyBit, Result) && (--x <= m2Power24)) {
      x /= 2;
      Expo++;
    }
    if (RoundBit | StickyBit) {  // Note 5
      Inexact = 1; // TBD: Set FP inexact flag
    }
    int32_t i32 = x;  // Note 6
    while (i32 > m2Power23) {
      i32 *= 2;
      Expo--;
    }
    if (Expo >= 0xFF) {
      Result |= 0x7F800000; // Infinity  Note 7
    } else {
      Result |=  (Expo << 23) | ((-i32) & 0x007FFFFF);
    }
  }
  return Result;
}

/*
Note 1  If `integer` was signed-magnitude or 1s complement, then +0 and -0 exist.
Rather than `x<0`, this should be a test if the sign bit is set.
The following `if (x)` will not be taken on +0 and -0.
This ensures the corresponding float +0.0 and -0.0 are returned.

Note 2 Overflow will _not_ occur using 2s complement, 1s complement or sign magnitude.
We are ensuring x at this point is <= 0.

Note 3 Right shifting may shift out a 1.  Use RoundBit and StickyBit to keep
track of bits shifted out for later rounding determination.

Note 4 Round as needed here.  Possible to need to shift once more after rounding.

Note 5 If either RoundBit or StickyBit set, the floating point inexact flag may be set.

Note 6 Since the `integer` type may be less than 32 bits, we need to convert
to a 32 bit integer as IEEE float is 32 bits.

Note 7 Infinity only expected if `integer` was 129 bits or larger.
*/

int Round(uint32_t Odd, unsigned RoundBit, unsigned StickyBit, uint32_t Result) {
  // Round to nearest, ties to even
  return (RoundBit) && (Odd || StickyBit);

  // Truncate toward 0
  // return 0;

  // Truncate away from 0
  // return RoundBit | StickyBit

  // Truncate toward -Infinity
  // return (RoundBit | StickyBit) || Result
}

// For testing
float int_to_IEEEfloatf(integer x) {
  union {
    float f;
    uint32_t u;
  } xx;  // Overlay a float with a 32-bit unsigned integer
  Inexact = 0;
  printf("%20lld ", (long long) x);
  xx.u = int_to_IEEEfloat(x);
  printf("%08lX ", (unsigned long) xx.u);
  printf("%d : ", Inexact);
  printf("%.8e\n", xx.f);
  return xx.f;
}

int main() {
  int_to_IEEEfloatf(0x0);
  int_to_IEEEfloatf(0x1);
  int_to_IEEEfloatf(-0x1);
  int_to_IEEEfloatf(127);
  int_to_IEEEfloatf(-128);
  int_to_IEEEfloatf(12345);
  int_to_IEEEfloatf(32767);
  int_to_IEEEfloatf(-32768);
  int_to_IEEEfloatf(16777215);
  int_to_IEEEfloatf(16777216);
  int_to_IEEEfloatf(16777217);
  int_to_IEEEfloatf(2147483647L);
  int_to_IEEEfloatf(-2147483648L);
  int_to_IEEEfloatf( 9223372036854775807LL);
  int_to_IEEEfloatf(-9223372036854775807LL - 1);
  return 0;
}

Influential answered 23/10, 2013 at 2:7 Comment(7)

Thanks for responding to my question! While your answer is great, I feel like I would be defeating the purpose of stack overflow if I just copied it. Is there any way you can spot what's missing in mine to help me debug it? Also, I need to round to even – Anticipative 23/10, 2013 at 2:13

@Acoustic77 What a stand-up programmer! Will review more. – Influential 23/10, 2013 at 2:14

@Acoustic77 Shift is messed up. You need a 2 way shift. Shift right halving the mantissa and incrementing the exponent until the mantissa's MSbit is 0x00800000. OR shift left (double the mantissa) and decrement exponent until MSbit is 0x00800000. – Influential 23/10, 2013 at 2:22
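For readers following along, the normalisation described in this comment can be sketched as below, including the round-to-nearest-even step asked for earlier in the thread. This is a variant rather than the code discussed above: it shifts left until the leading 1 reaches bit 31, then takes the top 24 bits, so only one shift direction is needed. It assumes a 32-bit two's complement int, the name i2f_bits is made up, and the operation count has not been tuned to the assignment's 30-operation limit.

```c
#include <stdint.h>

/* Hypothetical sketch: bits of (float)x, rounding to nearest, ties to even. */
uint32_t i2f_bits(int32_t x) {
    if (x == 0) return 0;                           /* zero has an all-zero pattern */
    uint32_t sign = (x < 0) ? 0x80000000u : 0u;
    uint32_t mag  = (x < 0) ? 0u - (uint32_t)x      /* unsigned negation, safe for INT_MIN */
                            : (uint32_t)x;
    uint32_t exp  = 127 + 31;                       /* exponent if the leading 1 is at bit 31 */
    while (!(mag & 0x80000000u)) {                  /* normalise to bit 31 */
        mag <<= 1;
        exp--;
    }
    uint32_t frac = (mag >> 8) & 0x007FFFFFu;       /* 23 stored bits, implicit 1 dropped */
    uint32_t rest = mag & 0xFFu;                    /* the 8 bits that do not fit */
    uint32_t result = sign | (exp << 23) | frac;
    /* Round up when above half, or exactly half with an odd fraction.
       A carry out of the fraction correctly bumps the exponent. */
    if (rest > 0x80u || (rest == 0x80u && (frac & 1u))) {
        result++;
    }
    return result;
}
```

As a check on the tie rule: 16777217 (2^24 + 1) is exactly halfway and rounds down to the even neighbour, giving 0x4B800000, while 16777219 rounds up to 0x4B800002.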

@Acoustic7 Suggest unsigned absX and unsigned shift. – Influential 23/10, 2013 at 2:30

ok I switched absX to unsigned absX. How do I add the second part of the shift? Does it matter since my E starts at -1? Maybe I should ditch the "exp" variable and just use E like @floris did? I think something is up with my E or exp that's causing the mess-up – Anticipative 23/10, 2013 at 2:32

let us continue this discussion in chat – Influential 23/10, 2013 at 2:35

Lots of good stuff here - like your " I want to learn how to do this" very much! – Silencer 23/10, 2013 at 2:52

When saying 30 operations do you count iterations of the loops?

if (!x) {return x;}

This only handles positive zero. Why not mask off the sign bit, so that it works for both zeros?

if (!(x & 0x7FFFFFFF)) {return x;}

Besides, many instructions are not needed, for example

complement = ~x + 1;

Just x = -x is enough because x isn't used again later, so absX or complement is redundant. And one negation instruction is faster than two operations, right?

!!shift is also slower than shift != 0. It's only useful when you need to use it as an expression of only 0 and 1, otherwise it's redundant.

Another problem is that signed operations may sometimes be slower than unsigned ones, so you shouldn't declare a variable as int when it isn't necessary. For example, shift = (shift >> 1) will do an arithmetic shift (in most compiler implementations), which may cause unexpected results.

And to find the first bit set there are available instructions for that, no need for shift and test. Just find the bit position and shift the value once. If you're not allowed to use intrinsics then there are many fast ways to do that on Bit Twiddling Hacks too.
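As one illustration of the loop-free approach, here is a binary-search sketch for the position of the highest set bit. It is written from scratch for this example rather than copied from any particular source, assumes a 32-bit unsigned value, and the name highest_bit is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: index (0..31) of the highest set bit of v.
   Returns 0 for v == 0 as well, so callers should handle zero separately. */
unsigned highest_bit(uint32_t v) {
    unsigned pos = 0;
    if (v & 0xFFFF0000u) { v >>= 16; pos += 16; }   /* anything in the top half? */
    if (v & 0x0000FF00u) { v >>= 8;  pos += 8;  }
    if (v & 0x000000F0u) { v >>= 4;  pos += 4;  }
    if (v & 0x0000000Cu) { v >>= 2;  pos += 2;  }
    if (v & 0x00000002u) {           pos += 1;  }
    return pos;
}
```

With the position known, the magnitude can be shifted into place once and the exponent is simply 127 plus that position, instead of shifting one bit per loop iteration.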

Bathurst answered 23/10, 2013 at 2:41 Comment(0)
