I am writing a program for embedded hardware that only supports 32-bit single-precision floating-point arithmetic. The algorithm I am implementing, however, requires 64-bit double-precision addition and comparison. I am trying to emulate the `double` datatype using a tuple of two `float`s, so a `double` `d` will be emulated as a `struct` containing the tuple `(float d.hi, float d.low)`.
The comparison should be straightforward using lexicographic ordering. The addition, however, is a bit tricky because I am not sure which base I should use. Should it be `FLT_MAX`? And how can I detect a carry?
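For reference, one known way to approach this is "double-float" arithmetic, where the pair is kept as an unevaluated sum `hi + low` rather than as digits in some base, so there is no explicit carry to detect. The following is only a minimal sketch under some assumptions: the hardware rounds to nearest as in IEEE 754 single precision, and the compiler neither reassociates float expressions nor evaluates them in higher precision (for example, no `-ffast-math`). The names `df64`, `two_sum`, `fast_two_sum`, `df_add` and `df_cmp` are hypothetical and chosen only for illustration.

```c
/* Hypothetical double-float type: the represented value is hi + low,
 * kept normalized so that hi is the sum rounded to the nearest float. */
typedef struct {
    float hi;   /* leading part */
    float low;  /* trailing part: the rounding error left over from hi */
} df64;

/* Knuth's TwoSum: hi = fl(a + b), low = exact rounding error of that sum. */
static df64 two_sum(float a, float b)
{
    float s  = a + b;
    float bb = s - a;
    float e  = (a - (s - bb)) + (b - bb);
    df64 r = { s, e };
    return r;
}

/* Dekker's FastTwoSum: same result as TwoSum, but requires |a| >= |b|. */
static df64 fast_two_sum(float a, float b)
{
    float s = a + b;
    float e = b - (s - a);
    df64 r = { s, e };
    return r;
}

/* Addition: sum the leading and trailing parts separately with their
 * exact errors, then fold the errors back in and renormalize. */
static df64 df_add(df64 x, df64 y)
{
    df64 s = two_sum(x.hi, y.hi);
    df64 t = two_sum(x.low, y.low);
    s.low += t.hi;
    s = fast_two_sum(s.hi, s.low);
    s.low += t.low;
    return fast_two_sum(s.hi, s.low);
}

/* Lexicographic comparison; valid only when both operands are normalized.
 * Returns -1, 0 or 1. */
static int df_cmp(df64 x, df64 y)
{
    if (x.hi < y.hi) return -1;
    if (x.hi > y.hi) return 1;
    if (x.low < y.low) return -1;
    if (x.low > y.low) return 1;
    return 0;
}
```

With this sketch, adding `{1.0f, 0.0f}` and `{1e-10f, 0.0f}` should keep the small term in `low` instead of dropping it as a plain `float` addition would. Note that a pair of floats gives roughly 48 significant bits rather than the 53 of a true `double`, and the exponent range stays that of `float`.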
How can this be done?
Edit (Clarity): I need the extra significant digits rather than the extra range.
… `double`, or just the extra significant digits? – Redfish
… `double` precision. Specifically, `1.0E+20` and `1.0E-03` differ by more than epsilon (for `double` this is typically `1.0E-16` or so) so I'd expect that operations like `1.0E+20 + 1.0E-03` would equate to `1.0E+20`, even when using `double`. Will that be an issue? – Tricuspid
… `1e30 + 1e-30` which is vastly larger than normal double significand. – Mirandamire