I have to analyse a large amount of data using Python 3 (the PyPy implementation), doing some operations on quite large floats and checking whether the results are close enough to integers.
To illustrate, say I'm generating random pairs of numbers and checking whether they form Pythagorean triples (i.e. are the legs of a right triangle whose sides are all integers):
from math import hypot
from pprint import pprint
from random import randrange
from time import time

def gen_rand_tuples(start, stop, amount):
    '''
    Generates random integer pairs and converts them to tuples of floats.
    '''
    for _ in range(amount):
        yield (float(randrange(start, stop)), float(randrange(start, stop)))

t0 = time()
## Results are those pairs that produce integer hypotenuses, or
## at least very close ones, to within 1e-12.
results = [t for t in gen_rand_tuples(1, 2**32, 10_000_000) if abs((h := hypot(*t)) - int(h)) < 1e-12]
print('Results found:')
pprint(results)
print('finished in:', round(time() - t0, 2), 'seconds.')
Running it I got:
Python 3.9.17 (a61d7152b989, Aug 13 2023, 10:27:46)
[PyPy 7.3.12 with GCC 13.2.1 20230728 (Red Hat 13.2.1-1)] on linux
Type "help", "copyright", "credits" or "license()" for more information.
>>>
===== RESTART: /home/user/Downloads/pythagorean_test_floats.py ====
Results found:
[(2176124225.0, 2742331476.0),
(342847595.0, 3794647043.0),
(36.0, 2983807908.0),
(791324089.0, 2122279232.0)]
finished in: 2.64 seconds.
Fun: it ran fast, processing 10 million data points in a bit over 2 seconds, and I even found some matching data. The hypotenuses are apparently integers:
>>> pprint([hypot(*x) for x in results])
[3500842551.0, 3810103759.0, 2983807908.0, 2265008378.0]
But not really: checking the results with the arbitrary-precision decimal module shows they are not actually close enough to integers:
>>> from decimal import Decimal
>>> pprint([(x[0]*x[0] + x[1]*x[1]).sqrt() for x in (tuple(map(Decimal, x)) for x in results)])
[Decimal('3500842551.000000228516418075'),
Decimal('3810103758.999999710375341513'),
Decimal('2983807908.000000217172157183'),
Decimal('2265008377.999999748566051441')]
So I think the problem is that the numbers are large enough to fall in the range where Python floats lack precision, so false positives are returned.
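A quick check with math.ulp (available since Python 3.9) confirms this: at these magnitudes adjacent floats are about 5e-7 apart, so a 1e-12 tolerance is far finer than anything a float can express there, and the test degenerates into "hypot() happened to round to an exact integer":

>>> from math import ulp
>>> ulp(3500842551.0)  # gap between adjacent floats at this magnitude
4.76837158203125e-07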
Now, we can just change the program to use arbitrary-precision decimals everywhere:
from decimal import Decimal
from pprint import pprint
from random import randrange
from time import time

def dec_hypot(x, y):
    return (x*x + y*y).sqrt()

def gen_rand_tuples(start, stop, amount):
    '''
    Generates random integer pairs and converts them to tuples of decimals.
    '''
    for _ in range(amount):
        yield (Decimal(randrange(start, stop)), Decimal(randrange(start, stop)))

t0 = time()
## Results are those pairs that produce integer hypotenuses, or
## at least very close ones, to within 1e-12.
results = [t for t in gen_rand_tuples(1, 2**32, 10_000_000) if abs((h := dec_hypot(*t)) - h.to_integral_value()) < Decimal(1e-12)]
print('Results found:')
pprint(results)
print('finished in:', round(time() - t0, 2), 'seconds.')
Now we don't get any false positives, but we take a large performance hit: what previously took a bit over 2 seconds now takes over 100. It appears decimals are not JIT-friendly:
====== RESTART: /home/user/Downloads/pythagorean_test_dec.py ======
Results found:
[]
finished in: 113.82 seconds.
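To see that it's the Decimal arithmetic itself, not the generator, a micro-benchmark along these lines isolates the per-operation cost (a sketch; absolute timings will vary, and I'm assuming PyPy uses the pure-Python decimal implementation rather than the C-accelerated _decimal, which would explain the gap):

from timeit import timeit
from decimal import Decimal
from math import hypot

fa, fb = 2176124225.0, 2742331476.0
da, db = Decimal(2176124225), Decimal(2742331476)

print('float  :', timeit(lambda: hypot(fa, fb), number=100_000))
print('Decimal:', timeit(lambda: (da*da + db*db).sqrt(), number=100_000))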
I found this answer to the question CPython and PyPy Decimal operation performance, which suggests double-double precision numbers as a faster, JIT-friendly alternative to decimals with better precision than built-in floats. So I pip-installed the third-party doubledouble module and changed the program accordingly:
from doubledouble import DoubleDouble
from decimal import Decimal
from pprint import pprint
from random import randrange
from time import time

def dd_hypot(x, y):
    return (x*x + y*y).sqrt()

def gen_rand_tuples(start, stop, amount):
    for _ in range(amount):
        yield (DoubleDouble(randrange(start, stop)), DoubleDouble(randrange(start, stop)))

t0 = time()
print('Results found:')
results = [t for t in gen_rand_tuples(1, 2**32, 10_000_000) if abs((h := dd_hypot(*t)) - int(h)) < DoubleDouble(1e-12)]
pprint(results)
print('finished in:', round(time() - t0, 2), 'seconds.')
But I get this error:
======= RESTART: /home/user/Downloads/pythagorean_test_dd.py ======
Results found:
Traceback (most recent call last):
File "/home/user/Downloads/pythagorean_test_dd.py", line 24, in <module>
results = [t for t in gen_rand_tuples(1, 2**32, 10_000_000) if abs((h := dd_hypot(*t)) - int(h)) < DoubleDouble(1e-12)]
File "/home/user/Downloads/pythagorean_test_dd.py", line 24, in <listcomp>
results = [t for t in gen_rand_tuples(1, 2**32, 10_000_000) if abs((h := dd_hypot(*t)) - int(h)) < DoubleDouble(1e-12)]
TypeError: int() argument must be a string, a bytes-like object or a number, not 'DoubleDouble'
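That makes sense: if I understand it right, int() requires its argument's class to implement __int__ or __index__ (with __trunc__ as a last fallback on 3.9), and DoubleDouble appears to define none of them:

>>> any(hasattr(DoubleDouble, m) for m in ('__int__', '__index__', '__trunc__'))
False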
So the problem is that the module doesn't provide a conversion or round-to-nearest-integer method. The best I could write was an extremely contrived "int" function that rounds a double-double to the nearest integer by taking a round trip through Decimal and back to DoubleDouble:
def contrived_int(dd):
    # Sum the high and low components exactly in Decimal and round
    # to the nearest integer...
    rounded = (Decimal(dd.x) + Decimal(dd.y)).to_integral_value()
    # ...then split the result back into high and low float words.
    hi = float(rounded)
    lo = float(rounded - Decimal(hi))
    return DoubleDouble(hi, lo)
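It does work; for example, taking the last false positive from the float run (given the Decimal value above, 2265008377.9999997..., the nearest integer should be 2265008378, and that's what I get back in the high word):

>>> h = dd_hypot(DoubleDouble(791324089), DoubleDouble(2122279232))
>>> contrived_int(h).x
2265008378.0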
But it's very roundabout, defeats the purpose of sidestepping decimals, and makes the program even slower than the full-decimal version.
So my question is: is there a fast way to round a double-double precision number to the nearest integer directly, without intermediate steps going through decimals or strings?