I am having problems reading probabilities from a CSV file using pandas.read_csv; some of the values are read as floats strictly greater than 1.0.
Specifically, I am confused about the following behavior:
>>> import io
>>> import pandas
>>> pandas.read_csv(io.StringIO("column\n0.99999999999999998"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n0.99999999999999999"))["column"][0]
1.0000000000000002
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000000"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000001"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000008"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000009"))["column"][0]
1.0000000000000002
The default float-parsing behavior seems to be non-monotonic, and in particular some values starting with 0.9... are converted to floats that are strictly greater than 1.0, causing problems e.g. when feeding them into sklearn.metrics.
The documentation states that read_csv has a parameter float_precision that can be used to select “which converter the C engine should use for floating-point values”, and setting this to 'high' indeed solves my problem.
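For reference, this is the workaround I am using for now; float_precision is the documented read_csv parameter, and from what I can tell 'round_trip' (the round-trip converter) should behave like the builtin float() on this input:
>>> csv_text = "column\n0.99999999999999999"
>>> # default C-engine converter: parsed value overshoots 1.0
>>> pandas.read_csv(io.StringIO(csv_text))["column"][0]
1.0000000000000002
>>> # higher-precision converter
>>> pandas.read_csv(io.StringIO(csv_text), float_precision="high")["column"][0]
1.0
>>> # round-trip converter
>>> pandas.read_csv(io.StringIO(csv_text), float_precision="round_trip")["column"][0]
1.0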
However, I would like to understand the default behavior:
- Where can I find the source code of the default float converter?
- Where can I find documentation on the intended behavior of the default float converter and the other possible choices?
- Why does a single-digit change in the least significant position skip over a value?
- Why does this behave non-monotonically at all?
Edit regarding “duplicate question”: This is not a duplicate. I am aware of the limitations of floating-point math. I was specifically asking about the default parsing mechanism in Pandas, since the builtin float does not show this behavior:
>>> float("0.99999999999999999")
1.0
...and I could not find documentation.
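For completeness, checking the builtin float() on all of the strings from above (as far as I can tell CPython parses these with correct rounding, so every one of them should come out as exactly 1.0, and in particular never above it):
>>> [float(s) for s in ["0.99999999999999998", "0.99999999999999999",
...                     "1.00000000000000000", "1.00000000000000001",
...                     "1.00000000000000008", "1.00000000000000009"]]
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0]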