I was getting some weird errors that after much searching appeared to (maybe) come from my data not being considered numeric in some cases. This seems to be because I used Float64 dtype (which I thought was what I was supposed to do).
TL;DR: What's the difference between Float64 and float64? Why does using Float64 data break a lot of stuff, such as df.interpolate()? What is even the purpose of Float64 existing?
Example:
from io import StringIO  # needed for StringIO(TESTDATA) below

import pandas as pd
import numpy as np
TESTDATA = u"""\
val1, val2, val3
1.0, 2.0, 3.0
4.0, 5.0, 6.0
7.0, 8.0, 9.0
10.0, NaN, 12.0
13.0, 14.0, 15.0
"""
df = pd.read_csv(StringIO(TESTDATA), sep=r",\s*", engine='python', dtype=pd.Float64Dtype())
print(df)
print()
print(df.dtypes)
This outputs:
   val1  val2  val3
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
3  10.0  <NA>  12.0
4  13.0  14.0  15.0

val1    Float64
val2    Float64
val3    Float64
dtype: object
So far everything looks good (as expected), but now I try:
df.interpolate()
and get:
ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear
This was rather baffling to me until I came across other answers and realized that this error might be coming about because interpolate thought the data was non-numeric and was therefore limiting the valid fill methods to ffill/bfill.
So I found that the following works:
df = df.astype(np.float64).interpolate()
print(df.dtypes)
print()
print(df)
with output:
val1    float64
val2    float64
val3    float64
dtype: object

   val1  val2  val3
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
3  10.0  11.0  12.0
4  13.0  14.0  15.0
Note that giving it np.float64 or just float gives the same result.
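For anyone who wants to keep the nullable dtype afterwards, the astype workaround above can be wrapped in a round trip. This is only a minimal sketch: the helper name interpolate_nullable is hypothetical (not a pandas function), and it assumes every column is Float64 and that linear interpolation (the default) is what you want.

```python
import numpy as np
import pandas as pd


def interpolate_nullable(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: interpolate via float64, then restore Float64."""
    as_numpy = frame.astype(np.float64)  # pd.NA becomes np.nan
    filled = as_numpy.interpolate()      # linear interpolation by default
    return filled.astype("Float64")      # back to the nullable extension dtype


# Same values as the val2 column in the example above
df = pd.DataFrame({"val2": pd.array([2.0, 5.0, 8.0, None, 14.0], dtype="Float64")})
result = interpolate_nullable(df)
print(result)
print(result.dtypes)
```

The missing value at index 3 is filled with 11.0, matching the output above, and the column ends up as Float64 again.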
Running pd.to_numeric(df.val1) on the Float64 dataframe returned a series that still has Float64 dtype, indicating that pandas does seem to recognize that Float64 is numeric.
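A more direct way to check this, as far as I can tell, is to ask pandas about the dtype itself. The sketch below uses two small made-up series (not the dataframe above) to compare the two dtypes. Both report as numeric, but Float64 is a pandas extension dtype that marks missing values with pd.NA, while float64 is the plain NumPy dtype that uses NaN.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype
from pandas.api.types import is_numeric_dtype

# Illustrative data only: one missing value in each series
nullable = pd.Series([1.0, None, 3.0], dtype="Float64")  # pandas nullable float
plain = pd.Series([1.0, None, 3.0], dtype="float64")     # NumPy float

# Both dtypes are considered numeric by pandas
nullable_is_numeric = is_numeric_dtype(nullable)
plain_is_numeric = is_numeric_dtype(plain)

# Only Float64 is a pandas extension dtype; float64 is a NumPy dtype
nullable_is_extension = isinstance(nullable.dtype, ExtensionDtype)
plain_is_extension = isinstance(plain.dtype, ExtensionDtype)

# The missing-value markers differ: pd.NA versus NaN
missing_nullable = nullable.iloc[1]
missing_plain = plain.iloc[1]

print(nullable_is_numeric, plain_is_numeric)
print(nullable_is_extension, plain_is_extension)
print(repr(missing_nullable), repr(missing_plain))
```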