Pandas Float64 vs float64 dtypes (note capitalization) causing non-numeric errors?

I was getting some weird errors that after much searching appeared to (maybe) come from my data not being considered numeric in some cases. This seems to be because I used Float64 dtype (which I thought was what I was supposed to do).

TL;DR: What's the difference between Float64 and float64? Why does using Float64 data break a lot of things, such as DataFrame.interpolate? What is the purpose of Float64 existing at all?

Example:

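from io import StringIO  # needed to feed the inline CSV text to read_csv

import pandas as pd
import numpy as np

# Inline CSV; the data lines are not indented so the first column name is exactly "val1"
TESTDATA = u"""\
val1, val2, val3
1.0,  2.0,  3.0
4.0,  5.0,  6.0
7.0,  8.0,  9.0
10.0, NaN, 12.0
13.0, 14.0, 15.0
"""

# Regex separator (comma plus optional whitespace) requires the python engine
df = pd.read_csv(StringIO(TESTDATA), sep=r",\s*", engine='python', dtype=pd.Float64Dtype())

print(df)
print()
print(df.dtypes)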

This outputs:

   val1  val2  val3
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
3  10.0  <NA>  12.0
4  13.0  14.0  15.0

val1    Float64
val2    Float64
val3    Float64
dtype: object

So far everything looks good (as expected), but now I try:

df.interpolate()

and get:

ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear

This was rather baffling to me until I came across other answers and realized that this error might be coming about because interpolate thought the data was non-numeric and was therefore limiting the valid fill methods to ffill/bfill.

So I found that the following works:

df = df.astype(np.float64).interpolate()                                             
print(df.dtypes)                                                                
print()                                                                         
print(df)

with output:

val1    float64
val2    float64
val3    float64
dtype: object

   val1  val2  val3
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
3  10.0  11.0  12.0
4  13.0  14.0  15.0

Note that giving it np.float64 or just float gives the same result.

Running pd.to_numeric(df.val1) on the Float64 dataframe returned a series that still has Float64 type, indicating that pandas does seem to recognize that Float64 is numeric.
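
A quick way to check the observation above is to ask pandas directly how it classifies each dtype. The snippet below is only a minimal sketch using small made-up Series (not the question's DataFrame); it shows that both dtypes count as numeric, and that the difference is whether the dtype is a pandas extension dtype.

```python
import numpy as np
import pandas as pd

# Small illustrative Series (hypothetical data, not the frame from the question)
ext = pd.Series([1.0, None, 3.0], dtype="Float64")    # pandas nullable extension dtype
npy = pd.Series([1.0, np.nan, 3.0], dtype="float64")  # plain NumPy dtype

# Both dtypes are reported as numeric
ext_is_numeric = pd.api.types.is_numeric_dtype(ext.dtype)
npy_is_numeric = pd.api.types.is_numeric_dtype(npy.dtype)

# Only Float64 is an extension dtype
ext_is_extension = pd.api.types.is_extension_array_dtype(ext.dtype)
npy_is_extension = pd.api.types.is_extension_array_dtype(npy.dtype)

print(ext_is_numeric, npy_is_numeric)      # True True
print(ext_is_extension, npy_is_extension)  # True False
```

So the interpolate failure is probably not about the data being "non-numeric" as such, but about the operation not handling the extension array.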

Preachy answered 16/9, 2021 at 2:24 Comment(0)

If you don't need the extension dtype (the conversion loses no data), you can manually downcast the column to a standard NumPy type by passing the column values through an array and changing its type, here to numpy.float64. Assigning the result back to the column keeps the DataFrame's existing index:

df[col_name] = df[col_name].values.astype(float)
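
As a hedged alternative (not part of the original answer), Series.to_numpy lets you state the target dtype and the missing-value marker explicitly, which makes the pd.NA to np.nan conversion visible. The Series below is made-up illustration data.

```python
import numpy as np
import pandas as pd

# Hypothetical nullable column with one missing value
s = pd.Series([1.0, None, 3.0], dtype="Float64")

# Convert to a plain NumPy float64 array, mapping pd.NA to np.nan explicitly
arr = s.to_numpy(dtype="float64", na_value=np.nan)

# Rebuild a Series on the original index; its dtype is now NumPy float64
converted = pd.Series(arr, index=s.index)

print(converted.dtype)  # float64
```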

Or answered 25/12, 2021 at 16:19 Comment(1)
Is there any way this can go wrong? - Alexine
In [52]: pd.Float64Dtype?
Init signature: pd.Float64Dtype()
Docstring:     
An ExtensionDtype for float64 data.

This dtype uses ``pd.NA`` as missing value indicator.

With a plain NumPy float dtype (float64), the frame displays as

In [68]: df
Out[68]: 
   val1  val2  val3
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
3  10.0   NaN  12.0
4  13.0  14.0  15.0

where the NaN is np.nan, a valid float.

With the Float64 extension dtype, the same frame displays as

In [71]: df
Out[71]: 
   val1  val2  val3
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
3  10.0  <NA>  12.0
4  13.0  14.0  15.0

where that <NA> is pd.NA, an instance of pandas._libs.missing.NAType.
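
To make the distinction between the two missing-value markers concrete, here is a minimal sketch (with illustrative values only) comparing np.nan and pd.NA:

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary Python float, so it fits in a NumPy float64 array
nan_is_float = isinstance(np.nan, float)   # True
nan_equal = (np.nan == np.nan)             # False, per IEEE 754

# pd.NA is a separate singleton object; comparisons propagate it
na_is_float = isinstance(pd.NA, float)     # False
na_compare = (pd.NA == 1)                  # pd.NA, not True or False

# A Float64 Series hands back pd.NA for its missing entries
s_ext = pd.Series([1.0, None], dtype="Float64")
missing = s_ext.iloc[1]

print(type(missing).__name__)  # NAType
```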

Your df.interpolate() error indicates that not every operation has been implemented for the extension dtype. Some sources suggest it is still experimental.

Winser answered 16/9, 2021 at 6:23 Comment(3)
For completeness' sake, do you have any more insight into the purpose of these 'extension dtypes'? If they are not generally backwards compatible, or supported by the rest of the core package, they seem like more of an experimental feature aimed at solving some problem, but not something that should be presented to an average user (like me) as a default/standard way to handle floats. (That may or may not have been the intention, but when I type in pandas.Fl, hit tab, and Float64Dtype comes up, you can see how that would give the impression it's the one I should use.) - Preachy
Data loaded into pandas often has undefined values. While np.nan can be used for floats, there isn't an equivalent for integers. So extending integers to handle some sort of <NA> flag makes sense, even if it doesn't work exactly like an int. I suppose the Float was added in the same way, even though it isn't needed as strongly. But I know numpy better than pandas, so I can only speculate. - Winser
See github.com/pandas-dev/pandas/issues/40252 - Tessler
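
The integer case described in the comments above can be sketched as follows. This is illustrative data only: without a nullable dtype, a missing value forces an integer column to float64, whereas the Int64 extension dtype keeps the integers and marks the gap with pd.NA.

```python
import pandas as pd

# Default inference: the missing value forces an upcast to float64
default = pd.Series([1, 2, None])

# Nullable integer extension dtype: values stay integers, the gap is pd.NA
nullable = pd.Series([1, 2, None], dtype="Int64")

print(default.dtype)   # float64
print(nullable.dtype)  # Int64
```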
