Convert pandas.Series from dtype object to float, and errors to nans [duplicate]
Asked Answered
F

3

56

Consider the following situation:

In [2]: a = pd.Series([1,2,3,4,'.'])

In [3]: a
Out[3]: 
0    1
1    2
2    3
3    4
4    .
dtype: object

In [8]: a.astype('float64', raise_on_error = False)
Out[8]: 
0    1
1    2
2    3
3    4
4    .
dtype: object

I would have expected an option that allows conversion while turning erroneous values (such as that .) to NaNs. Is there a way to achieve this?

Fag answered 20/9, 2014 at 20:1 Comment(0)
R
90

Use pd.to_numeric with errors='coerce'

# Setup
s = pd.Series(['1', '2', '3', '4', '.'])
s

0    1
1    2
2    3
3    4
4    .
dtype: object

pd.to_numeric(s, errors='coerce')

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
dtype: float64

If you need the NaNs filled in, use Series.fillna.

pd.to_numeric(s, errors='coerce').fillna(0, downcast='infer')

0    1
1    2
2    3
3    4
4    0
dtype: float64

Note, downcast='infer' will attempt to downcast floats to integers where possible. Remove the argument if you don't want that.

From v0.24+, pandas introduces a Nullable Integer type, which allows integers to coexist with NaNs. If you have integers in your column, you can use

pd.__version__
# '0.24.1'

pd.to_numeric(s, errors='coerce').astype('Int32')

0      1
1      2
2      3
3      4
4    NaN
dtype: Int32

There are other options to choose from as well, read the docs for more.


Extension for DataFrames

If you need to extend this to DataFrames, you will need to apply it to each row. You can do this using DataFrame.apply.

# Setup.
np.random.seed(0)
df = pd.DataFrame({
    'A' : np.random.choice(10, 5), 
    'C' : np.random.choice(10, 5), 
    'B' : ['1', '###', '...', 50, '234'], 
    'D' : ['23', '1', '...', '268', '$$']}
)[list('ABCD')]
df

   A    B  C    D
0  5    1  9   23
1  0  ###  3    1
2  3  ...  5  ...
3  3   50  2  268
4  7  234  4   $$

df.dtypes

A     int64
B    object
C     int64
D    object
dtype: object

df2 = df.apply(pd.to_numeric, errors='coerce')
df2

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

df2.dtypes

A      int64
B    float64
C      int64
D    float64
dtype: object

You can also do this with DataFrame.transform; although my tests indicate this is marginally slower:

df.transform(pd.to_numeric, errors='coerce')

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

If you have many columns (numeric; non-numeric), you can make this a little more performant by applying pd.to_numeric on the non-numeric columns only.

df.dtypes.eq(object)

A    False
B     True
C    False
D     True
dtype: bool

cols = df.columns[df.dtypes.eq(object)]
# Actually, `cols` can be any list of columns you need to convert.
cols
# Index(['B', 'D'], dtype='object')

df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
# Alternatively,
# for c in cols:
#     df[c] = pd.to_numeric(df[c], errors='coerce')

df

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

Applying pd.to_numeric along the columns (i.e., axis=0, the default) should be slightly faster for long DataFrames.

Radiancy answered 22/12, 2017 at 14:6 Comment(6)
This is awesome :-) , We should keep the thing updated , since left deprecated method on this site is not good for the future visitor :-) , You did itLilian
Aha, maybe you can add s.str.isalnum() :-) combine with maskLilian
@Wen You mean s.str.isdigit()? It will only work for integers, not floats. Good idea though.Radiancy
@Wen as per I know to_numeric is gold here.Milesmilesian
@cᴏʟᴅsᴘᴇᴇᴅ as of reference we can take this question https://mcmap.net/q/162886/-how-to-replace-39-any-strings-39-with-nan-in-pandas-dataframe-using-a-boolean-mask/4800652 to consideration.Milesmilesian
Thanks! However, using axis=0 yields an error in my Python 3.6.2: File "C:\Python\lib\site-packages\pandas\core\series.py", line 2294, in apply mapped = lib.map_infer(values, f, convert=convert_dtype) File "pandas\src\inference.pyx", line 1207, in pandas.lib.map_infer (pandas\lib.c:66124) File "C:\Python\lib\site-packages\pandas\core\series.py", line 2282, in <lambda> f = lambda x: func(x, *args, **kwds) TypeError: to_numeric() got an unexpected keyword argument 'axis'Christoffer
A
20
In [30]: pd.Series([1,2,3,4,'.']).convert_objects(convert_numeric=True)
Out[30]: 
0     1
1     2
2     3
3     4
4   NaN
dtype: float64
Asturias answered 20/9, 2014 at 20:6 Comment(5)
I think I took out the function of raise_on_error a while back. Doesn't do anything ATM.Asturias
I opened an issue for enhancement here: github.com/pydata/pandas/issues/8332, feel free to comment on proposed APIAsturias
On 0.14.1 it prevents it from throwing an exception. Without specifying it the astype statement would've raise an error.Fag
sorry you are right. That's why never JUST read the code, test it :)Asturias
.convert_objects() method has been deprecated since 0.17, pd.to_numeric is the new way to go.Mimas
O
-2

Do this:

pd.to_numeric(s, errors='coerce')

Optimum answered 6/11, 2022 at 6:53 Comment(1)
repetitive answer with no additional effortIntramundane

© 2022 - 2024 — McMap. All rights reserved.