Before proceeding with this post, it is important to understand the difference between NaN and None: NaN is a float, None is a Python object. Pandas is better suited to working with scalar types because many methods on those types can be vectorised. Pandas tries to handle None and NaN consistently, but NumPy cannot.
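To see the difference concretely (a quick sketch, not tied to this question's data):

import numpy as np
import pandas as pd

type(np.nan)   # float
type(None)     # NoneType

# In a numeric column, pandas converts None to NaN so the column keeps a
# vectorisable float64 dtype...
pd.Series([1, 2, None]).dtype                  # dtype('float64')
# ...whereas forcing object dtype keeps None and loses vectorisation.
pd.Series([1, 2, None], dtype=object).dtype    # dtype('O')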
My suggestion (and Andy's) is to stick with NaN.
But to answer your question...
pandas >= 0.18: Use the na_values=['-'] argument with read_csv
If you loaded this data from CSV/Excel, I have good news for you. You can quash this at the root during data loading, instead of having to patch it up with code afterwards.
Most of the pd.read_* functions (such as read_csv and read_excel) accept a na_values argument.
file.csv
A,B
-,1
3,-
2,-
5,3
1,-2
-5,4
-1,-1
-,0
9,0
Now, to convert the - characters into NaNs, do:
import pandas as pd
df = pd.read_csv('file.csv', na_values=['-'])
df
A B
0 NaN 1.0
1 3.0 NaN
2 2.0 NaN
3 5.0 3.0
4 1.0 -2.0
5 -5.0 4.0
6 -1.0 -1.0
7 NaN 0.0
8 9.0 0.0
And similar for other functions/file formats.
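For example, the same data in an Excel sheet (a hypothetical file.xlsx) would be handled the same way:

df = pd.read_excel('file.xlsx', na_values=['-'])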
P.S.: On v0.24+, you can preserve integer dtype even if your column has NaNs (yes, talk about having your cake and eating it too). You can specify dtype='Int32':
df = pd.read_csv('file.csv', na_values=['-'], dtype='Int32')
df
A B
0 NaN 1
1 3 NaN
2 2 NaN
3 5 3
4 1 -2
5 -5 4
6 -1 -1
7 NaN 0
8 9 0
df.dtypes
A Int32
B Int32
dtype: object
The dtype is not a conventional int type... but rather, a Nullable Integer Type. There are other options.
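For example (a quick sketch; these nullable dtypes have been available since v0.24):

pd.Series([1, 2, None], dtype='Int64').dtype   # Int64Dtype()
# 'Int8', 'Int16', 'Int32', 'Int64' and the unsigned 'UInt*' variants all work the same way.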
Handling Numeric Data: pd.to_numeric with errors='coerce'
If you're dealing with numeric data, a faster solution is to use pd.to_numeric with the errors='coerce' argument, which coerces invalid values (values that cannot be cast to a numeric type) to NaN.
pd.to_numeric(df['A'], errors='coerce')
0 NaN
1 3.0
2 2.0
3 5.0
4 1.0
5 -5.0
6 -1.0
7 NaN
8 9.0
Name: A, dtype: float64
To retain (nullable) integer dtype, use
pd.to_numeric(df['A'], errors='coerce').astype('Int32')
0 NaN
1 3
2 2
3 5
4 1
5 -5
6 -1
7 NaN
8 9
Name: A, dtype: Int32
To coerce multiple columns, use apply:
df[['A', 'B']].apply(pd.to_numeric, errors='coerce').astype('Int32')
A B
0 NaN 1
1 3 NaN
2 2 NaN
3 5 3
4 1 -2
5 -5 4
6 -1 -1
7 NaN 0
8 9 0
...and assign the result back afterwards.
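That assignment step would look something like this (same columns as above):

df[['A', 'B']] = df[['A', 'B']].apply(pd.to_numeric, errors='coerce').astype('Int32')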
More information can be found in this answer.