I want to use numpy.where() to add a column to a pandas.DataFrame. I'd like to use NaN values for the rows where the condition is false (to indicate that these values are "missing").
Consider:
>>> import numpy; import pandas
>>> df = pandas.DataFrame({'A':[1,2,3,4]}); print(df)
A
0 1
1 2
2 3
3 4
>>> df['B'] = numpy.nan
>>> df['C'] = numpy.where(df['A'] < 3, 'yes', numpy.nan)
>>> print(df)
A B C
0 1 NaN yes
1 2 NaN yes
2 3 NaN nan
3 4 NaN nan
>>> df.isna()
A B C
0 False True False
1 False True False
2 False True False
3 False True False
Why does B show "NaN" but C shows "nan"? And why does DataFrame.isna() fail to detect the NaN values in C?
Should I use something other than numpy.nan inside where? None
and pandas.NA
both seem to work and can be detected by DataFrame.isna(), but I'm not sure these are the best choice.
Thank you!
Edit: As per @Tim Roberts and @DYZ, numpy.where returns an array of type string, so the str constructor is called on numpy.NaN. The values in column C are actually strings "nan". The question remains, however: what is the most elegant thing to do here? Should I use None
? Or something else?