NumPy or Pandas: Keeping array type as integer while having a NaN value

Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?

In particular, I am converting an in-house data structure to a Pandas DataFrame. In our structure, we have integer-type columns that still have NaN's (but the dtype of the column is int). It seems to recast everything as a float if we make this a DataFrame, but we'd really like these columns to stay int.

Thoughts?

Things tried:

I tried using the from_records() function under pandas.DataFrame, with coerce_float=False, and this did not help. I also tried NumPy masked arrays with a NaN fill_value, which also did not work. In all cases the column dtype became float.

Anastasiaanastasie answered 18/7, 2012 at 18:30 Comment(3)
Could you use a numpy masked array?Roulade
I'll give it a try. I also tried the from_records function under pandas.DataFrame, with coerce_float=False, but no luck... it still makes the new data have type float64.Anastasiaanastasie
Yeah, no luck. Even with masked array, it still converts to float. It's looking like Pandas goes like this: "Is there a NaN anywhere? ... Then everything's a float." Hopefully there is a way around this.Anastasiaanastasie

This capability has been added to pandas beginning with version 0.24.

At this point, it requires the use of extension dtype 'Int64' (capitalized), rather than the default dtype 'int64' (lowercase).
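A minimal sketch of the dtype in use, assuming a recent pandas (the variable name is just illustrative):

```python
import numpy as np
import pandas as pd

# the capitalized 'Int64' extension dtype keeps integers alongside a missing value
s = pd.Series([1, 2, np.nan], dtype="Int64")
print(s.dtype)  # Int64
```

Note that the lowercase `dtype="int64"` would fail here, since plain NumPy int64 has no representation for missing values.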

Fabrice answered 24/8, 2018 at 3:36 Comment(4)
For now you have to specify a special dtype like 'Int64' to make it work. It will be even better once it is enabled by default.Hydromancy
This is great! There's a small issue though that PyCharm fails to display the dataframe in the debug window if used this way. You can see my answer for another question for how to force displaying it: #38957160 (the original problem there is different, but the solution for displaying the dataframe works)Vermiculation
Do I have to use 'Int64' or is there something like 'Int8'? It uses an insane amount of memory compared to np.float.Saucepan
'Int8' seems to work, but np.float still seems to load much faster. The issue seems to be that memory isn't released in between; presumably the garbage collector will eventually run.Saucepan

NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress on NA values in NumPy (similar to NAs in R), but it seems it will be at least 6 months to a year before NumPy gets these features:

http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na

(This feature has been added beginning with version 0.24 of pandas, but note it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lower case): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support )
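A small sketch of converting an existing float column (upcast because of NaN) to the nullable dtype, assuming a recent pandas; the column name is hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, np.nan]})
print(df["a"].dtype)  # float64: the NaN forced an upcast

# whole-number floats cast safely to the nullable extension dtype
df["a"] = df["a"].astype("Int64")
print(df["a"].dtype)  # Int64: values stay integers, the missing value is preserved
```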

Headpiece answered 18/7, 2012 at 18:43 Comment(2)
Hi Wes, is there any update on this? We run into issues that join columns are converted into either ints or floats, based on the existence of a NA value in the original list. (Creating issues later on when trying to merge these dataframes)Kursk
Updated link: pandas-docs.github.io/pandas-docs-travis/whatsnew/…Fabrice

If you are trying to convert a float vector (e.g. 1.143) to integer (1), and that vector has NAs, converting it directly to the new 'Int64' dtype will raise an error. To solve this, round the numbers first and then do ".astype('Int64')":

import numpy as np
import pandas as pd

s1 = pd.Series([1.434, 2.343, np.nan])
# without round(), the cast raises:
#   cannot safely cast non-equivalent float64 to int64
# s1.astype('Int64')

# with round() it works:
s1.round().astype('Int64')
0      1
1      2
2    NaN
dtype: Int64

My use case was a float series that I wanted to round to int, but .round() alone still leaves a float with decimals; you need the cast to int to remove them.

Gentes answered 1/7, 2019 at 18:53 Comment(1)
For future seekers: I was receiving errors with this approach, then noticed there was a difference in case for the integer dtype. Note that Int64 != int64. Hope it helps someone.Tytybald

If performance is not the main issue, you can store strings instead.

df.col = df.col.dropna().apply(lambda x: str(int(x)))

Then you can mix them with NaN as much as you want. If you really want integers, depending on your application, you can use -1, 0, 1234567890, or some other dedicated sentinel value to represent NaN.

You can also temporarily duplicate the column: one as you have it, with floats; the other experimental, with ints or strings. Then insert asserts in every reasonable place checking that the two are in sync. After enough testing you can let go of the floats.
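The duplicate-column idea can be sketched like this (the column names are hypothetical, and index alignment after dropna() leaves NaN in the missing rows):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1.0, 2.0, np.nan]})

# experimental string twin alongside the float original;
# rows dropped by dropna() come back as NaN via index alignment
df["col_str"] = df["col"].dropna().apply(lambda x: str(int(x)))

# sanity check: the two columns agree wherever the value is present
mask = df["col"].notna()
assert (df.loc[mask, "col"].astype(int).astype(str) == df.loc[mask, "col_str"]).all()
```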

Seedy answered 8/12, 2014 at 23:40 Comment(0)

This is not a solution for all cases, but in mine (genomic coordinates) I've resorted to using 0 as NaN:

a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)

This at least allows the proper 'native' column type to be used, and operations like subtraction and comparison work as expected.

Psychiatry answered 12/1, 2018 at 13:8 Comment(0)

Pandas v0.24+

Functionality to support NaN in integer series will be available in v0.24 upwards. There's information on this in the v0.24 "What's New" section, and more details under Nullable Integer Data Type.

Pandas v0.23 and earlier

In general, it's best to work with float series where possible, even when the series is upcast from int to float due to inclusion of NaN values. This enables vectorised NumPy-based calculations where Python-level loops would otherwise be required.

The docs do suggest: "One possibility is to use dtype=object arrays instead." For example:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan])

print(s.astype(object))

0      1
1      2
2      3
3    NaN
dtype: object

For cosmetic reasons, e.g. output to a file, this may be preferable.

Pandas v0.23 and earlier: background

NaN is considered a float. The docs currently (as of v0.23) state the reason why integer series are upcast to float:

In the absence of high performance NA support being built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays.

This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be “numeric”.

The docs also provide rules for upcasting due to NaN inclusion:

Typeclass   Promotion dtype for storing NAs
floating    no change
object      no change
integer     cast to float64
boolean     cast to object
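These promotion rules can be observed directly by introducing a missing value into each kind of series (a small sketch):

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
bools = pd.Series([True, False])

# reindexing past the end introduces NaN, triggering the promotions above
print(ints.reindex(range(4)).dtype)   # float64
print(bools.reindex(range(3)).dtype)  # object
```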
Bennet answered 19/12, 2018 at 14:31 Comment(0)

New for Pandas v1.00 +

For the nullable integer arrays, the missing value is no longer numpy.nan; now you have pandas.NA.

Please read: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

IntegerArray is currently experimental. Its API or implementation may change without warning.

Changed in version 1.0.0: Now uses pandas.NA as the missing value rather than numpy.nan.

In Working with missing data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented as floating point numbers.
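A small sketch with pandas >= 1.0, showing that the missing value in a nullable integer series is pd.NA rather than np.nan:

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)        # Int64
print(s[2] is pd.NA)  # True
```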

Westernmost answered 26/4, 2021 at 16:35 Comment(0)

If there are blanks in the text data, columns that would normally be integers will be cast to float64, because the int64 dtype cannot hold nulls. This can produce an inconsistent schema if you load multiple files, some with blanks (those columns end up float64) and some without (those end up int64).

This code attempts to convert any numeric columns to Int64 (as opposed to int64), since Int64 can handle nulls:

import pandas as pd
import numpy as np

#show datatypes before transformation
mydf.dtypes

for c in mydf.select_dtypes(np.number).columns:
    try:
        mydf[c] = mydf[c].astype('Int64')
        print('cast {} as Int64'.format(c))
    except (TypeError, ValueError):
        print('could not cast {} to Int64'.format(c))

#show datatypes after transformation
mydf.dtypes
Scrawny answered 17/6, 2020 at 14:33 Comment(0)

This is now possible, since pandas v0.24.0.

From the pandas 0.24.x release notes: "Pandas has gained the ability to hold integer dtypes with missing values."

Controversy answered 25/1, 2019 at 17:8 Comment(1)
What does this add that techvslife's answer doesn't? Please don't post duplicate answers.Wiltz

I know that OP has asked for NumPy or Pandas only, but I think it is worth mentioning polars as an alternative that supports the requested feature.

In Polars, any missing values in an integer column are simply null values, and the column remains an integer column.

See Polars - User Guide > Coming from Pandas for more info.

Astern answered 18/8, 2022 at 14:0 Comment(0)
