Pandas, astype(int) applied to float column returns negative numbers
C

4

12

My task is to read data from Excel into a DataFrame. The data is a bit messy, so to clean it up I did the following:

import re
import pandas as pd

df_1 = pd.read_excel(offers[0])
df_1 = df_1.rename(columns={'Наименование [Дата Файла: 29.05.2019 время: 10:29:42 ]':'good_name', 
                     'Штрихкод':'barcode', 
                     'Цена шт. руб.':'price',
                     'Остаток': 'balance'
                    })
df_1 = df_1[new_columns]
# I don't know why, but the code doesn't work unless NaN is replaced with some other character
df_1.barcode = df_1.barcode.fillna('_')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to numeric
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
df_1.head()

It returns the barcode column with dtype float64 (why?):

0    0.000000e+00
1    7.613037e+12
2    7.613037e+12
3    7.613034e+12
4    7.613035e+12
Name: barcode, dtype: float64

Then I try to convert that column to integer.

df_1.barcode = df_1.barcode.astype(int)

But I keep getting silly negative numbers.

df_1.barcode[0:5]
0             0
1   -2147483648
2   -2147483648
3   -2147483648
4   -2147483648
Name: barcode, dtype: int32

Thanks to @Will and @micric, I eventually got a solution.

df_1 = pd.read_excel(offers[0])
df_1 = df_1[new_columns]
# replace NaN with '0' first, so the column can then be converted to an integer dtype
df_1.barcode = df_1.barcode.fillna('0')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to integer
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer')

Summary:

  • NaN is a float value, so pd.to_numeric returns float64 for a column that contains NaN. A column with both NaN and non-NaN values should therefore be expected to have dtype float64.
  • Check the size of the numbers you're dealing with. int32 is limited to the range -2**31 to 2**31 - 1 (-2147483648 to 2147483647), and these barcodes are far larger.

Thanks a lot for your help, guys!
Corporate answered 31/5, 2019 at 8:10 Comment(2)
Could you tell me the output of print(pd.__version__)? - Pommard
pandas version is 0.20.3 - Corporate
S
4

Many questions in one.

So your expected dtype...

pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)

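pd.to_numeric with downcast='integer' would normally give you an integer dtype. However, you have NaNs in your data, and pandas needs float64 to represent NaN, so the column stays float64. The .fillna(0) runs only after the conversion, so it no longer affects the dtype.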
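A minimal sketch of the point above, assuming a small made-up Series in place of the Excel column (the values and variable names are hypothetical). It compares converting before and after the missing value is filled:

```python
import numpy as np
import pandas as pd

# Hypothetical digit strings with one missing value, standing in for the barcode column
raw = pd.Series(["7613037000001", np.nan, "7613034000002"], dtype=object)

# Converting first: NaN survives the conversion, so the result has to be float64
converted_first = pd.to_numeric(raw, downcast="integer")
print(converted_first.dtype)  # float64

# Filling first: only digit strings remain, so an integer dtype is possible
filled_first = pd.to_numeric(raw.fillna("0"), downcast="integer")
print(filled_first.dtype)  # int64
```

downcast='integer' picks the smallest integer type that holds every value, which for 13-digit barcodes is int64.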

Samos answered 31/5, 2019 at 8:28 Comment(1)
You're right, many thanks for your help. After editing the code so that it fills NaN first <df_1.barcode = df_1.barcode.fillna(1)> and then converts the column to numeric, it works as expected: I now get <barcode 9820 non-null int64> - Corporate
D
14

That number is the lower limit of a 32-bit signed integer. Your values are outside the int32 range you are trying to use, so the conversion overflows and you get that limit back (notice that 2**32 = 4294967296, and half of that, 2147483648, is the magnitude of your number: int32 spans -2147483648 to 2147483647).

You should use astype('int64') instead.
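A short sketch of the range check, assuming hypothetical barcode values of the same size as in the question. The int32 in the question's output suggests a platform where plain int maps to a 32-bit integer (as on Windows with NumPy before 2.0), so naming the width explicitly avoids depending on that:

```python
import numpy as np
import pandas as pd

# Hypothetical barcodes stored as float64, as in the question
barcode = pd.Series([0.0, 7613037000001.0, 7613034000002.0])

# Signed 32-bit integers span -2**31 .. 2**31 - 1
info32 = np.iinfo(np.int32)
print(info32.min, info32.max)  # -2147483648 2147483647

# The barcodes exceed that range, so int32 cannot hold them
print(barcode.max() > info32.max)  # True

# Asking for 64 bits explicitly gives the expected values
as_int64 = barcode.astype("int64")
print(as_int64.tolist())  # [0, 7613037000001, 7613034000002]
```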

Dagan answered 31/5, 2019 at 9:5 Comment(1)
It actually should be astype(np.int64), #43956835 - Picot
P
7

I ran into the same problem as OP, using

astype(np.int64)

solved mine, see the link here.

I like this solution because it's consistent with how I usually change the type of a pandas column. Maybe someone could check the performance of these solutions.

Picot answered 12/7, 2019 at 14:17 Comment(2)
Thanks a lot. This solved my problem. I was utterly confounded by this behavior as I was trying to convert an abnormally long ID field to integer. - Retainer
Please be warned that this solution did not work for me while the accepted answer did. I have no idea why astype(np.int64) changed 1 to 0 seemingly at random. - Regretful
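One possible cause of a digit changing, which the commenter did not confirm, is float64 precision rather than astype itself: a float64 cannot represent every integer above 2**53, so a long ID may already be rounded before the cast. A minimal sketch with a made-up ID:

```python
import numpy as np

# float64 has a 53-bit significand, so not every integer above 2**53 is representable
big_id = 2**53 + 1             # hypothetical long ID: 9007199254740993
as_float = np.float64(big_id)  # rounds to the nearest representable value
print(int(as_float))           # 9007199254740992, the last digit changed

# Casting to int64 afterwards cannot bring the lost digit back
restored = np.array([as_float]).astype(np.int64)[0]
print(restored == big_id)  # False
```

If this is what happened, reading the ID column as text (for example through the dtype argument of pd.read_excel) and converting the strings directly would avoid the float step.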
G
0

There is another way to achieve the correct solution using plain Python:

import numpy as np
# take a value larger than the int64 maximum (2**63 - 1)
a = np.array([1e22])
a.astype(np.int64)

This will give you array([-9223372036854775808]).

np.array(int(a[0])) # right value, but with dtype=object

The operation above gives you the correct integer value, but with dtype object. You can still treat it as a numeric array and apply NumPy functions such as np.mean to it.

Conclusion: Python integers can be larger than NumPy integers. Another way around this is to keep the numbers as floats. But always be careful when doing math with very large numbers!

>>> int(1e100)
10000000000000000159028911097599180468360808563945281389781327557747838772170381060813469985856815104
Geilich answered 5/6 at 13:16 Comment(0)
