Why does pd.concat change the resulting datatype from int to float?
I have three dataframes: timestamp (with timestamps), dataSun (with timestamps of sunrise and sunset), dataData (with different climate data). Dataframe timestamp has datatype "int64".

timestamp.head()
       timestamp
0  1521681600000
1  1521681900000
2  1521682200000
3  1521682500000
4  1521682800000

Dataframe dataSun also has datatype "int64".

 dataSun.head()
         sunrise         sunset
0  1521696105000  1521740761000
1  1521696105000  1521740761000
2  1521696105000  1521740761000
3  1521696105000  1521740761000
4  1521696105000  1521740761000

Dataframe with climate data dataData has datatype "float64".

dataData.head()
           temperature     pressure  humidity
    0     2.490000  1018.000000      99.0
    1     2.408333  1017.833333      99.0
    2     2.326667  1017.666667      99.0
    3     2.245000  1017.500000      99.0
    4     2.163333  1017.333333      99.0
    5     2.081667  1017.166667      99.0

I want to concatenate these three dataframes in one.

dataResult = pd.concat((timestamp, dataSun, dataData), axis = 1)
dataResult.head()
       timestamp       sunrise        sunset  temperature     pressure     
0  1521681600000  1.521696e+12  1.521741e+12     2.490000  1018.000000   
1  1521681900000  1.521696e+12  1.521741e+12     2.408333  1017.833333   
2  1521682200000  1.521696e+12  1.521741e+12     2.326667  1017.666667   
3  1521682500000  1.521696e+12  1.521741e+12     2.245000  1017.500000   
4  1521682800000  1.521696e+12  1.521741e+12     2.163333  1017.333333   
5  1521683100000  1.521696e+12  1.521741e+12     2.081667  1017.166667   

dataResult.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7188 entries, 0 to 7187
Data columns (total 6 columns):
timestamp      7188 non-null int64
sunrise        7176 non-null float64
sunset         7176 non-null float64
temperature    7176 non-null float64
pressure       7176 non-null float64
humidity       7176 non-null float64
dtypes: float64(5), int64(1)

Why has pd.concat changed the datatype of the values in dataSun? I have tried different ways of concatenating the dataframes. For example, I concatenated only timestamp and dataSun into one dataframe, then concatenated the result with dataData, but the outcome was the same. How can I concatenate the three dataframes and preserve the datatypes?

Millford answered 20/4, 2018 at 11:20 Comment(0)

Because of this -

timestamp      7188 non-null int64
sunrise        7176 non-null float64
...

timestamp has 7188 non-null values, while sunrise and the columns after it have only 7176, so 12 values in those columns are null... meaning they're NaNs.

Since NaNs are of dtype=float, every other value in that column is automatically upcasted to float, and float numbers that big are usually represented in scientific notation.
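This upcasting is easy to reproduce in isolation; a minimal sketch (the values are taken from the question, the variable names are illustrative):

```python
import pandas as pd
import numpy as np

# An int64 Series with no missing values keeps its dtype.
s = pd.Series([1521696105000, 1521740761000])
print(s.dtype)   # int64

# Reindexing past the end introduces a NaN, which forces float64.
s2 = s.reindex([0, 1, 2])
print(s2.dtype)  # float64
```

The same alignment happens implicitly inside pd.concat whenever one frame's index has labels the other lacks.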

That's the why, but it doesn't really solve your problem. Your options at this point are:

  1. drop the rows with NaNs using dropna
  2. fill the NaNs with some default integral value using fillna (you may then downcast those columns back to int)
  3. perform pd.concat with join='inner', so that no NaNs are introduced and the dtypes are preserved:

    pd.concat((timestamp, dataSun, dataData), axis=1, join='inner')
    
           timestamp        sunrise         sunset  temperature     pressure  \    
    0  1521681600000  1521696105000  1521740761000     2.490000  1018.000000   
    1  1521681900000  1521696105000  1521740761000     2.408333  1017.833333   
    2  1521682200000  1521696105000  1521740761000     2.326667  1017.666667   
    3  1521682500000  1521696105000  1521740761000     2.245000  1017.500000   
    4  1521682800000  1521696105000  1521740761000     2.163333  1017.333333   
    
       humidity  
    0      99.0  
    1      99.0  
    2      99.0  
    3      99.0  
    4      99.0 
    

With option 3, an inner join is performed on the indexes of each dataframe.

Iatrogenic answered 20/4, 2018 at 11:26 Comment(1)
You are absolutely right. I didn't check the code carefully. I have already corrected it and it works right. Thanks a lot! – Millford

As of pandas 1.0.0, I believe you have another option: first use convert_dtypes. This converts the dataframe columns to dtypes that support pd.NA, avoiding the NaN issues discussed in the answer above.
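A minimal sketch of how this works (column name and values are illustrative, mirroring the question's data):

```python
import pandas as pd

df = pd.DataFrame({'sunrise': [1521696105000, 1521740761000]})
df = df.convert_dtypes()     # int64 -> nullable Int64
print(df['sunrise'].dtype)   # Int64

# A reindex that introduces a missing value now yields pd.NA,
# and the column stays Int64 instead of being upcast to float64.
df2 = df.reindex([0, 1, 2])
print(df2['sunrise'].dtype)  # Int64
```

Because the missing value is pd.NA rather than a float NaN, the large timestamps keep their exact integer representation.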

Colloquium answered 22/3, 2020 at 17:12 Comment(0)

The current answers don't really deal with how to avoid the problem of ints turning into floats when doing a concat.

When you join/merge/concat two dataframes A and B

  • where A has an index value that is not in B
  • and B has a column of integers

then the result will have missing values in B's columns for the rows of A that are not in B. Since integer columns cannot hold missing values, they are automatically converted to floats.

You can avoid the missing values by an inner join, but that will then not have all the rows.

The suggestion of first doing a join and then filling the missing values with some appropriate integer (e.g. 0) and finally converting the float columns back to integer has a problem. Integers (specifically int64 integers) can store values as large as 2**63 - 1:

9,223,372,036,854,775,807 = 2**63 - 1
    9,007,199,254,740,992 = 2**53

But once numbers get larger than 2**53, some integers cannot be represented as float64 exactly, so the round trip through float can alter the last few digits.
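This precision loss is easy to verify: float64 has a 53-bit significand, so 2**53 + 1 is the first integer that cannot survive a round trip.

```python
# Above 2**53, consecutive integers collide when stored as float64.
n = 2**53 + 1          # 9007199254740993
print(float(n) == n)   # False: rounds to 9007199254740992
print(int(float(n)))   # 9007199254740992

# At or below 2**53 the round trip is exact.
m = 2**53
print(float(m) == m)   # True
```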

To avoid the conversion and reconversion, a solution is to make the indices of both dataframes identical. If you are doing a left join, this is as simple as

pd.concat([A, B.reindex(A.index, fill_value=0)], axis=1)

If you want an outer join, the solution is

idx = A.index.join(B.index, how='outer')
pd.concat([A.reindex(idx, fill_value=0), B.reindex(idx, fill_value=0)], axis=1)

For an inner join, or a right join, there should be no problems in doing a simple concat, because no new index values will be created.
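Putting the left-join case together as a runnable sketch (A and B here are small illustrative frames standing in for the question's dataframes):

```python
import pandas as pd

A = pd.DataFrame({'timestamp': [1521681600000, 1521681900000, 1521682200000]})
B = pd.DataFrame({'sunrise': [1521696105000, 1521740761000]})  # one row short

# Naive column-wise concat aligns on the union of indexes,
# introduces a NaN, and so upcasts sunrise to float64.
naive = pd.concat([A, B], axis=1)
print(naive['sunrise'].dtype)  # float64

# Reindexing B to A's index with an integer fill keeps int64.
left = pd.concat([A, B.reindex(A.index, fill_value=0)], axis=1)
print(left['sunrise'].dtype)   # int64
```

Because both frames share an identical index before the concat, no alignment NaNs are created and the int64 timestamps pass through untouched.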

Sphygmograph answered 20/7 at 6:32 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.