Pandas FutureWarning about concatenating DFs with NaN-only cols seems wrong
I am getting a FutureWarning with pandas 2.2.2 when I try to concatenate DataFrames containing float values and Nones.

But the same does not happen if I use int instead of float:

import pandas as pd

# Block with INT
df1 = pd.DataFrame({'A': [1], 'B': [4]})
df2 = pd.DataFrame({'A': [2], 'B': [None]})
print(len(pd.concat([df1, df2])))

# Block with FLOAT
df1 = pd.DataFrame({'A': [1], 'B': [4.0]})
df2 = pd.DataFrame({'A': [2], 'B': [None]})
print(len(pd.concat([df1, df2])))

The int block runs fine, while the float block raises the warning.

Sample output

2
./test.py:18: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  print(len(pd.concat([df1, df2])))
2
  1. Why are ints fine while floats are not?
  2. The dtype of the None column seems to be inferred from the other DataFrames in the concatenation (see comments).
  3. If an entire DataFrame is empty or None, it can be checked and skipped before concatenation, so the warning makes sense there. But when some columns hold valid data, it is not clear what the user is expected to do in the future with the NaN-only columns.
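To see what the warning is actually about, inspect the result dtypes rather than the length. A minimal sketch (the exact dtypes are version-dependent; the comments describe what pandas 2.2.x produces, per the discussion below):

```python
import pandas as pd

# INT case: df2's all-None column 'B' has object dtype; pandas keeps the
# result as object, so nothing changes in a future version and no warning fires.
df1 = pd.DataFrame({'A': [1], 'B': [4]})
df2 = pd.DataFrame({'A': [2], 'B': [None]})
print(pd.concat([df1, df2]).dtypes['B'])

# FLOAT case: under the current (deprecated) behavior, the all-NA object
# column is excluded when determining the result dtype, so 'B' becomes
# float64 with NaN -- and pandas warns that this will change.
df1 = pd.DataFrame({'A': [1], 'B': [4.0]})
df2 = pd.DataFrame({'A': [2], 'B': [None]})
print(pd.concat([df1, df2]).dtypes['B'])
```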

PS: I don't think this is a duplicate of this: Alternative to .concat() of empty dataframe, now that it is being deprecated?

Viscera answered 6/9, 2024 at 12:54 Comment(5)
Interestingly, in the first case, pandas is doing exactly what it is warning about in the second case: it considers the dtypes of the empty column 'B' in the second dataframe. None is treated as NaN - a float64 value; concat of int64 and float64 leads to a float64 column. So the resulting dataframe in the first example will have column 'B' as float64 - which may be unexpected for some (me included).Tedi
The question is a bit unclear. There's no bias or expectation when a library warns you that it is going to change how things work in a future version; it is telling you what to expect in advance. It seems the real question here is why ints and floats behave differently, and why None's behavior is determined by the other data. First, None has no type, so it can't be used to determine the type of B. Second, ints have no NaN or Inf; those only exist in floats, implemented in the CPU itself and part of the IEEE-754 standard.Danita
Check the actual contents and dtypes of the concatenated dataframes, not the length. You'll see that in the first case B is an object column, while in the second case it's float64 with an actual NaN float value.Danita
So you are saying NaN is inferred for None in df2 in the second case, due to the presence of float in df1's B column? While in the first case it's not NaN, so the warning did not trigger?Viscera
In any case, why should it be an issue at all? If my entire DF is empty or None, I understand and can check for it before concatenating, but when a few cols have data, why give a warning about the other empty/NaN-only cols?Viscera

This topic was discussed in pandas issue 45637 (see also the release notes for 1.4.3). I agree with the OP that this is an inconsistency. As outlined in the issue, there is also a difference in behaviour between Series and DataFrame concatenation, i.e.

import pandas as pd
print(pd.__version__)  # 2.2.2

df1 = pd.DataFrame({'A': [1], 'B': [4.]})
df2 = pd.DataFrame({'A': [2], 'B': [None]})

pd.concat([df1['B'], df2['B']]).dtypes  # dtype('O') + no warning
pd.concat([df1, df2]).dtypes['B'] # dtype('float64') + warning

Solution 1

Use pandas functionality to exclude empty columns.

df = pd.concat([i.dropna(axis=1, how='all') for i in [df1, df2]])
print(df.dtypes, '\n'*2, df)
# A      int64
# B    float64
# dtype: object 
# 
#     A    B
# 0  1  4.0
# 0  2  NaN

Solution 2

Detect the empty object columns before the concat. This is slower than solution 1, but acts only on object columns and leaves the other dtypes untouched.

import pandas as pd

def drop_empty_object_columns(df):
    df_objects = df.select_dtypes(object)
    return df[df.columns.difference(df_objects.columns[df_objects.isna().all()])]

df = pd.concat([drop_empty_object_columns(i) for i in [df1, df2]])
print(df.dtypes, '\n'*2, df)
# prints the same as solution 1

Solution 3 (slow, see below)

First convert all columns to object dtype and then let pandas infer the most appropriate dtype afterward.

df = pd.concat([i.astype(object) for i in [df1, df2]]).infer_objects()
print(df.dtypes, '\n'*2, df)
# prints the same as solution 1

Solution 4 - ignoring the warning (slow, see below)

Alternatively, you can safely ignore deprecation warnings if you pin the pandas version in your project's requirements file. To disable this specific warning globally, use

import warnings
warnings.filterwarnings(
    "ignore",
    category=FutureWarning,
    # The message is matched as a regex against the start of the warning text
    message="The behavior of DataFrame concatenation with empty or all-NA entries is deprecated",
)

and locally:

with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        category=FutureWarning,
        # The message is matched as a regex against the start of the warning text
        message="The behavior of DataFrame concatenation with empty or all-NA entries is deprecated",
    )
    df = pd.concat([df1, df2])

Speed comparison:

Solutions 1 and 2 in the following example are over 20 and 10 times faster, respectively, than ignoring the warnings.

n = int(1e7)
df1 = pd.DataFrame({'A': [1]*n, 'B': [4.]*n})
df2 = pd.DataFrame({'A': [2]*n, 'B': [None]*n})

%timeit -n 3 -r 3 pd.concat([i.dropna(axis=1, how='all') for i in [df1, df2]]) # Solution 1
%timeit -n 3 -r 3 pd.concat([drop_empty_object_columns(i) for i in [df1, df2]]) # Solution 2
%timeit -n 3 -r 3 pd.concat([i.astype(object) for i in [df1, df2]]).infer_objects() # Solution 3
%timeit -n 3 -r 3 pd.concat([df1, df2]) # Solution 4
Solution  Code                           Time* [ms]
1         i.dropna(axis=1, how='all')      109 ± 2
2         drop_empty_object_columns(i)     200 ± 2
3         .astype(object)                 4580 ± 30
4         default pd.concat               2720 ± 9

*: per loop (mean ± std. dev. of 3 runs, 3 loops each)
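Outside IPython the %timeit magic is not available; a rough equivalent with the standard-library timeit module looks like the following sketch (with a smaller n than above so it runs quickly - absolute times will differ from the table):

```python
import timeit

import pandas as pd

n = int(1e5)  # smaller than the 1e7 above so this sketch runs quickly
df1 = pd.DataFrame({'A': [1] * n, 'B': [4.0] * n})
df2 = pd.DataFrame({'A': [2] * n, 'B': [None] * n})

# Time solution 1: drop all-NA columns before concatenating
per_loop = timeit.timeit(
    lambda: pd.concat([i.dropna(axis=1, how='all') for i in [df1, df2]]),
    number=3,
) / 3
print(f"Solution 1: {per_loop * 1e3:.1f} ms per loop")
```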

Flowing answered 12/9, 2024 at 10:35 Comment(8)
Seems a bit overkill. Is it possible to catch and suppress this specific warning? Currently, I am using a with block to ignore this particular warning, but is there a way to ignore it globally without losing other future warnings?Viscera
Definitely not overkill in terms of lines of code ;). But of course, you can also do it that way. I added it to my answer, together with a maybe-not-overkill solution that works if the functionality really gets deprecated.Flowing
You can also ignore it with a with block; that might help localize the workaround. Do you want to add that to the answer too?Viscera
Added it. However, I would suggest using the proposed solution that drops empty object columns.Flowing
That seems too costly to me, with huge DFs, just to avoid a warning.Viscera
Detecting empty object columns is cheap; allocating memory for empty object columns and inferring the dtype afterwards (which is what pd.concat does in the background) is expensive.Flowing
ok will check it thank youViscera
Did it meet your testing criteria? If so, please accept the answer to help future readers.Flowing

© 2022 - 2025 — McMap. All rights reserved.