Pandas FutureWarning about concatenating DFs with NaN-only cols seems wrong
I am getting a FutureWarning with pandas 2.2.2 when I try to concatenate DataFrames containing float values and Nones.

But the same does not happen if I use int instead of float:

import pandas as pd

# Block with INT
df1 = pd.DataFrame({'A': [1], 'B': [4]})
df2 = pd.DataFrame({'A': [2], 'B': [None]})
print(len(pd.concat([df1, df2])))

# Block with FLOAT
df1 = pd.DataFrame({'A': [1], 'B': [4.0]})
df2 = pd.DataFrame({'A': [2], 'B': [None]})
print(len(pd.concat([df1, df2])))

The int block runs fine, while the float block raises the warning.

Sample output

2
./test.py:18: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  print(len(pd.concat([df1, df2])))
2
  1. Why are ints fine while floats are not?
  2. The dtype of the None column seems to be inferred from the other DataFrames in the concatenation (see comments).
  3. If an entire DataFrame is empty or None, it can be checked and skipped before concatenation, so the warning makes sense there. But when some columns hold valid data, it is not clear what the user is expected to do in the future with the NaN-only columns.
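To see what the warning is actually about, inspect the result dtypes rather than the length. A minimal sketch (the exact dtypes are version-dependent; the comments describe what pandas 2.2.x produces, per the discussion below):

```python
import pandas as pd

# INT case: df2's all-None column 'B' has object dtype; pandas keeps the
# result as object, so nothing changes in a future version and no warning fires.
df1 = pd.DataFrame({'A': [1], 'B': [4]})
df2 = pd.DataFrame({'A': [2], 'B': [None]})
print(pd.concat([df1, df2]).dtypes['B'])

# FLOAT case: under the current (deprecated) behavior, the all-NA object
# column is excluded when determining the result dtype, so 'B' becomes
# float64 with NaN -- and pandas warns that this will change.
df1 = pd.DataFrame({'A': [1], 'B': [4.0]})
df2 = pd.DataFrame({'A': [2], 'B': [None]})
print(pd.concat([df1, df2]).dtypes['B'])
```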

PS: I don't think this is a duplicate of this: Alternative to .concat() of empty dataframe, now that it is being deprecated?

Viscera answered 6/9, 2024 at 12:54 Comment(5)
Interestingly, in the first case, pandas is doing exactly what it is warning about in the second case: it considers the dtypes of the empty column 'B' in the second dataframe. None is treated as NaN - a float64 value; concat of int64 and float64 leads to a float64 column. So the resulting dataframe in the first example will have column 'B' as float64 - which may be unexpected for some (me included).Tedi
The question is a bit unclear. There's no bias or expectation when a library warns you that it is going to change how things work in a future version; it is telling you what to expect in advance. It seems the real question here is why ints and floats behave differently, and why None's behavior is determined by the other data. First, None has no type, so it can't be used to determine the type of B. Second, ints have no NaN or Inf; those only exist in floats, implemented in the CPU itself and part of the IEEE-754 standard.Danita
Check the actual contents and dtypes of the concatenated dataframes, not the length. You'll see that in the first case B is an object column, while in the second case it's float64 with an actual NaN float value.Danita
So you are saying NaN is inferred for None in df2 in the second case, due to the presence of float in df1's B column? While in the first case it's not NaN, so the warning did not trigger?Viscera
In any case, why should it be an issue at all? If my entire DF is empty or None, I understand and can check for it before concatenating, but when a few cols have data, why give a warning about the other empty/NaN-only cols?Viscera

This topic was discussed in pandas issue 45637 (see also the release notes for 1.4.3). I agree with the OP that this is an inconsistency. As outlined in the issue, there is also a difference in behaviour between Series and DataFrame concatenation, i.e.

import pandas as pd
print(pd.__version__)  # 2.2.2

df1 = pd.DataFrame({'A': [1], 'B': [4.]})
df2 = pd.DataFrame({'A': [2], 'B': [None]})

pd.concat([df1['B'], df2['B']]).dtypes  # dtype('O') + no warning
pd.concat([df1, df2]).dtypes['B'] # dtype('float64') + warning

Solution 1

Use pandas functionality to exclude empty columns.

df = pd.concat([i.dropna(axis=1, how='all') for i in [df1, df2]])
print(df.dtypes, '\n'*2, df)
# A      int64
# B    float64
# dtype: object 
# 
#     A    B
# 0  1  4.0
# 0  2  NaN

Solution 2

Detect the empty object columns before the concat. This is slower than solution 1, but acts only on object columns and leaves the other dtypes untouched.

import pandas as pd

def drop_empty_object_columns(df):
    df_objects = df.select_dtypes(object)
    return df[df.columns.difference(df_objects.columns[df_objects.isna().all()])]

df = pd.concat([drop_empty_object_columns(i) for i in [df1, df2]])
print(df.dtypes, '\n'*2, df)
# prints the same as solution 1

Solution 3 (slow, see below)

First convert all columns to object dtype and then let pandas infer the most appropriate dtype afterward.

df = pd.concat([i.astype(object) for i in [df1, df2]]).infer_objects()
print(df.dtypes, '\n'*2, df)
# prints the same as solution 1

Solution 4 - ignoring the warning (slow, see below)

Alternatively, you can safely ignore deprecation warnings if you pin the pandas version in your project's requirements file. To disable this specific warning globally, use

import warnings
warnings.filterwarnings(
    "ignore",
    category=FutureWarning,
    # The message is matched as a regex against the start of the warning text
    message="The behavior of DataFrame concatenation with empty or all-NA entries is deprecated",
)

and locally:

with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        category=FutureWarning,
        # The message is matched as a regex against the start of the warning text
        message="The behavior of DataFrame concatenation with empty or all-NA entries is deprecated",
    )
    df = pd.concat([df1, df2])

Speed comparison:

Solutions 1 and 2 in the following example are over 20 and 10 times faster, respectively, than ignoring the warnings.

n = int(1e7)
df1 = pd.DataFrame({'A': [1]*n, 'B': [4.]*n})
df2 = pd.DataFrame({'A': [2]*n, 'B': [None]*n})

%timeit -n 3 -r 3 pd.concat([i.dropna(axis=1, how='all') for i in [df1, df2]]) # Solution 1
%timeit -n 3 -r 3 pd.concat([drop_empty_object_columns(i) for i in [df1, df2]]) # Solution 2
%timeit -n 3 -r 3 pd.concat([i.astype(object) for i in [df1, df2]]).infer_objects() # Solution 3
%timeit -n 3 -r 3 pd.concat([df1, df2]) # Solution 4
Solution  Code                           Time* [ms]
1         i.dropna(axis=1, how='all')      109 ± 2
2         drop_empty_object_columns(i)     200 ± 2
3         .astype(object)                 4580 ± 30
4         default pd.concat               2720 ± 9

*: per loop (mean ± std. dev. of 3 runs, 3 loops each)
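Outside IPython the %timeit magic is not available; a rough equivalent with the standard-library timeit module looks like the following sketch (with a smaller n than above so it runs quickly - absolute times will differ from the table):

```python
import timeit

import pandas as pd

n = int(1e5)  # smaller than the 1e7 above so this sketch runs quickly
df1 = pd.DataFrame({'A': [1] * n, 'B': [4.0] * n})
df2 = pd.DataFrame({'A': [2] * n, 'B': [None] * n})

# Time solution 1: drop all-NA columns before concatenating
per_loop = timeit.timeit(
    lambda: pd.concat([i.dropna(axis=1, how='all') for i in [df1, df2]]),
    number=3,
) / 3
print(f"Solution 1: {per_loop * 1e3:.1f} ms per loop")
```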

Flowing answered 12/9, 2024 at 10:35 Comment(8)
Seems a bit overkill. Is it possible to catch and suppress this specific warning? Currently, I am using a with block to ignore this particular warning, but is there a way to ignore it globally without losing other future warnings?Viscera
Definitely not overkill in terms of lines of code ;). But of course, you can also do it that way. I added it to my answer, together with a maybe-not-overkill solution that works if the functionality really gets deprecated.Flowing
You can also ignore it with a with block; that might help localize the workaround. Do you want to add that to the answer too?Viscera
Added it. However, I would suggest using the proposed solution that drops empty object columns.Flowing
That seems too costly to me, with huge DFs, just to avoid a warning.Viscera
Detecting empty object columns is cheap; allocating memory for empty object columns and inferring the dtype afterwards (which is what pd.concat does in the background) is expensive.Flowing
ok will check it thank youViscera
Did it meet your testing criteria? If so, please accept the answer to help future readers.Flowing

© 2022 - 2025 — McMap. All rights reserved.