This topic was discussed in pandas issue 45637 (see also the release notes for 1.4.3). I agree with the OP that this is an inconsistency. As outlined in the issue, the behaviour also differs between Series and DataFrame concatenation, i.e.
import pandas as pd
print(pd.__version__) # 2.2.2
df1 = pd.DataFrame({'A': [1], 'B': [4.]})
df2 = pd.DataFrame({'A': [2], 'B': [None]})
pd.concat([df1['B'], df2['B']]).dtypes # dtype('O') + no warning
pd.concat([df1, df2]).dtypes['B'] # dtype('float64') + warning
Solution 1
Use pandas functionality to exclude empty columns.
df = pd.concat([i.dropna(axis=1, how='all') for i in [df1, df2]])
print(df.dtypes, '\n'*2, df)
# A int64
# B float64
# dtype: object
#
# A B
# 0 1 4.0
# 0 2 NaN
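To see why this works: `dropna(axis=1, how='all')` removes the all-NA column `B` from `df2` entirely, and `concat` then realigns on the union of columns, reintroducing `B` as `NaN` with the `float64` dtype taken from `df1`. A minimal sketch:

```python
import pandas as pd

df2 = pd.DataFrame({'A': [2], 'B': [None]})

# The all-NA column B is dropped entirely before the concat,
# so it never participates in dtype resolution.
print(df2.dropna(axis=1, how='all').columns.tolist())  # ['A']
```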
Solution 2
Detect the empty object columns before the merge. This is slower than solution 1, but it acts only on object columns and leaves the other dtypes untouched.
import pandas as pd
def drop_empty_object_columns(df):
    df_objects = df.select_dtypes(object)
    return df[df.columns.difference(df_objects.columns[df_objects.isna().all()])]
df = pd.concat([drop_empty_object_columns(i) for i in [df1, df2]])
print(df.dtypes, '\n'*2, df)
# prints the same as solution 1
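Broken down step by step (a sketch of what the helper does on `df2`): an all-`None` column gets `object` dtype by default, `select_dtypes(object)` picks it up, and `isna().all()` flags it for removal.

```python
import pandas as pd

df2 = pd.DataFrame({'A': [2], 'B': [None]})
print(df2.dtypes['B'])  # object (an all-None column defaults to object dtype)

df_objects = df2.select_dtypes(object)                # only column B is selected
empty = df_objects.columns[df_objects.isna().all()]   # columns that are entirely NA
print(empty.tolist())   # ['B'] -- this column is dropped before the concat
```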
Solution 3 (slow, see below)
First convert all columns to object dtype, then let pandas infer the most appropriate dtype for the combined result.
df = pd.concat([i.astype(object) for i in [df1, df2]]).infer_objects()
print(df.dtypes, '\n'*2, df)
# prints the same as solution 1
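The round trip through `object` works because `infer_objects()` performs a soft conversion of object columns back to the best-fitting dtype once the frames are combined. A minimal sketch:

```python
import pandas as pd

# A float column cast to object, then inferred back
df = pd.DataFrame({'B': [4., None]}).astype(object)
print(df.dtypes['B'])                  # object
print(df.infer_objects().dtypes['B'])  # float64
```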
Solution 4 - ignoring the warning (slow, see below)
Alternatively, you can ignore the deprecation warning. This is reasonably safe if you pin the pandas version in a requirements file, so an upgrade cannot silently change the behaviour. To disable the specific warning globally, use
import warnings
warnings.filterwarnings(
    "ignore",
    category=FutureWarning,
    # You can use a regex to match the specific warning message
    message="The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. *"
)
and locally:
with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        category=FutureWarning,
        # You can use a regex to match the specific warning message
        message="The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. *"
    )
    df = pd.concat([df1, df2])
Speed comparison:
In the following example, solutions 1 and 2 are over 20 times and 10 times faster, respectively, than ignoring the warning.
n = int(1e7)
df1 = pd.DataFrame({'A': [1]*n, 'B': [4.]*n})
df2 = pd.DataFrame({'A': [2]*n, 'B': [None]*n})
%timeit -n 3 -r 3 pd.concat([i.dropna(axis=1, how='all') for i in [df1, df2]]) # Solution 1
%timeit -n 3 -r 3 pd.concat([drop_empty_object_columns(i) for i in [df1, df2]]) # Solution 2
%timeit -n 3 -r 3 pd.concat([i.astype(object) for i in [df1, df2]]).infer_objects() # Solution 3
%timeit -n 3 -r 3 pd.concat([df1, df2]) # Solution 4
Solution | Code | Time* [ms]
--- | --- | ---
1 | `i.dropna(axis=1, how='all')` | 109 ± 2
2 | `drop_empty_object_columns(i)` | 200 ± 2
3 | `.astype(object)` | 4580 ± 30
4 | default `pd.concat` | 2720 ± 9
*: per loop (mean ± std. dev. of 3 runs, 3 loops each)