The main difference between merge & concat is that merge allow you to perform more structured "join" of tables where use of concat is more broad and less structured.
Merge
Referring the documentation, pd.DataFrame.merge
takes right as a required argument, which you can think it as joining left table and right table according to some pre-defined structured join operation. Note the definition for parameter right.
Required Parameters
- right: DataFrame or named Series
Optional Parameters
- how: {‘left’, ‘right’, ‘outer’, ‘inner’} default ‘inner’
- on: label or list
- left_on: label or list, or array-like
- right_on: label or list, or array-like
- left_index: bool, default False
- right_index: bool, default False
- sort: bool, default False
- suffixes: tuple of (str, str), default (‘_x’, ‘_y’)
- copy: bool, default True
- indicator: bool or str, default False
- validate: str, optional
Important: pd.DataFrame.merge
requires right to be a pd.DataFrame
or named pd.Series
object.
Output
Furthermore, if we check the docstring for Merge Operation on pandas is below:
Perform a database (SQL) merge operation between two DataFrame or Series
objects using either columns as keys or their row indexes
Concat
Refer to documentation of pd.concat
, first note that the parameter is not named any of table, data_frame, series, matrix, etc., but objs instead. That is, you can pass many "data containers", which are defined as:
Iterable[FrameOrSeriesUnion], Mapping[Optional[Hashable], FrameOrSeriesUnion]
Required Parameters
- objs: a sequence or mapping of Series or DataFrame objects
Optional Parameters
- axis: {0/’index’, 1/’columns’}, default 0
- join: {‘inner’, ‘outer’}, default ‘outer’
- ignore_index: bool, default False
- keys: sequence, default None
- levels: list of sequences, default None
- names: list, default None
- verify_integrity: bool, default False
- sort: bool, default False
- copy: bool, default True
Output
- Returns: object, type of objs
Example
Code
import pandas as pd
v1 = pd.Series([1, 5, 9, 13])
v2 = pd.Series([10, 100, 1000, 10000])
v3 = pd.Series([0, 1, 2, 3])
df_left = pd.DataFrame({
"v1": v1,
"v2": v2,
"v3": v3
})
df_right = pd.DataFrame({
"v4": [5, 5, 5, 5],
"v5": [3, 2, 1, 0]
})
df_concat = pd.concat([v1, v2, v3])
# Performing operations on default
merge_result = df_left.merge(df_right, left_index=True, right_index=True)
concat_result = pd.concat([df_left, df_right], sort=False)
print(merge_result)
print('='*20)
print(concat_result)
Code Output
v1 v2 v3 v4 v5
0 1 10 0 5 3
1 5 100 1 5 2
2 9 1000 2 5 1
3 13 10000 3 5 0
====================
v1 v2 v3 v4 v5
0 1.0 10.0 0.0 NaN NaN
1 5.0 100.0 1.0 NaN NaN
2 9.0 1000.0 2.0 NaN NaN
3 13.0 10000.0 3.0 NaN NaN
0 NaN NaN NaN 5.0 3.0
1 NaN NaN NaN 5.0 2.0
2 NaN NaN NaN 5.0 1.0
You can achieve, however, the first output (merge) with concat by changing the axis parameter
concat_result = pd.concat([df_left, df_right], sort=False, axis=1)
Observe the following behavior,
concat_result = pd.concat([df_left, df_right, df_left, df_right], sort=False)
outputs;
v1 v2 v3 v4 v5
0 1.0 10.0 0.0 NaN NaN
1 5.0 100.0 1.0 NaN NaN
2 9.0 1000.0 2.0 NaN NaN
3 13.0 10000.0 3.0 NaN NaN
0 NaN NaN NaN 5.0 3.0
1 NaN NaN NaN 5.0 2.0
2 NaN NaN NaN 5.0 1.0
3 NaN NaN NaN 5.0 0.0
0 1.0 10.0 0.0 NaN NaN
1 5.0 100.0 1.0 NaN NaN
2 9.0 1000.0 2.0 NaN NaN
3 13.0 10000.0 3.0 NaN NaN
0 NaN NaN NaN 5.0 3.0
1 NaN NaN NaN 5.0 2.0
2 NaN NaN NaN 5.0 1.0
3 NaN NaN NaN 5.0 0.0
, which you cannot perform a similar operation with merge, since it only allows a single DataFrame or named Series.
merge_result = df_left.merge([df_right, df_left, df_right], left_index=True, right_index=True)
outputs;
TypeError: Can only merge Series or DataFrame objects, a <class 'list'> was passed
Conclusion
As you may have notice already that input and outputs may be different between "merge" and "concat".
As I mentioned at the beginning, the very first (main) difference is that "merge" performs a more structured join with a set of restricted set of objects and parameters where as "concat" performs a less strict/broader join with a broader set of objects and parameters.
All in all, merge is less tolerant to changes/(the input) and "concat" is looser/less sensitive to changes/(the input). You can achieve "merge" by using "concat", but the reverse is not always true.
"Merge" operation uses Data Frame columns (or name of pd.Series
object) or row indices, and since it uses those entities only it performs horizontal merge of Data Frames or Series, and does not apply vertical operation as a result.
If you want to see more, you can deep dive in the source code a bit;
.merge()
and.join()
. – Rhombencephalon