pandas: merge (join) two data frames on multiple columns
Asked Answered
P

6

367

I am trying to join two pandas dataframes using two columns:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

but got the following error:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4164)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13166)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13120)()

KeyError: '[B_1, c2]'

Any idea what should be the right way to do this?

Proliferate answered 23/1, 2017 at 20:32 Comment(1)
left_on and right_on should be a list of strings, not a string that looks like a list.Scientific
O
615

Try this

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

left_on : label or list, or array-like Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns

right_on : label or list, or array-like Field names to join on in right DataFrame or vector/list of vectors per left_on docs

Oberheim answered 23/1, 2017 at 20:34 Comment(0)
H
23
  1. It merges according to the ordering of left_on and right_on, i.e., the i-th element of left_on will match with the i-th of right_on.

    In the example below, the code on the top matches A_col1 with B_col1 and A_col2 with B_col2, while the code on the bottom matches A_col1 with B_col2 and A_col2 with B_col1. Evidently, the results are different.

    res1

  2. As can be seen from the above example, if the merge keys have different names, all keys will show up as their individual columns in the merged dataframe. In the example above, in the top dataframe, A_col1 and B_col1 are identical and A_col2 and B_col2 are identical. In the bottom dataframe, A_col1 and B_col2 are identical and A_col2 and B_col1 are identical. Since these are duplicate columns, they are most likely not needed. One way to not have this problem from the beginning is to make the merge keys identical from the beginning. See bullet point #3 below.

  3. If left_on and right_on are the same col1 and col2, we can use on=['col1', 'col2']. In this case, no merge keys are duplicated.

    df1.merge(df2, on=['col1', 'col2'])
    

    res3

  4. You can also merge one side on column names and the other side on index too. For example, in the example below, df1's columns are matched with df2's indices. If the indices are named, as in the example below, you can reference them by name but if not, you can also use right_index=True (or left_index=True if the left dataframe is the one being merged on index).

    df1.merge(df2, left_on=['A_col1', 'A_col2'], right_index=True)
    # or
    df1.merge(df2, left_on=['A_col1', 'A_col2'], right_on=['B_col1', 'B_col2'])
    

    res3

  5. By using the how= parameter, you can perform LEFT JOIN (how='left'), FULL OUTER JOIN (how='outer') and RIGHT JOIN (how='right') as well. The default is INNER JOIN (how='inner') as in the examples above.

  6. If you have more than 2 dataframes to merge and the merge keys are the same across all of them, then join method is more efficient than merge because you can pass a list of dataframes and join on indices. Note that the index names are the same across all dataframes in the example below (col1 and col2). Note that the indices don't have to have names; if the indices don't have names, then the number of the multi-indices must match (in the case below there are 2 multi-indices). Again, as in bullet point #1, the match occurs according to the ordering of the indices.

    df1.join([df2, df3], how='inner').reset_index()
    

    res4

Hiawatha answered 20/1, 2023 at 22:53 Comment(0)
P
16

Another way of doing this:

new_df = A_df.merge(B_df, left_on=['A_c1','c2'], right_on = ['B_c1','c2'], how='left')
Photochromy answered 11/9, 2019 at 22:41 Comment(0)
A
15

you can use below which is short and simple to understand:

merged_data= df1.merge(df2, on=["column1","column2"])
Algy answered 3/2, 2022 at 16:49 Comment(0)
H
13

the problem here is that by using the apostrophes you are setting the value being passed to be a string, when in fact, as @Shijo stated from the documentation, the function is expecting a label or list, but not a string! If the list contains each of the name of the columns beings passed for both the left and right dataframe, then each column-name must individually be within apostrophes. With what has been stated, we can understand why this is inccorect:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

And this is the correct way of using the function:

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])
Hokum answered 5/9, 2019 at 18:55 Comment(0)
B
-1

this work for me , for n files xls

# all_reports_paths contain one array with all paths per files
for a in all_reports_paths:
    
    df.append( pd.read_excel(a,skiprows=X,skipfooter=X))

df_glob = pd.DataFrame(columns=columns)

for dataframe in df:

    df_glob = pd.concat([df_glob,pd.DataFrame(dataframe)],axis=0)

# finally df_glob contain all data
Begat answered 4/11, 2022 at 21:58 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Trailblazer

© 2022 - 2024 — McMap. All rights reserved.