Joining pandas DataFrames by Column names
Asked Answered
N

3

162

I have two DataFrames with the following column names:

frame_1:
event_id, date, time, county_ID

frame_2:
countyid, state

I would like to get a DataFrame with the following columns by left-joining on county_ID = countyid:

joined_dataframe:
event_id, date, time, county, state

I cannot figure out how to do it if the columns on which I want to join are not the index.

Nereid answered 4/12, 2013 at 12:35 Comment(0)
C
287

You can use the left_on and right_on options of pd.merge as follows:

pd.merge(frame_1, frame_2, left_on='county_ID', right_on='countyid')

Or equivalently with DataFrame.merge:

frame_1.merge(frame_2, left_on='county_ID', right_on='countyid')

I was not sure from the question if you only wanted to merge if the key was in the left hand DataFrame. If that is the case then the following will do that (the above will in effect do a many to many merge)

pd.merge(frame_1, frame_2, how='left', left_on='county_ID', right_on='countyid')

Or

frame_1.merge(frame_2, how='left', left_on='county_ID', right_on='countyid')
Chekiang answered 4/12, 2013 at 12:41 Comment(2)
To expand on this a little, if you want to specify an index on one side, you can use right_index=True.Cressi
@Woody What if frame_1 already has county_ID as index, and frame_2 already has countyid as index? I understand I can substitute frame_1.reset_index() for frame_1 (and same for frame_2) in your answer. But is there a more efficient way to join/merge without resetting index?Scalawag
O
6

you need to make county_ID as index for the right frame:

frame_2.join ( frame_1.set_index( [ 'county_ID' ], verify_integrity=True ),
               on=[ 'countyid' ], how='left' )

for your information, in pandas left join breaks when the right frame has non unique values on the joining column. see this bug.

so you need to verify integrity before joining by , verify_integrity=True

Oatis answered 4/12, 2013 at 12:41 Comment(0)
L
0

Merging using differently named columns duplicates columns; for example, after the call frame_1.merge(frame_2, how='left', left_on='county_ID', right_on='countyid'), both county_ID and countyid columns are created on the joined frame but they have exactly the same values for every row, so presumably only one of them is needed. To not have this problem from the beginning, rename the merge-key column be the same and merge on that column.

df1 = frame_1.rename(columns={'county_ID':'county'})
df2 = frame_2.rename(columns={'countyid':'county'})
joined_frame = df1.merge(df2, on='county', how='left')

res1

Also, if the second frame has only one new additional column (e.g. state) as in the OP, then you can map that column to frame_1 via the common column.

frame_1['state'] = frame_1['county_ID'].map(frame_2.set_index('countyid')['state'])

If county_ID and countyid were indices (not columns), then a straightforward join call does the job. By default, it performs left join.

joined_frame = frame_1.join(frame_2)

res2

One nice thing about join is that if you want to join multiple dataframes on index, then you can pass a list of dataframes and join efficiently (instead of multiple chained merge calls).

joined_frame = frame_1.join([frame_2, frame_3])
Luminance answered 13/2, 2023 at 23:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.