How to do "(df1 & not df2)" dataframe merge in pandas?

Asked 20/9, 2015 at 5:10 Answered 12/5, 2022 at 11:50

Solved python join pandas merge dataframe

I have 2 pandas dataframes df1 & df2 with common columns/keys (x,y).

I want to merge do a "(df1 & not df2)" kind of merge on keys (x,y), meaning I want my code to return a dataframe containing rows with (x,y) only in df1 & not in df2.

SAS has an equivalent functionality

data final;
merge df1(in=a) df2(in=b);
by x y;
if a & not b;
run;

Who to replicate the same functionality in pandas elegantly? It would have been great if we can specify how="left-right" in merge().

Tobacconist answered 20/9, 2015 at 5:10 Comment(0)

I just upgraded to version 0.17.0 RC1 which was released 10 days ago. Just found out that pd.merge() have new argument in this new release called indicator=True to acheive this in pandonic way!!

df=pd.merge(df1,df2,on=['x','y'],how="outer",indicator=True)
df=df[df['_merge']=='left_only']

indicator: Add a column to the output DataFrame called _merge with information on the source of each row. _merge is Categorical-type and takes on a value of left_only for observations whose merge key only appears in 'left' DataFrame, right_only for observations whose merge key only appears in 'right' DataFrame, and both if the observation’s merge key is found in both.

http://pandas-docs.github.io/pandas-docs-travis/merging.html#database-style-dataframe-joining-merging

Tobacconist answered 20/9, 2015 at 5:32 Comment(3)

The correct way to filter a dataframe is as follows, you need to correct the second line of your code: df=df[df['_merge']=='left_only'] – Honeysuckle 1/6, 2016 at 11:20

Thanks for pointing out the typo. I have fixed it as per your comment. – Tobacconist 1/6, 2016 at 12:6

I found this link to be extremely helpful for this and other scenarios! kanoki.org/2019/07/04/pandas-difference-between-two-dataframes – Babyblueeyes 7/8, 2020 at 14:15

Another way to do this is to use the index.

If i1 and i2 are indices (sets of indices) then i1.difference(i2) represent those indices in i1 and not in i2. Then if df is a dataframe indexed by the same index type , for instance i1=df.index then pd.DataFrame(index=i1.difference(i2)).join(df) are those entries in df whose index is not in the index i2.

Bobinette answered 12/5, 2022 at 11:50 Comment(0)

Recommended topics

Hot tags