How to remove a pandas dataframe from another dataframe
How to remove a pandas dataframe from another dataframe, just like set subtraction:

a=[1,2,3,4,5]
b=[1,5]
a-b=[2,3,4]

And now we have two pandas dataframes; how do we remove df2 from df1:

In [5]: df1=pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b'])
In [6]: df1
Out[6]:
   a  b
0  1  2
1  3  4
2  5  6


In [9]: df2=pd.DataFrame([[1,2],[5,6]],columns=['a','b'])
In [10]: df2
Out[10]:
   a  b
0  1  2
1  5  6

Then we expect the df1-df2 result to be:

In [14]: df
Out[14]:
   a  b
0  3  4

How to do it?

Thank you.

Watchword answered 19/5, 2016 at 3:54 Comment(2)
Possible duplicate of set difference for pandas – Wantage
@Watchword Please timeit our answers on your real datasets - it's interesting to me which is fastest – Shanghai

Solution

Use pd.concat followed by drop_duplicates(keep=False)

pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

The result looks like:

   a  b
1  3  4

Explanation

pd.concat adds the two DataFrames together by appending one right after the other. If there is any overlap, it will be captured by the drop_duplicates method. However, drop_duplicates by default keeps the first occurrence and removes every other one. In this case, we want every duplicate removed, hence the keep=False parameter, which does exactly that.

A special note on the repeated df2: with only one df2, any row in df2 that is not in df1 won't be considered a duplicate and will remain, so the single-df2 variant only works when df2 is a subset of df1. If we concat df2 twice, however, every one of its rows is guaranteed to be a duplicate and will subsequently be removed.
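To see why df2 has to be concatenated twice, here is a small sketch: it reuses the question's df1 but gives df2 an extra row that is absent from df1.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
# df2 deliberately contains a row (7, 8) that does not exist in df1
df2 = pd.DataFrame([[1, 2], [5, 6], [7, 8]], columns=['a', 'b'])

# With df2 concatenated once, (7, 8) appears a single time, survives
# drop_duplicates, and pollutes the result.
once = pd.concat([df1, df2]).drop_duplicates(keep=False)

# With df2 concatenated twice, every row of df2 is duplicated by
# construction, so all of them are dropped.
twice = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

print(once.values.tolist())   # [[3, 4], [7, 8]]
print(twice.values.tolist())  # [[3, 4]]
```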

Opportunist answered 19/5, 2016 at 4:27 Comment(8)
thx, it works, and we can use pd.concat([df1, df2]).drop_duplicates(keep=False) or df1.append(df2).drop_duplicates(keep=False) – Watchword
@Watchword hopefully this answers your question. If not, let me know what remains unanswered and I'll do my best to address it. – Opportunist
@Opportunist Your answer isn't correct - you made a symmetric difference, not a simple difference. – Shanghai
This does not work. If the df you concatenate has additional records that are NOT in the checked df, then they will be added to it... – Ratib
The primary dataframe is df1. I'm concatenating 3 dataframes, df1 once and df2 twice. Because df2 is concatenated twice, by definition, everything in it will be duplicated. Therefore, dropping duplicates will leave NOTHING that was in df2. – Opportunist
Further, the only issue this does not address is if there are existing duplicates in the initial dataframe. This assumes there are no duplicates in the initial dataframe. – Opportunist
Your answer is super clear. However, it returns a TypeError: unhashable type: 'list' error when I try to drop duplicates. Can you check it? And if that is the case, can you update your answer? – Popularity
@Popularity that implies that the items you are trying to check for duplication are lists. If I'm correct, you need to change those items to a hashable type. If all items are lists, you may want to make them all tuples instead. If this doesn't make sense, feel free to ask a new question following the advice here -> minimal reproducible example. If you add a reply to this comment with a link to that question, I'll try to check it out and make sure you get your answer. – Opportunist

You can use .duplicated, which has the benefit of being fairly expressive:

%%timeit
combined = df1.append(df2)
combined[~combined.index.duplicated(keep=False)]

1000 loops, best of 3: 875 µs per loop
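Note that DataFrame.append was removed in pandas 2.0; the same check can be written with pd.concat. A sketch (df2 is taken here as a slice of df1, because .index.duplicated compares index labels rather than row values, and the shared rows must carry the same labels in both frames):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
# Take df2 as a slice of df1 so the shared rows keep the same index
# labels -- the label-based check below depends on that.
df2 = df1.iloc[[0, 2]]

combined = pd.concat([df1, df2])
# Keep only rows whose index label occurs exactly once in the
# concatenation; labels 0 and 2 occur twice and are dropped.
result = combined[~combined.index.duplicated(keep=False)]
print(result)  # only the (3, 4) row, at label 1
```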

For comparison:

%timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']

100 loops, best of 3: 4.57 ms per loop


%timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

1000 loops, best of 3: 987 µs per loop


%timeit df1[df1.apply(lambda x: x.values not in df2.values, axis=1)]

1000 loops, best of 3: 546 µs per loop

In sum, using the np.array comparison is fastest; the .tolist() is not needed there.

Tumbler answered 19/5, 2016 at 19:4 Comment(2)
Be careful: this only works if the subtracted DataFrame only contains data that is included in the first one. But I do like this answer. – Surprising
Which of these would also work on Series? – Earthlight

To get a dataframe with all records which are in DF1 but not in DF2:

DF = DF1[~DF1.isin(DF2)].dropna(how='all')
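Be aware that isin with a DataFrame argument matches on both index and column labels, so the line above relies on the shared rows carrying the same labels in both frames. A label-independent sketch compares whole rows as tuples instead:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# Collapse each row to a tuple, then test membership with Series.isin.
rows_in_df2 = df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))
result = df1[~rows_in_df2]
print(result.values.tolist())  # [[3, 4]]
```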
Midriff answered 1/9, 2020 at 21:6 Comment(2)
Such an elegant and pythonic solution. Works for me. Thanks – Visibility
I wouldn't call pandas syntax pythonic at all, but this seems by far the best solution, not having to destroy your original dataframe duplicates to drop the contents of another dataframe. Got my upvote, sir – Adust

This solution works when your df_to_drop is a subset of the main data frame data, so that its index labels exist there.

data_clean = data.drop(df_to_drop.index)
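A sketch of that index assumption: df_to_drop must share its index labels with data, for example because it was produced by filtering data.

```python
import pandas as pd

data = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
# df_to_drop is a filtered view of `data`, so its index labels
# (0 and 2 here) are guaranteed to exist in `data`.
df_to_drop = data[data['a'] != 3]

data_clean = data.drop(df_to_drop.index)
print(data_clean.values.tolist())  # [[3, 4]]
```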

Creaturely answered 10/5, 2022 at 7:6 Comment(0)

A set-logic approach: turn the rows of df1 and df2 into sets, then use set subtraction to define the new DataFrame.

idx1 = set(df1.set_index(['a', 'b']).index)
idx2 = set(df2.set_index(['a', 'b']).index)

pd.DataFrame(list(idx1 - idx2), columns=df1.columns)

   a  b
0  3  4
Opportunist answered 19/5, 2016 at 8:32 Comment(0)

A masking approach

df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]

   a  b
1  3  4
Opportunist answered 19/5, 2016 at 8:43 Comment(1)
don't need .tolist(). – Tumbler

My shot with merge, using df1 and df2 from the question.

Using the 'indicator' parameter:

In [74]: df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
Out[74]: 
   a  b
1  3  4
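To see what indicator=True contributes, here is a sketch that inspects the intermediate _merge column, using the question's frames:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# indicator=True adds a `_merge` column recording, per row, whether it
# was found in the left frame only, the right only, or both.
merged = pd.merge(df1, df2, on=['a', 'b'], how='left', indicator=True)
print(merged['_merge'].tolist())  # ['both', 'left_only', 'both']

# Rows marked 'left_only' are exactly df1 minus df2.
result = df1.loc[merged['_merge'] == 'left_only']
print(result.values.tolist())  # [[3, 4]]
```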
Shanghai answered 19/5, 2016 at 9:43 Comment(1)
An explanation of what is happening would make this a richer answer. You mentioned that the key to this method's success is the 'indicator' parameter; setting it to true adds location information to each row, which your solution uses in the final step to filter, keeping only rows that appear only in the left data frame (indicator == 'left_only'). – Topnotch

I think the first tolist() needs to be removed, but keep the second one:

df1[df1.apply(lambda x: x.values not in df2.values.tolist(), axis=1)]
Clytemnestra answered 25/10, 2018 at 11:40 Comment(0)

The easiest option is to use indexes.

  1. Append df1 and df2 and reset the index.

    df = pd.concat([df1, df2])
    df.reset_index(inplace=True, drop=True)

  2. Build the index of rows that match df2, then keep the rest:

    indexes_df2 = df.index[(df["a"].isin(df2["a"])) & (df["b"].isin(df2["b"]))]
    result_data = df.loc[~df.index.isin(indexes_df2)]

Hope it will help new readers, although the question was posted a little while ago :)

Flare answered 14/11, 2018 at 21:17 Comment(0)

Solution if df1 contains duplicates, keeping the index.

A modified version of piRSquared's answer to keep the duplicates in df1 that do not appear in df2, while maintaining the index.

df1[df1.apply(lambda x: (x == pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)).all(1).any(), axis=1)]

If your dataframes are big, you may want to store the result of

pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)

in a variable before the df1.apply call.
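Following that tip about storing the intermediate result in a variable, a sketch with a duplicated (3, 4) row in df1:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# Rows that survive the subtraction, computed once up front.
keep = pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)

# Keep every df1 row (duplicates included) that matches a surviving row,
# preserving df1's original index.
result = df1[df1.apply(lambda x: (x == keep).all(1).any(), axis=1)]
print(result.values.tolist())  # [[3, 4], [3, 4]]
print(list(result.index))      # [1, 2]
```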

Dahliadahlstrom answered 6/5, 2022 at 9:44 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.