Finding common rows (intersection) in two Pandas dataframes

Assume I have two dataframes of this format (call them df1 and df2):

+------------------------+------------------------+--------+
|        user_id         |      business_id       | rating |
+------------------------+------------------------+--------+
| rLtl8ZkDX5vH5nAx9C3q5Q | eIxSLxzIlfExI6vgAbn2JA |      4 |
| C6IOtaaYdLIT5fWd7ZYIuA | eIxSLxzIlfExI6vgAbn2JA |      5 |
| mlBC3pN9GXlUUfQi1qBBZA | KoIRdcIfh3XWxiCeV1BDmA |      3 |
+------------------------+------------------------+--------+

I'm looking to get a dataframe of all the rows that have a common user_id in df1 and df2 (i.e. if a user_id is in both df1 and df2, include the two rows in the output dataframe).

I can think of many ways to approach this, but they all strike me as clunky. For example, we could find all the unique user_ids in each dataframe, create a set of each, find their intersection, filter the two dataframes with the resulting set and concatenate the two filtered dataframes.

Maybe that's the best approach, but I know Pandas is clever. Is there a simpler way to do this? I've looked at merge but I don't think that's what I need.
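
For reference, here's a quick sketch of that set-based approach (assuming df1 and df2 both have the columns shown above):

import pandas as pd

# Intersect the sets of user_ids, then filter each frame and stack the results.
common_ids = set(df1['user_id']) & set(df2['user_id'])
result = pd.concat([df1[df1['user_id'].isin(common_ids)],
                    df2[df2['user_id'].isin(common_ids)]])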

Rolanderolando asked 27/10, 2013 at 14:3 Comment(0)

My understanding is that this question is better answered over in this post.

But briefly, the answer to the OP with this method is simply:

s1 = pd.merge(df1, df2, how='inner', on=['user_id'])

This gives s1 with 5 columns: user_id plus the other two columns from each of df1 and df2.
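
For example, with the column layout from the question (made-up values, just to show the shape of the result):

import pandas as pd

# Hypothetical frames with the question's schema
df1 = pd.DataFrame({'user_id': ['a', 'b'], 'business_id': ['x', 'y'], 'rating': [4, 5]})
df2 = pd.DataFrame({'user_id': ['b', 'c'], 'business_id': ['y', 'z'], 'rating': [3, 2]})

s1 = pd.merge(df1, df2, how='inner', on=['user_id'])
# s1 columns: user_id, business_id_x, rating_x, business_id_y, rating_y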

Marienbad answered 29/5, 2015 at 17:47 Comment(0)

If I understand you correctly, you can use a combination of Series.isin() and DataFrame.append():

In [80]: df1
Out[80]:
   rating  user_id
0       2  0x21abL
1       1  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
5       2  0x21abL
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
9       1  0x21abL

In [81]: df2
Out[81]:
   rating      user_id
0       2      0x1d14L
1       1    0xdbdcad7
2       1      0x21abL
3       3      0x21abL
4       3      0x21abL
5       1  0x5734a81e2
6       2      0x1d14L
7       0       0xdafL
8       0      0x1d14L
9       4  0x5734a81e2

In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)

In [83]: ind
Out[83]:
0     True
1    False
2     True
3     True
4     True
5    False
6     True
7     True
8     True
9    False
Name: user_id, dtype: bool

In [84]: df1[ind].append(df2[ind])
Out[84]:
   rating  user_id
0       2  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
0       2  0x1d14L
2       1  0x21abL
3       3  0x21abL
4       3  0x21abL
6       2  0x1d14L
7       0   0xdafL
8       0  0x1d14L

This is essentially the algorithm you described as "clunky", using idiomatic pandas methods. Note the duplicate row indices. Also, note that this won't give you the expected output if df1 and df2 have no overlapping row indices, i.e., if

In [93]: df1.index & df2.index
Out[93]: Int64Index([], dtype='int64')

In fact, it won't give the expected output if their row indices are not equal.
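
Note that DataFrame.append has since been removed (pandas 2.0+). The same idea can be written with pd.concat and two independent isin masks, which also avoids the equal-length/equal-index caveat above (a sketch using the same df1 and df2):

import pandas as pd

# Each mask is computed against its own frame, so the row indices never need to align.
mask1 = df1['user_id'].isin(df2['user_id'])
mask2 = df2['user_id'].isin(df1['user_id'])
common = pd.concat([df1[mask1], df2[mask2]])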

Akihito answered 27/10, 2013 at 14:15 Comment(5)
Ah. I had thought about that, but it doesn't give me what I want. I'm looking to have the two rows as two separate rows in the output dataframe. This solution instead doubles the number of columns and uses suffixes. I don't think there's a way to use merge to create the two separate rows.Rolanderolando
+1 for merge, but it looks like the OP wants slightly different output. I've created what looks like what he needs, but I'm not sure it's the most elegant pandas solution.Prado
This will only work if df1 and df2 are of the same length (with matching indices). Otherwise, the two boolean Series in df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id) won't align.Cat
I think it's solvable using df1 = df1[df1.index.isin(df2.index)]; df2 = df2[df2.index.isin(df1.index)]Cat
Why do you need the & here? df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)Custommade

In SQL, this problem could be solved by several methods:

select * from df1 where exists (select * from df2 where df2.user_id = df1.user_id)
union all
select * from df2 where exists (select * from df1 where df1.user_id = df2.user_id)

or with a join followed by an unpivot (possible in SQL Server):

select
    df1.user_id,
    c.rating
from df1
    inner join df2 on df2.user_id = df1.user_id
    outer apply (
        select df1.rating union all
        select df2.rating
    ) as c

The second one could be written in pandas with something like:

>>> df1 = pd.DataFrame({"user_id":[1,2,3], "rating":[10, 15, 20]})
>>> df2 = pd.DataFrame({"user_id":[3,4,5], "rating":[30, 35, 40]})
>>>
>>> df = pd.merge(df1, df2, on='user_id', suffixes=['_1', '_2'])
>>> df3 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})
>>> df4 = df[['user_id', 'rating_2']].rename(columns={'rating_2':'rating'})
>>> pd.concat([df3, df4], axis=0)
   user_id  rating
0        3      20
0        3      30
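
The first (exists / union all) query maps to pandas more directly with isin and concat, which skips the merge-and-unpivot step entirely (a sketch, using the same df1 and df2 as above):

>>> part1 = df1[df1['user_id'].isin(df2['user_id'])]
>>> part2 = df2[df2['user_id'].isin(df1['user_id'])]
>>> pd.concat([part1, part2])
   user_id  rating
2        3      20
0        3      30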
Prado answered 27/10, 2013 at 14:38 Comment(1)
Can the second method be optimised/shortened?Avast

You can do this for n DataFrames and k columns by using pd.Index.intersection:

import pandas as pd
from functools import reduce
from typing import Union

def dataframe_intersection(
    dataframes: list[pd.DataFrame], by: Union[list, str]
) -> list[pd.DataFrame]:
    set_index = [d.set_index(by) for d in dataframes]
    index_intersection = reduce(pd.Index.intersection, [d.index for d in set_index])
    intersected = [df.loc[index_intersection].reset_index() for df in set_index]

    return intersected

df1 = pd.DataFrame({"user_id":[1,2,3], "business_id": ['a', 'b', 'c'], "rating":[10, 15, 20]})
df2 = pd.DataFrame({"user_id":[3,4,5], "business_id": ['c', 'd', 'e'], "rating":[30, 35, 40]})
df3 = pd.DataFrame({"user_id":[3,3,3], "business_id": ['f', 'c', 'f'], "rating":[50, 70, 80]})

df_list = [df1, df2, df3]

This gives

>>> pd.concat(dataframe_intersection(df_list, by='user_id'))
   user_id business_id  rating
0        3           c      20
0        3           c      30
0        3           f      50
1        3           c      70
2        3           f      80

And

>>> pd.concat(dataframe_intersection(df_list, by=['user_id', 'business_id']))
   user_id business_id  rating
0        3           c      20
0        3           c      30
0        3           c      70
Swallowtail answered 10/10, 2022 at 17:18 Comment(0)

This is a simple solution:

df1[df1 == df2].dropna()
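
Note that this only works when df1 and df2 are identically labeled (same shape, same index, same columns), and it keeps rows where every column matches, not just user_id. A small illustration with made-up data:

import pandas as pd

# Hypothetical identically-labeled frames
df1 = pd.DataFrame({'user_id': ['a', 'b', 'c'], 'rating': [1, 2, 3]})
df2 = pd.DataFrame({'user_id': ['a', 'x', 'c'], 'rating': [1, 9, 3]})

# Rows 0 and 2 survive because every cell in those rows is equal;
# row 1 is masked to NaN and dropped.
common = df1[df1 == df2].dropna()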

Damales answered 22/1, 2023 at 18:31 Comment(0)
