Pandas - group by id and drop duplicate with threshold

Asked 1/5, 2017 at 3:3 Answered 1/5, 2017 at 4:33

Solved python pandas group-by duplicates threshold

I have the following data:

userid itemid
  1       1
  1       1
  1       3
  1       4
  2       1
  2       2
  2       3

I want to drop userIDs who has viewed the same itemID more than or equal to twice. For example, userid=1 has viewed itemid=1 twice, and thus I want to drop the entire record of userid=1. However, since userid=2 hasn't viewed the same item twice, I will leave userid=2 as it is.

So I want my data to be like the following:

userid itemid
  2       1
  2       2
  2       3

Can someone help me?

import pandas as pd    
df = pd.DataFrame({'userid':[1,1,1,1, 2,2,2],
                   'itemid':[1,1,3,4, 1,2,3] })

Sabbath answered 1/5, 2017 at 3:3 Comment(1)

Please post reproducible code that creates that data. – Urethra 1/5, 2017 at 3:23

You can use duplicated to determine the row level duplicates, then perform a groupby on 'userid' to determine 'userid' level duplicates, then drop accordingly.

To drop without a threshold:

df = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]

To drop with a threshold, use keep=False in duplicated, and sum over the Boolean column and compare against your threshold. For example, with a threshold of 3:

df = df[~df.duplicated(['userid', 'itemid'], keep=False).groupby(df['userid']).transform('sum').ge(3)]

The resulting output for no threshold:

   userid  itemid
4       2       1
5       2       2
6       2       3

Blackmore answered 1/5, 2017 at 3:24 Comment(2)

What if I want to drop regarding a threshold? That is, what if I want to drop userids with more than 3 duplicates? – Sabbath 1/5, 2017 at 3:26

This is a special-case trick for THRESHOLD = 2. For larger THRESHOLD, you need @DYZ's more general code. – Urethra 1/5, 2017 at 3:30

`filter`

Was made for this. You can pass a function that returns a boolean that determines if the group passed the filter or not.

filter and value_counts
Most generalizable and intuitive

df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < 2)

filter and is_unique
special case when looking for n < 2

df.groupby('userid').filter(lambda x: x.itemid.is_unique)

   userid  itemid
4       2       1
5       2       2
6       2       3

Brieta answered 1/5, 2017 at 4:33 Comment(0)

Group the dataframe by users and items:

views = df.groupby(['userid','itemid'])['itemid'].count()
#userid  itemid
#1       1         2 <=== The offending row
#        3         1
#        4         1
#2       1         1
#        2         1
#        3         1
#Name: dummy, dtype: int64

Find out who saw any item only once:

THRESHOLD = 2
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)
#userid
#1    False
#2     True
#dtype: bool

Combine the results and keep the 'good' rows:

combined = df.merge(pd.DataFrame(viewed).reset_index())
combined[combined[0]][['userid','itemid']]
#   userid  itemid
#4       2       1
#5       2       2
#6       2       3

Sarraute answered 1/5, 2017 at 3:13 Comment(0)

# group userid and itemid and get a count
df2 = df.groupby(by=['userid','itemid']).apply(lambda x: len(x)).reset_index()
#Extract rows where the max userid-itemid count is less than 2.
df2 = df2[~df2.userid.isin(df2[df2.ix[:,-1]>1]['userid'])][df.columns]
print(df2)
   itemid  userid
3       1       2
4       2       2
5       3       2

If you want to drop at a certain threshold, just set

df2.ix[:,-1]>threshold]

Plumy answered 1/5, 2017 at 3:30 Comment(0)

I do not know whether there is a function available in Pandas to do this task. However, I tried to make a workaround to deal with your problem.

Here is the full code.

import pandas as pd
dictionary = {'userid':[1,1,1,1,2,2,2],
              'itemid':[1,1,3,4,1,2,3]}

df = pd.DataFrame(dictionary, columns=['userid', 'itemid'])

selected_user = []

for user in df['userid'].drop_duplicates().tolist():

    items = df.loc[df['userid']==user]['itemid'].tolist()
    if len(items) != len(set(items)): continue
    else: selected_user.append(user)

result = df.loc[(df['userid'].isin(selected_user))]

This code will result the following outcome.

    userid  itemid
4   2       1
5   2       2
6   2       3

Hope it helps.

Respect answered 1/5, 2017 at 3:48 Comment(1)

for user in df['userid'].drop_duplicates().tolist() can always be replaced by df.groupby('userid') – Urethra 1/5, 2017 at 7:18

`filter`

Recommended topics

Hot tags