Keep rows in data frame that, for all combinations of the values of certain columns, contain the same elements in another column
Asked Answered
U

9

20
df = pd.DataFrame({'a':['x','x','x','x','x','y','y','y','y','y'],'b':['z','z','z','w','w','z','z','w','w','w'],'c':['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],'d':range(1,11)})

   a  b   c   d
0  x  z  c1   1
1  x  z  c2   2
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
8  y  w  c2   9
9  y  w  c3  10

how can I keep only the rows that, for all combinations of a and b, contain the same values in c? Or in other words, how to exclude rows with c values that are only present in some combinations of a and b?

For example, only c1 and c3 are present in all combinations of a and b ([x,z],[x,w],[y,z],[y,w]), so the output would be

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
Ultramicroscope answered 13/1, 2021 at 18:8 Comment(0)
K
10

Here is one way. Get unique lists per group and then check common elements across all the returned arrays using reduce and np.intersect1d. Then filter the dataframe using series.isin and boolean indexing

from functools import reduce
out = df[df['c'].isin(reduce(np.intersect1d,df.groupby(['a','b'])['c'].unique()))]

Breakdown:

s = df.groupby(['a','b'])['c'].unique()
common_elements = reduce(np.intersect1d,s)
#Returns :-> array(['c1', 'c3'], dtype=object)

out = df[df['c'].isin(common_elements )]#.copy()

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
Karafuto answered 13/1, 2021 at 18:25 Comment(0)
M
10

Lets try groupby with nunique to count of unique elements per column c group:

s = df['a'] + ',' + df['b'] # combination of a, b
m = s.groupby(df['c']).transform('nunique').eq(s.nunique())

df[m]

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
Monitorial answered 13/1, 2021 at 18:30 Comment(4)
add is not really safe, it would confuse ('aa', 'b') with ('a', 'ab')Pyrophosphate
@QuangHoang True. How about using an seperator like df['a'] + ',' + df['b']Monitorial
That'll be sensitive to , inside the field. Better do with tuple.Pyrophosphate
@QuangHoang That's very edge case, I think in that case using more complex seperator should work fine as well.Monitorial
U
9

Try something diff crosstab

s = pd.crosstab([df['a'],df['b']],df.c).all()
out = df.loc[df.c.isin(s.index[s])]
Out[34]: 
   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
Upmost answered 13/1, 2021 at 19:9 Comment(0)
P
8

Let's try pivot the table, then drop NA, which means a value is missing in the combination:

all_data =(df.pivot(index=['a','b'], columns='c', values='c')
             .loc[:, lambda x: x.notna().all()]
             .columns)
df[df['c'].isin(all_data)]

Output:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
Pyrophosphate answered 13/1, 2021 at 18:30 Comment(1)
@Karafuto probably pivot does not allow same column and values. Use values='d' and aggfunc='size' with pivot_table instead.Pyrophosphate
G
7

We can use groupby + size and then unstack, which will fill NaN for groups of ['a', 'b'] that are missing a 'c' group. Then we dropna and subset the original DataFrame to the c values that survive the dropna.

df[df.c.isin(df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1).columns)]

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

The result of the groupby operation contains columns only for groups of c that exist in all unique combinations of ['a', 'b'], so we just grab the columns attribute.

df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1)

#c     c1   c3
#a b          
#x w  1.0  1.0
#  z  1.0  1.0
#y w  1.0  1.0
#  z  1.0  1.0
Gahl answered 13/1, 2021 at 18:49 Comment(0)
B
3

You could use list comprehension with str.contains:

unq = [[x, len(df[(df[['a','b','c']].agg(','.join, axis=1)).str.contains(',' + x)]
                   .drop_duplicates())] for x in df['c'].unique()]
keep = [lst[0] for lst in unq if lst[1] == max([lst[1] for lst in unq])]
df = df[df['c'].isin(keep)]
df

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
Boric answered 13/1, 2021 at 18:41 Comment(0)
B
1

If you make the below assumptions this works to give you which elements of column c to keep:

df.groupby("c")["a"].count() == df.groupby("c")["a"].count().max()

Output:

c
c1     True
c2    False
c3     True
Name: a, dtype: bool

Assumptions:

  1. There are no duplicates
  2. There is at least one value for column c that contains all combinations of a and b.
Bruis answered 13/1, 2021 at 18:21 Comment(0)
M
1

You can use value_counts and get all combinations of a and b:

vc = df[['a', 'b']].drop_duplicates().value_counts()

Result:

a  b
y  z    1
   w    1
x  z    1
   w    1

Then you can compare counts for each group with vc and filter out groups with missing combinations:

df.groupby('c').filter(lambda x: x[['a', 'b']].value_counts().ge(vc).all())

Output:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
Mantelletta answered 19/1, 2021 at 20:54 Comment(0)
N
-1

Assuming there are 4 distinct values as per the example:

A simple solution can be :

df[df['a'].groupby(df['c']).transform('count').eq(4)]
Nimwegen answered 13/1, 2021 at 19:11 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.