import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})
a b c d
0 x z c1 1
1 x z c2 2
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
8 y w c2 9
9 y w c3 10
How can I keep only the rows that, for all combinations of a and b, contain the same values in c? In other words, how can I exclude rows whose c values are present in only some combinations of a and b?
For example, only c1 and c3 are present in all combinations of a and b ([x,z], [x,w], [y,z], [y,w]), so the output would be:
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
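One possible way to get this output, as a minimal sketch rather than a definitive answer: count in how many distinct (a, b) combinations each c value occurs, and keep the rows whose c value occurs in all of them. It assumes that "all combinations" means the (a, b) pairs actually present in the data, not the full Cartesian product. The names n_combos, pairs_per_c and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Number of distinct (a, b) combinations present in the data (assumption:
# only observed pairs count, not every theoretically possible pair).
n_combos = len(df[['a', 'b']].drop_duplicates())

# For each c value, the number of distinct (a, b) combinations it appears in.
# Dropping duplicate (a, b, c) rows first keeps repeated rows from inflating the count.
pairs_per_c = df.drop_duplicates(['a', 'b', 'c']).groupby('c').size()

# Keep only rows whose c value appears in every (a, b) combination.
result = df[df['c'].map(pairs_per_c) == n_combos]
print(result)
```

Grouping on the two columns together, rather than on a single concatenated key, avoids ambiguity when values in a and b can overlap.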
add is not really safe, it would confuse ('aa', 'b') with ('a', 'ab')
– Pyrophosphate
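The collision described in this comment can be reproduced with a small sketch. It assumes the comment is about building a single key by string-concatenating the a and b columns. The frame demo and the other variable names are hypothetical.

```python
import pandas as pd

# Two different (a, b) pairs whose concatenation is the same string.
demo = pd.DataFrame({'a': ['aa', 'a'], 'b': ['b', 'ab']})

# Concatenated key: both rows become 'aab', so the two pairs are indistinguishable.
concat_keys = demo['a'] + demo['b']
n_concat = concat_keys.nunique()

# Treating the columns as a pair keeps the two combinations distinct.
n_pairs = len(demo.drop_duplicates(['a', 'b']))

print(n_concat, n_pairs)
```

Grouping or deduplicating on both columns, as in the last step, sidesteps the problem without having to pick a separator that never occurs in the data.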