Here is an example data frame to show my problem and what I want to achieve.
Here I have two columns, x and y, that I want to remove duplicates from. I also have a column z that contains a sorted rank of the rows.
x y z
1 A BB 8
2 B BB 7.5
3 B AA 6.2
4 B CC 5
5 C DD 4
6 D CC 3
I am trying to look at both x and y at the same time: going down row by row, every time there is a duplicate in either column, delete that row and keep going.
The end result I'm looking for is
x y z
1 A BB 8
3 B AA 6.2
5 C DD 4
6 D CC 3
The second BB in column y is removed. Then the B - AA row is not removed, since going down row by row it is now the first B in the x column. This is for a large dataset, so unfortunately I cannot do it by hand.
I am not trying to group the two columns together. I also do not want to delete duplicates one column at a time, since that would remove too many observations.
How can this be achieved?
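To make the intended rule concrete, here is a minimal sketch in plain Python (not tied to any data frame library). It assumes the rule is: walk the rows in their current order and keep a row only if neither its x value nor its y value has already appeared in a row that was kept. The function name `drop_cross_duplicates` and the list-of-dicts layout are illustrative choices, not part of the original data.

```python
def drop_cross_duplicates(rows):
    """Keep a row only if its x and its y are both new among the rows kept so far."""
    seen_x, seen_y = set(), set()
    kept = []
    for row in rows:
        if row["x"] in seen_x or row["y"] in seen_y:
            continue  # duplicate in either column: drop this row and keep going
        # Only values from kept rows are remembered, so a dropped row
        # cannot cause a later row to be treated as a duplicate.
        seen_x.add(row["x"])
        seen_y.add(row["y"])
        kept.append(row)
    return kept


# The example data from the question, already sorted by z.
rows = [
    {"x": "A", "y": "BB", "z": 8},
    {"x": "B", "y": "BB", "z": 7.5},
    {"x": "B", "y": "AA", "z": 6.2},
    {"x": "B", "y": "CC", "z": 5},
    {"x": "C", "y": "DD", "z": 4},
    {"x": "D", "y": "CC", "z": 3},
]

result = drop_cross_duplicates(rows)
for row in result:
    print(row["x"], row["y"], row["z"])
```

On the example data this keeps A BB, B AA, C DD and D CC, which matches the end result shown above. Because each row's fate depends on which earlier rows were kept, the pass is inherently sequential; it is a single loop with set lookups, so it should scale to a large dataset.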
x is "B" in rows 2, 3, and 4. If you sort by x first, rows 3 and 4 will disappear. But if you sort by y first, row 3 will remain. What do you really want? For example, do you want to simultaneously check x[j] vs x[j+1] and y[j] vs y[j+1], removing the (j+1)-th row if either is a dupe? – Predominance

A BB at the top so it's fine. Then B BB and notice that this is the second BB so I delete the row. Next B AA and see this is the second B and delete that row. Finally we get to B CC; if we deleted duplicates normally we would have removed this already. However, we look up and see that because rows 2 and 3 were deleted, B CC is not a duplicate since there is none above it. Does this help? I can create a video maybe? – Claw
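For comparison, the adjacent-row check raised in the first comment (x[j] vs x[j+1] and y[j] vs y[j+1]) can be sketched the same way. This is only an illustration of that reading, assuming each row is compared with the row directly above it in the original data; the name `drop_adjacent_duplicates` is hypothetical.

```python
def drop_adjacent_duplicates(rows):
    """Drop row j+1 when its x or its y equals that of row j in the original order."""
    kept = rows[:1]  # the first row has nothing above it
    for prev, cur in zip(rows, rows[1:]):
        if cur[0] == prev[0] or cur[1] == prev[1]:
            continue  # matches the row directly above in either column
        kept.append(cur)
    return kept


# Example data from the question as (x, y, z) tuples.
rows = [
    ("A", "BB", 8),
    ("B", "BB", 7.5),
    ("B", "AA", 6.2),
    ("B", "CC", 5),
    ("C", "DD", 4),
    ("D", "CC", 3),
]

adjacent_result = drop_adjacent_duplicates(rows)
for x, y, z in adjacent_result:
    print(x, y, z)
```

On the example data this keeps only A BB, C DD and D CC, so it drops B AA and does not reproduce the end result in the question.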