Remove duplicates in two ungrouped columns from top to bottom
Asked Answered
C

4

13

Here is an example data frame to show my problem and what I want to achieve. Here I have two columns, x and y, that I want to remove duplicates from. I also have column z that contains the a sorted rank of the rows.

  x   y   z
1 A   BB  8
2 B   BB  7.5
3 B   AA  6.2
4 B   CC  5
5 C   DD  4
6 D   CC  3

I am trying to look at both x and y at the same time and every time there is a duplicate in either column then delete the row and keep going.

The end result I'm looking for is

  x   y   z
1 A   BB  8
3 B   AA  6.2
5 C   DD  4
6 D   CC  3

The second BB in column y is removed. Then the B - AA row is not removed since going down row by row it is now the first B in the x column. This is for a large dataset so unfortunately I can not do it by hand.

I am not trying to group the two columns together. I also do not want to delete duplicates one column at a time either since if that was done then it would remove too many observations.

How can this be achieved?

Claw answered 18/12, 2023 at 15:40 Comment(2)
In your example, x is "B" in rows 2,3, and 4. If you sort by x first, rows 3 and 4 will disappear. But if you sort by y first, row 3 will remain. What do you really want? For example, do you want to simultaneously check x[j] vs x[j+1] and y[j] vs y[j+1] , removing the j+1 -th row if either is a dupe?Predominance
@CarlWitthoft I'm having trouble trying to frame this as a formula problem, I'm sorry. My thought process is that I go through each row 1 by 1. So first I see A BB at the top so it's fine. Then B BB and notice that this is the second BB so I delete the row. Next B AA and see this is the second B and delete that row. Finally we get to B CC, if we deleted duplicates normally we would have removed this already, however, we look up and see that because rows 2 and 3 were deleted then B CC is not a duplicate since there is none above it. Does this help? I can create a video maybe?Claw
C
0

This solved my problem.

remove_top_duplicate <- function(tempdf) {
  tempdf <- tempdf %>% 
    group_by(x) %>% 
    mutate(xrep = row_number()) %>% 
    ungroup() %>% 
    group_by(y) %>% 
    mutate(yrep = row_number()) %>% 
    ungroup()

  if(!(any(tempdf$xrep > 1)) & !(any(tempdf$yrep > 1))){
    break
  }

  tempdf <- slice(tempdf, -which(xrep > 1 | yrep > 1)[1])
  return(tempdf)
}

while(TRUE){
  df <- remove_top_duplicate(df)
  if(!(any(df$xrep > 1)) & !(any(df$yrep > 1))){
    print("Finished")
    break
  }
}

I created a function that takes in a date frame and will mutate a row_numbers() column first for x and then y. This way we have two columns called xrep and yrep where we can find the top most duplicate. Then I just slice the first row found that is greater than 1 and then I return the data frame. Since this is a function I just run it in a while(TRUE) situation and just wait till there are no 2's in the repetition columns which is when we know there are no more duplicates.

Here is an example of what it looks like after the first run. It will then slice row two because of the 2 in the yrep column. After that it returns the data frame then repeats the process recalculating it again. Then slicing it again.

  x     y       z     xrep  yrep
1 A     BB      8       1     1
2 B     BB      7.5     1     2
3 B     AA      6.2     2     1
4 B     CC      5       3     1
5 C     DD      4       1     1
6 D     CC      3       1     2

This is not a clean solution but it gets the work done.

Claw answered 19/12, 2023 at 15:10 Comment(1)
Hi. Please avoid social & meta commentary in posts. The current 1st line would be noise to edit out but it is helpful to let people know when an answer is by the asker in case they don't notice the user. I left in the last line because it has some vague value. But neither "better" nor "prettier" mean anything in particular. PS When writing is chopped down to non-redundant non-social non-noise it can seem odd, but that shows that maybe it hadn't been saying what the author thought.Riess
P
6

As we need to check both columns simultaneously I doubt we can sequentially check x and y columns using duplicated() so here is my offer slower but possibly more reliable:

  i <- 2
  repeat {
    row_removed <- F
    if(df[i,]$x %in% df[1:(i-1),"x"]) {
      df <- df[-i,]
      row_removed <- T
    }
    if (i>nrow(df)) break
    if(df[i,]$y %in% df[1:(i-1),"y"]) {
      df <- df[-i,]
      row_removed <- T
    }
    if (!row_removed) i <- i + 1
    if (i>nrow(df)) break
  }

Result:

  x  y   z
1 A BB 8.0
3 B AA 6.2
5 C DD 4.0

For the edited dataset it is also returning result expected by OP:

  x  y   z
1 A BB 8.0
3 B AA 6.2
5 C DD 4.0
6 D CC 3.0
Pentecostal answered 18/12, 2023 at 16:34 Comment(3)
I think you're right. I was struggling to try and get this concept done in dplyr so maybe it must be done in some sort of loop to constantly check if it's a duplicate after other rows have been deleted. This will work for my case then. I just wish there was a simpler solution. I thought this problem would be more popular to warrant a function/method in dplyr or something.Claw
Tried getting to work but wasn't getting the correct answer for the edit that you putClaw
@Claw very strange I have just retried and it is exactly as I posted. I used this df <- structure(list(x = c("A", "B", "B", "B", "C", "D"), y = c("BB", "BB", "AA", "CC", "DD", "CC"), z = c(8, 7.5, 6.2, 5, 4, 3)), class = "data.frame", row.names = c(NA, -6L))Pentecostal
K
4

It seems you need to iteratively check the latest row and decide if it should be kept or removed. In that case, you can try Reduce like below, where the row index updated iteratively:

df[!duplicated(
  Reduce(\(p, q) ifelse(any(df[p, c(1, 2)] == df[q, c(1, 2)]), p, q),
    seq_len(nrow(df)),
    accumulate = TRUE
  )
), ]

or

df[!duplicated(
  Reduce(\(...) rev(c(...))[1 + any(sapply(df[c(...), c(1, 2)], duplicated))],
    seq_len(nrow(df)),
    accumulate = TRUE
  )
), ]

which gives

  x  y   z
1 A BB 8.0
3 B AA 6.2
5 C DD 4.0
6 D CC 3.0

Data

structure(list(x = c("A", "B", "B", "B", "C", "D"), y = c("BB",
"BB", "AA", "CC", "DD", "CC"), z = c(8, 7.5, 6.2, 5, 4, 3)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Kasey answered 18/12, 2023 at 23:47 Comment(0)
B
2

Apparently you want to iteratively remove consecutive duplicates in first two columns, starting with the second column. We can try it with a repeat loop that iterates between second and first column until no more dupes are left:

> cs <- 1:2
> j <- 2
> repeat {
+   dat <- dat[-which.max(cumsum(duplicated(dat[, j])) == 1), ]
+   if (!any(sapply(dat[cs], duplicated))) break
+   if (!j %% 2) j <- 1 else j <- 2
+ }
> dat
  x  y   z
1 A BB 8.0
3 B AA 6.2
5 C DD 4.0
6 D CC 3.0

Note, I've had added an E -- CC row that was successfully removed. You may test this further.


Data:

> dput(dat)
structure(list(x = c("A", "B", "B", "B", "C", "D", "E"), y = c("BB", 
"BB", "AA", "CC", "DD", "CC", "CC"), z = c(8, 7.5, 6.2, 5, 4, 
3, 3)), class = "data.frame", row.names = c(NA, -7L))
Brant answered 18/12, 2023 at 17:17 Comment(1)
This one didn't work on my larger data set. Does this delete duplicates in one column fully? If it does then it already won't work.Claw
C
0

This solved my problem.

remove_top_duplicate <- function(tempdf) {
  tempdf <- tempdf %>% 
    group_by(x) %>% 
    mutate(xrep = row_number()) %>% 
    ungroup() %>% 
    group_by(y) %>% 
    mutate(yrep = row_number()) %>% 
    ungroup()

  if(!(any(tempdf$xrep > 1)) & !(any(tempdf$yrep > 1))){
    break
  }

  tempdf <- slice(tempdf, -which(xrep > 1 | yrep > 1)[1])
  return(tempdf)
}

while(TRUE){
  df <- remove_top_duplicate(df)
  if(!(any(df$xrep > 1)) & !(any(df$yrep > 1))){
    print("Finished")
    break
  }
}

I created a function that takes in a date frame and will mutate a row_numbers() column first for x and then y. This way we have two columns called xrep and yrep where we can find the top most duplicate. Then I just slice the first row found that is greater than 1 and then I return the data frame. Since this is a function I just run it in a while(TRUE) situation and just wait till there are no 2's in the repetition columns which is when we know there are no more duplicates.

Here is an example of what it looks like after the first run. It will then slice row two because of the 2 in the yrep column. After that it returns the data frame then repeats the process recalculating it again. Then slicing it again.

  x     y       z     xrep  yrep
1 A     BB      8       1     1
2 B     BB      7.5     1     2
3 B     AA      6.2     2     1
4 B     CC      5       3     1
5 C     DD      4       1     1
6 D     CC      3       1     2

This is not a clean solution but it gets the work done.

Claw answered 19/12, 2023 at 15:10 Comment(1)
Hi. Please avoid social & meta commentary in posts. The current 1st line would be noise to edit out but it is helpful to let people know when an answer is by the asker in case they don't notice the user. I left in the last line because it has some vague value. But neither "better" nor "prettier" mean anything in particular. PS When writing is chopped down to non-redundant non-social non-noise it can seem odd, but that shows that maybe it hadn't been saying what the author thought.Riess

© 2022 - 2025 — McMap. All rights reserved.