Filtering a matrix for unique combination in R

Asked 22/4 at 9:47 Answered 22/4 at 14:19

I'm trying to filter such that once a unique number from a pair of number is found starting from the top of the matrix, any subsequent pair entry is removed from the matrix leaving only the the filtered data.

T_data = c(7,9,8,10,2,10,5,9,1,8,2,1,4,7,5,4,2,5)

T_new = matrix(T_data,ncol=2,byrow=TRUE)

desired output: 7 9, 8 10, 2 1, 5 4

# Desired output
> matrix(c(7, 9, 8, 10, 2, 1, 5, 4), ncol = 2, byrow = TRUE)
     [,1] [,2]
[1,]    7    9
[2,]    8   10
[3,]    2    1
[4,]    5    4

I've tried writing my own loop but I presume there is a simple way to do this in R?

Incestuous answered 22/4 at 9:47 Comment(0)

You'll probably get the best performance with a well-designed for loop:

uniquemat <- function(x) {
  y <- array(match(c(x), u <- unique(c(x))), dim(x))
  u <- logical(length(u))
  k <- logical(nrow(x))
  u[y[1,]] <- k[1] <- TRUE
  for (i in 2:nrow(x)) if (!any(u[y[i,]])) u[y[i,]] <- k[i] <- TRUE
  x[k,]
}

uniquemat(T_data)
#>      [,1] [,2]
#> [1,]    7    9
#> [2,]    8   10
#> [3,]    2    1
#> [4,]    5    4

Benchmarking with a larger dataset:

fReduce <- function(x) { # from SamR
  Reduce(\(x,y) 
         if(any(x %in% y)) x else c(x, y), 
         asplit(x, 1)
  ) |> matrix(ncol = ncol(x), byrow = TRUE)
}

T_data <- matrix(sample(1e4, 1e4, 1), ncol = 2)

bench::mark(
  uniquemat = uniquemat(T_data),
  fReduce = fReduce(T_data)
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 uniquemat     4.3ms   4.68ms    191.       570KB     29.8
#> 2 fReduce     274.2ms 274.88ms      3.64     201MB     18.2

Using Reduce this way gets very slow for large matrices because x grows iteratively, which is bad. A final performance check on a matrix with 100K rows:

T_data <- matrix(sample(1e5, 1e5, 1), ncol = 2)

bench::mark(
  uniquemat = uniquemat(T_data),
  fReduce = fReduce(T_data)
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 uniquemat    86.5ms   99.8ms    7.41      5.18MB     14.8
#> 2 fReduce         24s      24s    0.0416   19.53GB     10.5

Mum answered 22/4 at 11:20 Comment(5)

third option added if you feel like updating the benchmark – Algebraic 22/4 at 11:57

Definitely a better approach if the data is large. Can we add a benchmark for the criteria in the question of not using a for loop, and writing idiomatic R code? :P – Bettencourt 22/4 at 12:0

@samR Agree that speed is not everything but I think the OP is not ruling out a loop. – Algebraic 22/4 at 14:19

@s_baldur, it appears that your solution is essentially the same as mine, except that yours assumes an integer matrix. If the matrix is integer and the maximum value is moderate, it would be faster to skip the match and use the values directly as indices--essentially what you did. – Mum 22/4 at 14:33

thanks, i was wondering on the efficiency of a standard looping approach. this is great. ty. – Incestuous 22/4 at 22:46

You have to use some form of iteration as you want a row to be excluded if (and only if) it contains a value which is included in your new matrix (rather than the original one).

You can do this with Reduce(), iterating over a list of rows created with asplit().

Reduce(\(x,y) 
    if(any(x %in% y)) x else c(x, y), 
    asplit(T_new, 1)
) |> matrix(ncol = ncol(T_new), byrow = TRUE)

#      [,1] [,2]
# [1,]    7    9
# [2,]    8   10
# [3,]    2    1
# [4,]    5    4

We can pipe this to matrix() to give it the dimensions of the input. This should work with a matrix with any dimensions.

Bettencourt answered 22/4 at 10:49 Comment(0)

You'll probably get the best performance with a well-designed for loop:

uniquemat <- function(x) {
  y <- array(match(c(x), u <- unique(c(x))), dim(x))
  u <- logical(length(u))
  k <- logical(nrow(x))
  u[y[1,]] <- k[1] <- TRUE
  for (i in 2:nrow(x)) if (!any(u[y[i,]])) u[y[i,]] <- k[i] <- TRUE
  x[k,]
}

uniquemat(T_data)
#>      [,1] [,2]
#> [1,]    7    9
#> [2,]    8   10
#> [3,]    2    1
#> [4,]    5    4

Benchmarking with a larger dataset:

fReduce <- function(x) { # from SamR
  Reduce(\(x,y) 
         if(any(x %in% y)) x else c(x, y), 
         asplit(x, 1)
  ) |> matrix(ncol = ncol(x), byrow = TRUE)
}

T_data <- matrix(sample(1e4, 1e4, 1), ncol = 2)

bench::mark(
  uniquemat = uniquemat(T_data),
  fReduce = fReduce(T_data)
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 uniquemat     4.3ms   4.68ms    191.       570KB     29.8
#> 2 fReduce     274.2ms 274.88ms      3.64     201MB     18.2

Using Reduce this way gets very slow for large matrices because x grows iteratively, which is bad. A final performance check on a matrix with 100K rows:

T_data <- matrix(sample(1e5, 1e5, 1), ncol = 2)

bench::mark(
  uniquemat = uniquemat(T_data),
  fReduce = fReduce(T_data)
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 uniquemat    86.5ms   99.8ms    7.41      5.18MB     14.8
#> 2 fReduce         24s      24s    0.0416   19.53GB     10.5

Mum answered 22/4 at 11:20 Comment(5)

third option added if you feel like updating the benchmark – Algebraic 22/4 at 11:57

Definitely a better approach if the data is large. Can we add a benchmark for the criteria in the question of not using a for loop, and writing idiomatic R code? :P – Bettencourt 22/4 at 12:0

@samR Agree that speed is not everything but I think the OP is not ruling out a loop. – Algebraic 22/4 at 14:19

thanks, i was wondering on the efficiency of a standard looping approach. this is great. ty. – Incestuous 22/4 at 22:46

It might not that efficient but would be fun if using recursions, e.g.,

f <- function(k = nrow(T_new)) {
    if (k == 1) {
        return(T_new[k, , drop = FALSE])
    }
    u <- Recall(k - 1)
    p <- T_new[k, , drop = FALSE]
    if (all(!p %in% c(u))) {
        return(rbind(u, p))
    } else {
        return(u)
    }
}

and you will see

> f()
     [,1] [,2]
[1,]    7    9
[2,]    8   10
[3,]    2    1
[4,]    5    4

Oakes answered 22/4 at 14:19 Comment(0)

Assuming the matrix contains only positive integers (as in the given example).

foo <- function(Tmat) {
  stopifnot(is.integer(Tmat), ncol(Tmat) == 2L)
  done             <- rep(FALSE, max(Tmat))
  done[Tmat[1L, ]] <- TRUE
  rows             <- rep(TRUE, nrow(Tmat))
  for (i in 2L:nrow(Tmat)) {
    if (any(done[Tmat[i, ]])) rows[i]         <- FALSE
    else                      done[Tmat[i, ]] <- TRUE
  }
  Tmat[which(rows), ]  
}

foo(T_new)

     [,1] [,2]
[1,]    7    9
[2,]    8   10
[3,]    2    1
[4,]    5    4

Algebraic answered 22/4 at 11:49 Comment(2)

This appears to assume Tmat is an integer matrix. Try foo(matrix(sample(runif(10), 20, 1), ncol = 2)). – Mum 22/4 at 12:51

That's correct. It assumes integer matrix of two columns but could be adapted depending on the structure of the data. – Algebraic 22/4 at 14:9

Recommended topics

Hot tags