Efficiently match all values of a vector in another vector

Asked 1/6, 2023 at 14:4 Answered 2/6, 2023 at 20:53

Solved r performance optimization search matching

I'm looking to find an efficient method of matching all values of vector x in vector y rather than just the first position, as is returned by match(). What I'm after essentially is the default behavior of pmatch() but without partial matching:

x <- c(3L, 1L, 2L, 3L, 3L, 2L)
y <- c(3L, 3L, 3L, 3L, 1L, 3L)

Expected output:

pmatch(x, y)  
[1]  1  5 NA  2  3 NA

One way is to use ave() however this becomes slow and very memory inefficient as the number of groups increases:

ave(x, x, FUN = \(v) which(y == v[1])[1:length(v)])
[1]  1  5 NA  2  3 NA

Can anyone recommend an efficient way to achieve this in preferably (but not mandatory) base R?

Larger dataset for benchmarking:

set.seed(5)
x <- sample(5e3, 1e5, replace = TRUE)
y <- sample(x, replace = TRUE)

Multinational answered 1/6, 2023 at 14:4 Comment(0)

A variant in base using split.
split the indices of both vectors by its value. Subset the second list with the names of the first, that both have the same order. Change NULL to NA and bring the lengths of the second list to those from the first. Reorder the indices of the second list by those of the first.

x <- c(3L, 1L, 2L, 3L, 3L, 2L)
y <- c(3L, 3L, 3L, 3L, 1L, 3L)

a <- split(seq_along(x), x)
b <- split(seq_along(y), y)[names(a)]
b[lengths(b)==0] <- NA
b <- unlist(Map(`length<-`, b, lengths(a)), FALSE, FALSE)
`[<-`(b, unlist(a, FALSE, FALSE), b)
#[1]  1  5 NA  2  3 NA

I tried to exchange the part

b <- split(seq_along(y), y)[names(a)]
b[lengths(b)==0] <- NA

with

b <- list2env(split(seq_along(y), y))
b <- mget(names(a), b, ifnotfound = NA)

But it was not faster.

An RCPP version.
Store the indices of the second vector ín a queue for each unique value in an unordered_map. Iterate over all values of the first vector and take the indices from the queue.

Rcpp::sourceCpp(code=r"(
#include <Rcpp.h>
#include <unordered_map>
#include <queue>

using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector pm(const std::vector<int>& a, const std::vector<int>& b) {
  IntegerVector idx(no_init(a.size()));
  std::unordered_map<int, std::queue<int> > lut;
  for(int i = 0; i < b.size(); ++i) lut[b[i]].push(i);
  for(int i = 0; i < idx.size(); ++i) {
    auto search = lut.find(a[i]);
    if(search != lut.end() && search->second.size() > 0) {
      idx[i] = search->second.front() + 1;
      search->second.pop();
    } else {idx[i] = NA_INTEGER;}
  }
  return idx;
}
)")
pm(x, y)
#[1]  1  5 NA  2  3 NA

A for this case specialized RCPP version.
Create a vector of the length of the maximum value of the first vector and count how many times a value is present. Create another queue vector of the same length and sore there the indices of the values of the second vector until it has reached the number of the first. Iterate over all values of the first vector and take the indices from the queue.

Rcpp::sourceCpp(code=r"(
#include <Rcpp.h>
#include <vector>
#include <array>
#include <queue>
#include <algorithm>

using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector pm2(const std::vector<int>& a, const std::vector<int>& b) {
  IntegerVector idx(no_init(a.size()));
  int max = 1 + *std::max_element(a.begin(), a.end());
  std::vector<int> n(max);
  for(int i = 0; i < a.size(); ++i) ++n[a[i]];
  std::vector<std::queue<int> > lut(max);
  for(int i = 0; i < b.size(); ++i) {
    if(b[i] < max && n[b[i]] > 0) {
      --n[b[i]];
      lut[b[i]].push(i);
    }
  }
  for(int i = 0; i < idx.size(); ++i) {
    auto & P = lut[a[i]];
    if(P.size() > 0) {
      idx[i] = P.front() + 1;
      P.pop();
    } else {idx[i] = NA_INTEGER;}
  }
  return idx;
}
)")
pm2(x,y)
#[1]  1  5 NA  2  3 NA

Benchmark

set.seed(5)
x <- sample(5e3, 1e5, replace = TRUE)
y <- sample(x, replace = TRUE)

library(data.table)

matchall <- function(x, y) {
  data.table(y, rowid(y))[
    data.table(x, rowid(x)), on = .(y = x, V2), which = TRUE
  ]
}

rmatch <- function(x, y) {
  xp <- cbind(seq_along(x), x)[order(x),]
  yp <- cbind(seq_along(y), y)[order(y),]
  result <- numeric(length(x))
  
  xi <- yi <- 1
  Nx <- length(x)
  Ny <- length(y)
  while (xi <= Nx) {
    if (yi > Ny) {
      result[xp[xi,1]] <- NA
      xi <- xi + 1
    } else if (xp[xi,2] == yp[yi,2]) {
      result[xp[xi,1]] = yp[yi,1]
      xi <- xi + 1
      yi <- yi + 1
    } else if (xp[xi,2] < yp[yi,2]) {
      result[xp[xi,1]] <- NA
      xi <- xi + 1
    } else if (xp[xi,2] > yp[yi,2]) {
      yi <- yi + 1
    }
  }
  result  
}

bench::mark(
ave = ave(x, x, FUN = \(v) which(y == v[1])[1:length(v)]),
rmatch = rmatch(x, y),
make.name = match(make.names(x, TRUE), make.names(y, TRUE)),
paste = do.call(match, lapply(list(x, y), \(v) paste(v, ave(v, v, FUN = seq_along)))),
make.unique = match(make.unique(as.character(x)), make.unique(as.character(y))),
split = {a <- split(seq_along(x), x)
  b <- split(seq_along(y), y)[names(a)]
  b[lengths(b)==0] <- NA
  b <- unlist(Map(`length<-`, b, lengths(a)), FALSE, FALSE)
  `[<-`(b, unlist(a, FALSE, FALSE), b)},
data.table = matchall(x, y),
RCPP = pm(x, y),
RCPP2 = pm2(x, y)
)

Result

  expression       min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 ave            1.66s    1.66s     0.603    3.73GB    68.7      1   114
2 rmatch      258.29ms 259.35ms     3.86     5.34MB    30.8      2    16
3 make.name   155.69ms 156.82ms     6.37    14.06MB     1.59     4     1
4 paste         93.8ms 102.06ms     9.74    18.13MB     7.79     5     4
5 make.unique  81.67ms   92.8ms    10.4      9.49MB     5.22     6     3
6 split        12.66ms  13.16ms    65.8      7.18MB    16.0     33     8
7 data.table    6.22ms   6.89ms   114.       5.13MB    28.0     57    14
8 RCPP          3.06ms    3.2ms   301.     393.16KB     3.98   151     2
9 RCPP2         1.64ms   1.82ms   514.     393.16KB     8.00   257     4

In this case the C++ version is the fastest and allocates the lowest amount of memory. In case using base the splitB variant is the fastest and rmatch allocates the lowest amount of memory.

Isotone answered 1/6, 2023 at 20:26 Comment(2)

Thanks @GKi. I would have happily accepted any of the answers provided but your split option was the most efficient base solution. – Multinational 2/6, 2023 at 12:9

With some changes it could be made significant faster than my first post. Maybe there are other ways to do it even faster in base. – Isotone 4/6, 2023 at 5:22

Just to point out, you can use match + make.unique to accomplish the same. Speedwise, it might be slower than the data.table approach:

match(make.unique(as.character(x)), make.unique(as.character(y)))

[1]  1  5 NA  2  3 NA

match(make.names(x, TRUE), make.names(y, TRUE))
[1]  1  5 NA  2  3 NA

Ngo answered 1/6, 2023 at 15:41 Comment(0)

Using a data.table join, inspired by this Q&A.

library(data.table)

matchall <- function(x, y) {
  data.table(y, rowid(y))[
    data.table(x, rowid(x)), on = .(y = x, V2), which = TRUE
  ]
}

Check behavior

x <- c(3L, 1L, 2L, 3L, 3L, 2L)
y <- c(3L, 3L, 3L, 3L, 1L, 3L)

matchall(x, y)
#> [1]  1  5 NA  2  3 NA

Timing on larger vectors:

set.seed(5)
x <- sample(5e3, 1e5, replace = TRUE)
y <- sample(x, replace = TRUE)

system.time(z1 <- matchall(x, y))
#>    user  system elapsed 
#>    0.06    0.00    0.01

system.time(z2 <- ave(x, x, FUN = \(v) which(y == v[1])[1:length(v)]))
#>    user  system elapsed 
#>    0.88    0.43    1.31

identical(z1, z2)
#> [1] TRUE

Virgil answered 1/6, 2023 at 14:51 Comment(0)

If you have some extra memory to spare, you can speed up the process by sorting the values and basically doing a two-pointer walk through to match up the data. Here's what what would look like

rmatch <- function(x, y) {
  xp <- cbind(seq_along(x), x)[order(x),]
  yp <- cbind(seq_along(y), y)[order(y),]
  result <- numeric(length(x))
  
  xi <- yi <- 1
  Nx <- length(x)
  Ny <- length(y)
  while (xi <= Nx) {
    if (yi > Ny) {
      result[xp[xi,1]] <- NA
      xi <- xi + 1
    } else if (xp[xi,2] == yp[yi,2]) {
      result[xp[xi,1]] = yp[yi,1]
      xi <- xi + 1
      yi <- yi + 1
    } else if (xp[xi,2] < yp[yi,2]) {
      result[xp[xi,1]] <- NA
      xi <- xi + 1
    } else if (xp[xi,2] > yp[yi,2]) {
      yi <- yi + 1
    }
  }
  result  
}

I tested with some of the other base R options posted here

mbm <- microbenchmark::microbenchmark(
  ave = ave(x, x, FUN = \(v) which(y == v[1])[1:length(v)]),
  rmatch = rmatch(x, y),
  pmatch = pmatch(x, y),
  times = 20
)

And saw that it seemed to perform well

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
    ave  1227.6743  1247.6980  1283.1024  1264.1485  1324.1569  1349.3276    20
 rmatch   198.1744   201.1058   208.3158   204.5933   209.4863   247.7279    20
 pmatch 39514.4227 39595.9720 39717.5887 39628.0892 39805.2405 40105.4337    20

These all return the same vector of values.

Chaldean answered 1/6, 2023 at 15:9 Comment(3)

Might be competitive with data.table with a C++ implementation? – Arbalest 1/6, 2023 at 15:31

Sure. But my goal was to just use base R, no dependencies (including system tools for compiling C++). Already the data.table approach is much faster because most of the work happens in the C++ back end. – Chaldean 1/6, 2023 at 15:34

@RitchieSacramento Thanks for the test case. I had an off-by-one error that I fixed. But I agree the split() method is a better alternative. – Chaldean 2/6, 2023 at 14:6

You can simply run match + paste + ave

> do.call(match, lapply(list(x, y), \(v) paste(v, ave(v, v, FUN = seq_along))))
[1]  1  5 NA  2  3 NA

Fanjet answered 2/6, 2023 at 20:53 Comment(0)

Recommended topics

Hot tags