A variant in base using split
.
split
the indices of both vectors by its value. Subset the second list with the names of the first, that both have the same order. Change NULL
to NA
and bring the lengths of the second list to those from the first. Reorder the indices of the second list by those of the first.
x <- c(3L, 1L, 2L, 3L, 3L, 2L)
y <- c(3L, 3L, 3L, 3L, 1L, 3L)
a <- split(seq_along(x), x)
b <- split(seq_along(y), y)[names(a)]
b[lengths(b)==0] <- NA
b <- unlist(Map(`length<-`, b, lengths(a)), FALSE, FALSE)
`[<-`(b, unlist(a, FALSE, FALSE), b)
#[1] 1 5 NA 2 3 NA
I tried to exchange the part
b <- split(seq_along(y), y)[names(a)]
b[lengths(b)==0] <- NA
with
b <- list2env(split(seq_along(y), y))
b <- mget(names(a), b, ifnotfound = NA)
But it was not faster.
An RCPP version.
Store the indices of the second vector ín a queue
for each unique value in an unordered_map
. Iterate over all values of the first vector and take the indices from the queue
.
Rcpp::sourceCpp(code=r"(
#include <Rcpp.h>
#include <unordered_map>
#include <queue>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector pm(const std::vector<int>& a, const std::vector<int>& b) {
IntegerVector idx(no_init(a.size()));
std::unordered_map<int, std::queue<int> > lut;
for(int i = 0; i < b.size(); ++i) lut[b[i]].push(i);
for(int i = 0; i < idx.size(); ++i) {
auto search = lut.find(a[i]);
if(search != lut.end() && search->second.size() > 0) {
idx[i] = search->second.front() + 1;
search->second.pop();
} else {idx[i] = NA_INTEGER;}
}
return idx;
}
)")
pm(x, y)
#[1] 1 5 NA 2 3 NA
A for this case specialized RCPP version.
Create a vector of the length of the maximum value of the first vector and count how many times a value is present. Create another queue
vector of the same length and sore there the indices of the values of the second vector until it has reached the number of the first. Iterate over all values of the first vector and take the indices from the queue
.
Rcpp::sourceCpp(code=r"(
#include <Rcpp.h>
#include <vector>
#include <array>
#include <queue>
#include <algorithm>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector pm2(const std::vector<int>& a, const std::vector<int>& b) {
IntegerVector idx(no_init(a.size()));
int max = 1 + *std::max_element(a.begin(), a.end());
std::vector<int> n(max);
for(int i = 0; i < a.size(); ++i) ++n[a[i]];
std::vector<std::queue<int> > lut(max);
for(int i = 0; i < b.size(); ++i) {
if(b[i] < max && n[b[i]] > 0) {
--n[b[i]];
lut[b[i]].push(i);
}
}
for(int i = 0; i < idx.size(); ++i) {
auto & P = lut[a[i]];
if(P.size() > 0) {
idx[i] = P.front() + 1;
P.pop();
} else {idx[i] = NA_INTEGER;}
}
return idx;
}
)")
pm2(x,y)
#[1] 1 5 NA 2 3 NA
Benchmark
set.seed(5)
x <- sample(5e3, 1e5, replace = TRUE)
y <- sample(x, replace = TRUE)
library(data.table)
matchall <- function(x, y) {
data.table(y, rowid(y))[
data.table(x, rowid(x)), on = .(y = x, V2), which = TRUE
]
}
rmatch <- function(x, y) {
xp <- cbind(seq_along(x), x)[order(x),]
yp <- cbind(seq_along(y), y)[order(y),]
result <- numeric(length(x))
xi <- yi <- 1
Nx <- length(x)
Ny <- length(y)
while (xi <= Nx) {
if (yi > Ny) {
result[xp[xi,1]] <- NA
xi <- xi + 1
} else if (xp[xi,2] == yp[yi,2]) {
result[xp[xi,1]] = yp[yi,1]
xi <- xi + 1
yi <- yi + 1
} else if (xp[xi,2] < yp[yi,2]) {
result[xp[xi,1]] <- NA
xi <- xi + 1
} else if (xp[xi,2] > yp[yi,2]) {
yi <- yi + 1
}
}
result
}
bench::mark(
ave = ave(x, x, FUN = \(v) which(y == v[1])[1:length(v)]),
rmatch = rmatch(x, y),
make.name = match(make.names(x, TRUE), make.names(y, TRUE)),
paste = do.call(match, lapply(list(x, y), \(v) paste(v, ave(v, v, FUN = seq_along)))),
make.unique = match(make.unique(as.character(x)), make.unique(as.character(y))),
split = {a <- split(seq_along(x), x)
b <- split(seq_along(y), y)[names(a)]
b[lengths(b)==0] <- NA
b <- unlist(Map(`length<-`, b, lengths(a)), FALSE, FALSE)
`[<-`(b, unlist(a, FALSE, FALSE), b)},
data.table = matchall(x, y),
RCPP = pm(x, y),
RCPP2 = pm2(x, y)
)
Result
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
1 ave 1.66s 1.66s 0.603 3.73GB 68.7 1 114
2 rmatch 258.29ms 259.35ms 3.86 5.34MB 30.8 2 16
3 make.name 155.69ms 156.82ms 6.37 14.06MB 1.59 4 1
4 paste 93.8ms 102.06ms 9.74 18.13MB 7.79 5 4
5 make.unique 81.67ms 92.8ms 10.4 9.49MB 5.22 6 3
6 split 12.66ms 13.16ms 65.8 7.18MB 16.0 33 8
7 data.table 6.22ms 6.89ms 114. 5.13MB 28.0 57 14
8 RCPP 3.06ms 3.2ms 301. 393.16KB 3.98 151 2
9 RCPP2 1.64ms 1.82ms 514. 393.16KB 8.00 257 4
In this case the C++ version is the fastest and allocates the lowest amount of memory. In case using base the splitB variant is the fastest and rmatch allocates the lowest amount of memory.
split
option was the most efficient base solution. – Multinational