As @TheUnfunCat pointed out, the OP's solution is pretty solid. The solution below is only moderately faster than the original solution. I tried almost every combination of base R
and couldn't beat the efficiency Rle
from the S4Vectors
package, thus I resorted to Rcpp
. Here is the main function :
GenomeRcpp <- function(v) {
x <- WhichDiffZero(v)
m <- v[c(1L,x+1L)]
s <- c(0L,x)
e <- c(x,length(v))-1L
GRanges('toyChr',IRanges(start = s, end = e), toyData = m)
}
The WhichDiffZero
is the Rcpp
function that pretty much does the exact same thing as which(diff(v) != 0)
in base R
. Much credit goes to @G.Grothendieck.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector WhichDiffZero(IntegerVector x) {
int nx = x.size()-1;
std::vector<int> y;
y.reserve(nx);
for(int i = 0; i < nx; i++) {
if (x[i] != x[i+1]) y.push_back(i+1);
}
return wrap(y);
}
Below are some benchmarks:
set.seed(437)
testData <- do.call(c,lapply(1:10^5, function(x) rep(sample(1:50, 1), sample(1:30, 1))))
microbenchmark(GenomeRcpp(testData), GenomeOrig(testData))
Unit: milliseconds
expr min lq mean median uq max neval cld
GenomeRcpp(testData) 20.30118 22.45121 26.59644 24.62041 27.28459 198.9773 100 a
GenomeOrig(testData) 25.11047 27.12811 31.73180 28.96914 32.16538 225.1727 100 a
identical(GenomeRcpp(testData), GenomeOrig(testData))
[1] TRUE
I've been working on this off and on for the past few days and I'm definitely not satisfied. I'm hoping that someone will take what I've done (as it is a different approach) and create something much better.
start(toyData)-1
to get the starts of the interval but it doesn't improve speed. – Uncharitablestart(toyData)-1
> – Eldestsample(1:15,10^8,replace=TRUE)
)? – CredentRle
object. It is theIRanges
/GRanges
function which is in general the slow part? – Phung