Efficiently construct GRanges/IRanges from Rle vector
Asked Answered
R

1

7

I have a Run length encoded vector representing some value at every position on the genome, in order. As a toy example suppose I had just one chromosome of length 10, then I would have a vector looking like

library(GenomicRanges)

set.seed(1)
toyData = Rle(sample(1:3,10,replace=TRUE))

I would like to coerce this into a GRanges object. The best I can come up with is

gr = GRanges('toyChr',IRanges(cumsum(c(0,runLength(toyData)[-nrun(toyData)])),
                              width=runLength(toyData)),
             toyData = runValue(toyData))

which works, but is quite slow. Is there a faster way to construct the same object?

Ridenhour answered 30/8, 2016 at 8:54 Comment(7)
you can use start(toyData)-1 to get the starts of the interval but it doesn't improve speed.Uncharitable
@Uncharitable Thanks for tip, even if it's not faster it is much clearer and concise.Ridenhour
<The whole cumsum can be replaced by start(toyData)-1>Eldest
@user1356855, what are some typical chromosome lengths you would encounter? Also, would 3 be enough variation in your real-world application (e.g. could you have sample(1:15,10^8,replace=TRUE))?Credent
@JosephWood Yeah, the values stored with genomic data are often real numbers so 3 would not be enough... But I would accept almost any answer. Longest genome: 247,249,719 for example? Chr1 humans...Eldest
Could it be, that the problem is not transforming the Rle object. It is the IRanges/GRanges function which is in general the slow part?Phung
@Jimbou, that is exactly what I'm thinking. I'm wondering how the folks at https://bioconductor.org/ would handle this.Credent
C
4

As @TheUnfunCat pointed out, the OP's solution is pretty solid. The solution below is only moderately faster than the original solution. I tried almost every combination of base R and couldn't beat the efficiency Rle from the S4Vectors package, thus I resorted to Rcpp. Here is the main function :

GenomeRcpp <- function(v) {
    x <- WhichDiffZero(v)
    m <- v[c(1L,x+1L)]
    s <- c(0L,x)
    e <- c(x,length(v))-1L
    GRanges('toyChr',IRanges(start = s, end = e), toyData = m)
}

The WhichDiffZero is the Rcpp function that pretty much does the exact same thing as which(diff(v) != 0) in base R. Much credit goes to @G.Grothendieck.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector WhichDiffZero(IntegerVector x) {
    int nx = x.size()-1;
    std::vector<int> y;
    y.reserve(nx);
    for(int i = 0; i < nx; i++) {
        if (x[i] != x[i+1]) y.push_back(i+1);
    }
    return wrap(y);
}

Below are some benchmarks:

set.seed(437)
testData <- do.call(c,lapply(1:10^5, function(x) rep(sample(1:50, 1), sample(1:30, 1))))

microbenchmark(GenomeRcpp(testData), GenomeOrig(testData))
Unit: milliseconds
                expr      min       lq     mean   median       uq      max neval cld
GenomeRcpp(testData) 20.30118 22.45121 26.59644 24.62041 27.28459 198.9773   100   a
GenomeOrig(testData) 25.11047 27.12811 31.73180 28.96914 32.16538 225.1727   100   a

identical(GenomeRcpp(testData), GenomeOrig(testData))
[1] TRUE

I've been working on this off and on for the past few days and I'm definitely not satisfied. I'm hoping that someone will take what I've done (as it is a different approach) and create something much better.

Credent answered 19/9, 2016 at 9:28 Comment(2)
This might mean the OPs metadata contained non-vectorized data? Having objects in "vectors" is possible in pandas, dunno about R...Eldest
I have to admit, I'm not completely satisfied either. It seems like the GRanges object is sufficiently similar to a Rle vector (for the one chromosome) that the construction step should be basically instant. Instead it's the slowest part of my code. Clearly I don't understand the internals well enough to know why this is wrong/how to make it faster. The Rcpp alternative is neat though and does give some extra speed. Thanks!Ridenhour

© 2022 - 2024 — McMap. All rights reserved.