Efficiently construct GRanges/IRanges from Rle vector - McMap

About

Efficiently construct GRanges/IRanges from Rle vector

Asked 30/8, 2016 at 8:54 Answered 19/9, 2016 at 9:28

r bioinformatics bioconductor run-length-encoding iranges

R

1

7

I have a Run length encoded vector representing some value at every position on the genome, in order. As a toy example suppose I had just one chromosome of length 10, then I would have a vector looking like

library(GenomicRanges)

set.seed(1)
toyData = Rle(sample(1:3,10,replace=TRUE))

I would like to coerce this into a GRanges object. The best I can come up with is

gr = GRanges('toyChr',IRanges(cumsum(c(0,runLength(toyData)[-nrun(toyData)])),
                              width=runLength(toyData)),
             toyData = runValue(toyData))

which works, but is quite slow. Is there a faster way to construct the same object?

Ridenhour answered 30/8, 2016 at 8:54 Comment(7)

you can use start(toyData)-1 to get the starts of the interval but it doesn't improve speed. – Uncharitable 13/9, 2016 at 14:46

@Uncharitable Thanks for tip, even if it's not faster it is much clearer and concise. – Ridenhour 15/9, 2016 at 13:59

<The whole cumsum can be replaced by start(toyData)-1> – Eldest 16/9, 2016 at 6:6

@user1356855, what are some typical chromosome lengths you would encounter? Also, would 3 be enough variation in your real-world application (e.g. could you have sample(1:15,10^8,replace=TRUE))? – Credent 19/9, 2016 at 8:56

@JosephWood Yeah, the values stored with genomic data are often real numbers so 3 would not be enough... But I would accept almost any answer. Longest genome: 247,249,719 for example? Chr1 humans... – Eldest 19/9, 2016 at 9:3

Could it be, that the problem is not transforming the Rle object. It is the IRanges/GRanges function which is in general the slow part? – Phung 19/9, 2016 at 10:5

@Jimbou, that is exactly what I'm thinking. I'm wondering how the folks at https://bioconductor.org/ would handle this. – Credent 19/9, 2016 at 12:10

C

4

As @TheUnfunCat pointed out, the OP's solution is pretty solid. The solution below is only moderately faster than the original solution. I tried almost every combination of base R and couldn't beat the efficiency Rle from the S4Vectors package, thus I resorted to Rcpp. Here is the main function :

GenomeRcpp <- function(v) {
    x <- WhichDiffZero(v)
    m <- v[c(1L,x+1L)]
    s <- c(0L,x)
    e <- c(x,length(v))-1L
    GRanges('toyChr',IRanges(start = s, end = e), toyData = m)
}

The WhichDiffZero is the Rcpp function that pretty much does the exact same thing as which(diff(v) != 0) in base R. Much credit goes to @G.Grothendieck.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector WhichDiffZero(IntegerVector x) {
    int nx = x.size()-1;
    std::vector<int> y;
    y.reserve(nx);
    for(int i = 0; i < nx; i++) {
        if (x[i] != x[i+1]) y.push_back(i+1);
    }
    return wrap(y);
}

Below are some benchmarks:

set.seed(437)
testData <- do.call(c,lapply(1:10^5, function(x) rep(sample(1:50, 1), sample(1:30, 1))))

microbenchmark(GenomeRcpp(testData), GenomeOrig(testData))
Unit: milliseconds
                expr      min       lq     mean   median       uq      max neval cld
GenomeRcpp(testData) 20.30118 22.45121 26.59644 24.62041 27.28459 198.9773   100   a
GenomeOrig(testData) 25.11047 27.12811 31.73180 28.96914 32.16538 225.1727   100   a

identical(GenomeRcpp(testData), GenomeOrig(testData))
[1] TRUE

I've been working on this off and on for the past few days and I'm definitely not satisfied. I'm hoping that someone will take what I've done (as it is a different approach) and create something much better.

Credent answered 19/9, 2016 at 9:28 Comment(2)

This might mean the OPs metadata contained non-vectorized data? Having objects in "vectors" is possible in pandas, dunno about R... – Eldest 19/9, 2016 at 9:52

I have to admit, I'm not completely satisfied either. It seems like the GRanges object is sufficiently similar to a Rle vector (for the one chromosome) that the construction step should be basically instant. Instead it's the slowest part of my code. Clearly I don't understand the internals well enough to know why this is wrong/how to make it faster. The Rcpp alternative is neat though and does give some extra speed. Thanks! – Ridenhour 19/9, 2016 at 15:1

Recommended topics

#Godot #Unity #Godot 4.X #Mongodb

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

© 2022 - 2024 — McMap. All rights reserved.