identify consecutively overlapping segments in R

Asked 30/8, 2018 at 7:56 Answered 30/8, 2018 at 8:49

Solved r data.table grouping overlap locf

I need to aggregate overlapping segments into a single segment ranging all connected segments.

Note that a simple foverlaps cannot detect connections between non overlapping but connected segments, see the example for clarification. If it would rain on my segments in my plot I am looking for the stretches of dry ground.

So far I solve this problem by an iterative algorithm but I'm wondering if there is a more elegant and stright forward way for this problem. I'm sure not the first one to face it.

I was thinking about a non-equi rolling join, but faild to implement that

library(data.table)
(x <- data.table(start = c(41,43,43,47,47,48,51,52,54,55,57,59), 
                  end = c(42,44,45,53,48,50,52,55,57,56,58,60)))

#     start end
#  1:    41  42
#  2:    43  44
#  3:    43  45
#  4:    47  53
#  5:    47  48
#  6:    48  50
#  7:    51  52
#  8:    52  55
#  9:    54  57
# 10:    55  56
# 11:    57  58
# 12:    59  60

setorder(x, start)[, i := .I] # i is just a helper for plotting segments
plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))

x$grp <- c(1,3,3,2,2,2,2,2,2,2,2,4) # the grouping I am looking for
do.call(segments, list(x$start, x$i, x$end, x$i, col = x$grp))
(y <- x[, .(start = min(start), end = max(end)), k=grp])

#    grp start end
# 1:   1    41  42
# 2:   2    47  58
# 3:   3    43  45
# 4:   4    59  60

do.call(segments, list(y$start, 12.2, y$end, 12.2, col = 1:4, lwd = 3))

EDIT:

That's brilliant, thanks, cummax & cumsum do the job, Uwe's Answer is slightly better than Davids comment.

end[.N] can get wrong results, try example data x below. max(end) is correct in all cases, and faster.

x <- data.table(start = c(11866, 12696, 13813, 14011, 14041), end = c(13140, 14045, 14051, 14039, 14045))
min(start) and start[1L] give the same (as x is ordered by start), the latter is faster.
grp on the fly is significantly faster, unfortunately I need grp assigned.
cumsum(cummax(shift(end, fill = 0)) < start) is significantly faster than cumsum(c(0, start[-1L] > cummax(head(end, -1L)))).
I did not test the package GenomicRanges solution.

Cottrell answered 30/8, 2018 at 7:56 Comment(1)

x[, .(start[1L], end[.N]), by = .(grp = cumsum(c(0, start[-1L] > cummax(head(end, -1L)))))] could work. Basically a data.table version of my solution here – Fonzie 30/8, 2018 at 8:46

The OP has requested to aggregate overlapping segments into a single segment ranging all connected segments.

Here is another solution which uses cummax() and cumsum() to identify groups of overlapping or adjacent segments:

x[order(start, end), grp := cumsum(cummax(shift(end, fill = 0)) < start)][
  , .(start = min(start), end = max(end)), by = grp]

   grp start end
1:   1    41  42
2:   2    43  45
3:   3    47  58
4:   4    59  60

Disclaimer: I have seen that clever approach somewhere else on SO but I cannot remember exactly where.

Edit:

As David Arenburg has pointed out, it is not necessary to create the grp variable separately. This can be done on-the-fly in the by = parameter:

x[order(start, end), .(start = min(start), end = max(end)), 
  by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))]

Visualisation

OP's plot can be amended to show also the aggregated segments (quick and dirty):

plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))
x[order(start, end), .(start = min(start), end = max(end)), 
  by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))][
    , segments(start, grp + 0.5, end, grp + 0.5, "red", , 4)]

Gordan answered 30/8, 2018 at 8:49 Comment(5)

You don't need to create grp, you can pass it directly into by. – Fonzie 30/8, 2018 at 9:8

@DavidArenburg there is no grp provided, and that is the main aim of the question. Creating the group – Gentianella 30/8, 2018 at 9:10

You mean OP wants the grp in their original data? Perhaps you are right – Fonzie 30/8, 2018 at 9:11

@Gentianella The OP has requested to aggregate overlapping segments into a single segment ranging all connected segments. – Gordan 30/8, 2018 at 9:16

I get it.. Although just saw OP wrote the grouping I am looking for. its still answers the solution – Gentianella 30/8, 2018 at 9:22

You can try a GenomicRanges approach. In the output each row is a group.

library(GenomicRanges)
x_gr <- with(x, GRanges(1, IRanges(start, end)))
as.data.table(reduce(x_gr, min.gapwidth=0))[,2:3]
   start end
1:    41  42
2:    43  45
3:    47  58
4:    59  60

And a visual insepection can be done using Gviz. Here one has to know that the package has been built for biologists and genetic information. The pattern behind are DNA bases. Hence, one has to substract 1 of the segment ends to get the correct plot.

library(Gviz)
ga <- Gviz::GenomeAxisTrack()
xgr <- with(x, GRanges(1, IRanges(start, end = end - 1)))
xgr_red <- reduce(xgr, min.gapwidth=1)
ga <- GenomeAxisTrack()
GT <- lapply(xgr, GeneRegionTrack)
GT_red <- lapply(xgr_red, GeneRegionTrack, fill = "lightblue")
plotTracks(c(ga, GT, GT_red),from = min(x$start), to = max(x$start)+2)

Noman answered 30/8, 2018 at 8:32 Comment(6)

The group is what exactly he is looking for. – Gentianella 30/8, 2018 at 8:46

in your edit you are still using x$grp=..... This is a manual input of the groups. It should not be the case – Gentianella 30/8, 2018 at 9:9

@Jimbou your very first statement cleary states #add the grouping But we do not have the grouping.. we are supposed to create/obtain the grouping. see the answer provided by Uwe – Gentianella 30/8, 2018 at 9:12

@Gentianella I deleted my first answer. Now you can only pass start and end, then you get the expected result. Happy with that? – Noman 30/8, 2018 at 9:17

This tackles the question. Love it – Gentianella 30/8, 2018 at 9:24

@Jimbou Great! Thank you for updating the chart. I will delete my comments as they are no longer needed. – Gordan 30/8, 2018 at 10:0

Visualisation

Recommended topics

Hot tags