Find which column ranges overlap after grouping in R

Asked 5/1, 2023 at 16:29 Answered 6/1, 2023 at 1:3

Solved r dplyr data.table intervals genomicranges

I have a huge data frame that looks like this.

I want to group_by(chr) and then, within each chr group, answer this question:

Does any range1 (start1, end1) overlap with any range2 (start2, end2)?

library(dplyr)

df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = c("Penguin"),
              start2 = c(200, 200, 500, 1000),
              end2 = c(250, 240, 1000, 2000)
              )

df1
#> # A tibble: 4 × 6
#>     chr start1  end1 species start2  end2
#>   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl>
#> 1     1    100   150 Penguin    200   250
#> 2     1    200   400 Penguin    200   240
#> 3     2    100   150 Penguin    500  1000
#> 4     2    200   400 Penguin   1000  2000

Created on 2023-01-05 with reprex v2.0.2

I want my data to look like this. Essentially, I want to check whether each range2 overlaps with any range1 in the same chr group.

# A tibble: 4 × 7
        chr start1  end1 species start2  end2 OVERLAP
         1    100   150 Penguin    200   250    TRUE
         1    200   400 Penguin    200   240    TRUE
         2    100   150 Penguin    500  1000    FALSE
         2    200   400 Penguin   1000  2000    FALSE

I have fought a lot with the ivs package and iv_overlaps with no success in getting what I want.

Major EDIT:

When I apply any of the suggested code to real data, I do not get the results I want, and I am confused as to why. The new dataset does not change the question; it only serves to check the code.

data <- tibble::tribble(
  ~chr, ~start1, ~end1, ~strand, ~gene, ~start2, ~end2,
  "Chr2",   2739,   2840, "+", "A",    740,   1739,
  "Chr2",  12577,  12678, "+", "B",  10578,  11577,
  "Chr2",  22431,  22532, "+", "C",  20432,  21431,
  "Chr2",  32202,  32303, "+", "D",  30203,  31202,
  "Chr2",  42024,  42125, "+", "E",  40025,  41024,
  "Chr2",  51830,  51931, "+", "F",  49831,  50830,
  "Chr2",  82061,  84742, "+", "G",  80062,  81061,
  "Chr2",  84811,  86692, "+", "H",  82812,  83811,
  "Chr2",  86782,  88106, "-", "I",  88107,  89106,
  "Chr2", 139454, 139555, "+", "J", 137455, 138454,
  )

library(dplyr)
library(ivs)
data %>% 
  group_by(chr) %>% 
  mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))

This gives the following output:

   chr   start1   end1 strand gene  start2   end2 overlap
   <chr>  <dbl>  <dbl> <chr>  <chr>  <dbl>  <dbl> <lgl>  
 1 Chr2    2739   2840 +      A        740   1739 TRUE   
 2 Chr2   12577  12678 +      B      10578  11577 TRUE   
 3 Chr2   22431  22532 +      C      20432  21431 TRUE   
 4 Chr2   32202  32303 +      D      30203  31202 TRUE   
 5 Chr2   42024  42125 +      E      40025  41024 TRUE   
 6 Chr2   51830  51931 +      F      49831  50830 TRUE   
 7 Chr2   82061  84742 +      G      80062  81061 TRUE   
 8 Chr2   84811  86692 +      H      82812  83811 TRUE   
 9 Chr2   86782  88106 -      I      88107  89106 TRUE   
10 Chr2  139454 139555 +      J     137455 138454 TRUE

This is wrong. There might be indirect matches, but there is no direct overlap.

Stamina asked 5/1, 2023 at 16:29 Comment(8)

I think it depends on what you mean by overlap. Check the type argument of iv_overlaps. Can you clarify what you mean by overlap? For instance, in your new data, row 7 overlaps with row 8, at least partially, which is why iv_overlaps returns TRUE. I will update my answer accordingly. – Delay 5/1, 2023 at 22:18

Indeed, only one row overlaps, and the rest do not. That's a great catch, Maël. So ideally, I should have 9 rows with FALSE and one with TRUE – Stamina 5/1, 2023 at 22:21

Right. Try my code without any then. – Delay 5/1, 2023 at 22:27

@Stamina if you remove the outer any() of @Maël's or my answers, you can get 9 rows with FALSE and one with TRUE with the new data, is that what you want? But with that rule your first dataset will get (F, T, F, F). Obviously the two rules in the old and new datasets are not consistent. – Beore 6/1, 2023 at 2:26

You are right, Darren, but they should be consistent. Theoretically, it's something so easy. I want to check if any range2 overlaps with any range1. With any() I get overlaps that do not exist. It might give the right outcome on the small dataset, but that does not mean that it is correct. The rule is the same in both datasets. The new dataset does not introduce a new question. It only checks the code – Stamina 6/1, 2023 at 9:22

The main issue was that different interpretations of your question could lead to the same results, which is why when you tried on a different scenario it did not work any more. But now I think we've understood – Delay 6/1, 2023 at 9:51

You wanted to know if any range2 overlapped with any range1 by the chr column, so shouldn't your second example just be Chr2 -> TRUE? – Crosscountry 6/1, 2023 at 10:35

In the second example, all rows should be FALSE except for one that is TRUE. Good point – Stamina 6/1, 2023 at 10:38

There are several possible interpretations of your question, so here are three cases:

1. Within a group, detect for each [start1, end1] whether it overlaps with any of the [start2, end2].
2. Within a group, detect whether any of the [start1, end1] overlaps with any of the [start2, end2].
3. Within a group, detect whether each [start1, end1] overlaps with its corresponding [start2, end2] (the one on the same row).

In all three cases, you can use ivs::iv_overlaps.

Case 1

iv_overlaps will detect, within each group, if the intervals defined in [start1, end1] overlap in any way with any of the intervals [start2, end2]. It'll return a logical vector of the length of [start1, end1].

library(ivs)
library(dplyr)
df1 %>% 
  group_by(chr) %>% 
  mutate(overlap = iv_overlaps(iv(start1, end1), iv(start2, end2)))

# A tibble: 4 × 7
# Groups:   chr [2]
    chr start1  end1 species start2  end2 overlap
  <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
1     1    100   150 Penguin    200   250 FALSE  
2     1    200   400 Penguin    160   170 TRUE   
3     2    100   150 Penguin    500  1000 FALSE  
4     2    200   400 Penguin   1000  2000 FALSE
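To make the Case 1 semantics concrete, here is a minimal base R sketch of the same test for a single group. `overlaps_any()` is a hypothetical helper written for illustration, not part of ivs, and it assumes right-open intervals [start, end), which is the convention ivs uses.

```r
# Hypothetical helper (not part of ivs): for each interval (s1[i], e1[i]),
# does it overlap ANY interval in (s2, e2)? Intervals are treated as
# right-open, [start, end).
overlaps_any <- function(s1, e1, s2, e2) {
  vapply(
    seq_along(s1),
    function(i) any(s1[i] < e2 & s2 < e1[i]),
    logical(1)
  )
}

# chr 1 from the Data section: range1 = 100-150, 200-400; range2 = 200-250, 160-170
chr1 <- overlaps_any(c(100, 200), c(150, 400), c(200, 160), c(250, 170))
chr1
#> [1] FALSE  TRUE
```

The result has one element per range1 interval, which is why Case 1 can be dropped straight into mutate().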

Case 2

If you want to know if any (not each) of the intervals 1 overlaps with any of the intervals 2 (so one unique value per group), you should add any:

df1 %>% 
  group_by(chr) %>% 
  mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))

# A tibble: 4 × 7
# Groups:   chr [2]
    chr start1  end1 species start2  end2 overlap
  <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
1     1    100   150 Penguin    200   250 TRUE   
2     1    200   400 Penguin    160   170 TRUE   
3     2    100   150 Penguin    500  1000 FALSE  
4     2    200   400 Penguin   1000  2000 FALSE

Case 3

If you want row-wise overlap detection, then you should use purrr::map2_lgl with iv_overlaps:

df1 %>% 
  group_by(chr) %>% 
  mutate(overlap = purrr::map2_lgl(iv(start1, end1), iv(start2, end2), iv_overlaps))

# A tibble: 4 × 7
# Groups:   chr [2]
    chr start1  end1 species start2  end2 overlap
  <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
1     1    100   150 Penguin    200   250 FALSE  
2     1    200   400 Penguin    160   170 FALSE  
3     2    100   150 Penguin    500  1000 FALSE  
4     2    200   400 Penguin   1000  2000 FALSE
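As a possible alternative for the row-wise case, ivs also has a vectorised pairwise test, iv_pairwise_overlaps(), which compares element i of the first vector with element i of the second. The sketch below assumes an ivs version that exports this function and reuses the Data from this answer (the species column is left out for brevity).

```r
library(dplyr)
library(ivs)

df1 <- tibble(
  chr    = c(1, 1, 2, 2),
  start1 = c(100, 200, 100, 200),
  end1   = c(150, 400, 150, 400),
  start2 = c(200, 160, 500, 1000),
  end2   = c(250, 170, 1000, 2000)
)

# Row i of range1 is compared only with row i of range2,
# so grouping by chr is not needed for this interpretation.
res <- df1 %>%
  mutate(overlap = iv_pairwise_overlaps(iv(start1, end1), iv(start2, end2)))

res$overlap
```

Because the comparison never crosses rows, the result should agree with the Case 3 table above.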

Order of the comparison

The order of the arguments matters. If you want to test each of the second intervals against all of the first ones, swap the arguments, i.e. use iv_overlaps(iv(start2, end2), iv(start1, end1)):

# A tibble: 4 × 7
# Groups:   chr [2]
    chr start1  end1 species start2  end2 overlap
  <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
1     1    100   150 Penguin    200   250 TRUE   
2     1    200   400 Penguin    160   170 FALSE  
3     2    100   150 Penguin    500  1000 FALSE  
4     2    200   400 Penguin   1000  2000 FALSE
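As a sanity check, the swapped call can be run on the real data posted in the question. This is only a sketch: the strand column is dropped because it plays no role in the test, and `res` is a name chosen here for illustration.

```r
library(dplyr)
library(ivs)

# Real data from the question, reduced to the columns that matter here
data <- tibble::tribble(
  ~chr,   ~start1, ~end1,  ~gene, ~start2, ~end2,
  "Chr2",   2739,   2840,  "A",      740,   1739,
  "Chr2",  12577,  12678,  "B",    10578,  11577,
  "Chr2",  22431,  22532,  "C",    20432,  21431,
  "Chr2",  32202,  32303,  "D",    30203,  31202,
  "Chr2",  42024,  42125,  "E",    40025,  41024,
  "Chr2",  51830,  51931,  "F",    49831,  50830,
  "Chr2",  82061,  84742,  "G",    80062,  81061,
  "Chr2",  84811,  86692,  "H",    82812,  83811,
  "Chr2",  86782,  88106,  "I",    88107,  89106,
  "Chr2", 139454, 139555,  "J",   137455, 138454
)

res <- data %>%
  group_by(chr) %>%
  # each range2 is tested against every range1 of the same chr
  mutate(overlap = iv_overlaps(iv(start2, end2), iv(start1, end1))) %>%
  ungroup()

# Genes whose range2 overlaps some range1
res$gene[res$overlap]
#> [1] "H"
```

Only the range2 of gene H (82812-83811) falls inside a range1 (82061-84742, gene G), so this gives nine FALSE and one TRUE.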

Data

df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = c("Penguin"),
              start2 = c(200, 160, 500, 1000),
              end2 = c(250, 170, 1000, 2000)
              )

Benedikta answered 5/1, 2023 at 16:45 Comment(10)

Thank you, Maël. I have added a piece of real data. It seems that on my computer the code does not work for me, and I don't know why. Clearly, there are no overlaps, yet it reports an overlap. Thank you so much for taking the time – Stamina 5/1, 2023 at 20:58

Thank you, Maël. I think it still does not work universally with this code. If you have a dataset like this

df1 <- tibble(chr=c(1,1,2,2),               start1=c(100,200,100,200),               end1=c(150,400,150,400),               species=c("Penguin"),                start2=c(200,160,500,1000),                end2=c(250,170,1000,2000) )

it gives you an overlap when there is no overlap – Stamina 6/1, 2023 at 9:51

You mean in row 2? there is overlap between row 2 interval 1 and row 1 interval 2. you don't want to detect overlap even in that case? so only rowwise? Check my last edit – Delay 6/1, 2023 at 9:53

I am sorry that I was not clear. It should not be row-wise. We have range2, and we want to compare each range2 with each range1. If there is an overlap, then it's TRUE. We want to compare each row of range2 with all rows of range1 and then derive an outcome – Stamina 6/1, 2023 at 10:4

Can you explain to me why you said 'it gives you an overlap when there is no overlap'? what would be the expected outcome here? – Delay 6/1, 2023 at 10:7

If I understand now, you just need to inverse the intervals in iv_overlaps: df1 %>% group_by(chr) %>% mutate(overlap = iv_overlaps(iv(start2, end2), iv(start1, end1))) – Delay 6/1, 2023 at 10:7

You are right. For example, in Chr1 you would expect the 200-250 interval (range2) to overlap with the 200-400 interval (range1). It should be TRUE, FALSE, FALSE, FALSE – Stamina 6/1, 2023 at 10:11

Great, then reversing the intervals is the solution. – Delay 6/1, 2023 at 10:12

And at the same time this also needs to work for the real data. I really admire you, Maël, for keeping up with this. You are great – Stamina 6/1, 2023 at 10:12

wow thats quite an investigation in here – Baily 6/1, 2023 at 10:18

Scenario 1: Element-wise detection for overlapping

library(dplyr)

df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = any(start1 <= end2 & end1 >= start2)) %>%
  ungroup()

# # A tibble: 4 × 7
#     chr start1  end1 species start2  end2 OVERLAP
#   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
# 1     1    100   150 Penguin    200   250 TRUE   
# 2     1    200   400 Penguin    200   240 TRUE   
# 3     2    100   150 Penguin    500  1000 FALSE  
# 4     2    200   400 Penguin   1000  2000 FALSE
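One detail to keep in mind when comparing this with ivs: the condition above uses <= and >=, so it treats intervals as closed, whereas ivs treats them as right-open, [start, end). The two conventions differ only when intervals touch at an endpoint. A small base R sketch with made-up values:

```r
# Two made-up intervals that share only the endpoint 150
start1 <- 100; end1 <- 150
start2 <- 150; end2 <- 200

# Closed intervals: endpoints count as part of the interval
closed <- start1 <= end2 & end1 >= start2

# Right-open intervals [start, end): the end point is excluded
half_open <- start1 < end2 & end1 > start2

closed
#> [1] TRUE
half_open
#> [1] FALSE
```

If touching intervals should not count as overlapping, replace <= and >= with < and >.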

Scenario 2: Element-wise detection for overlapping with sorting

If the intervals are directed, i.e. end can be less than start, then you need to sort the endpoints of each interval before determining overlaps.

df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = any(pmin(start1, end1) <= pmax(start2, end2) &
                       pmax(start1, end1) >= pmin(start2, end2)))
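To see why the sorting matters, here is a small sketch with made-up values in which range2 is stored in reverse (start2 > end2). The unsorted comparison misses the overlap, while the pmin()/pmax() version finds it.

```r
# Made-up example: range1 is 200-400, range2 runs backwards from 450 to 350
start1 <- 200; end1 <- 400
start2 <- 450; end2 <- 350

# Unsorted comparison assumes start <= end and fails for the reversed interval
naive <- start1 <= end2 & end1 >= start2

# Sorting each interval's endpoints first makes the test direction-independent
sorted <- pmin(start1, end1) <= pmax(start2, end2) &
  pmax(start1, end1) >= pmin(start2, end2)

naive
#> [1] FALSE
sorted
#> [1] TRUE
```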

Scenario 3: Cross detection for overlapping with sorting

Furthermore, if you want to check whether an interval (start1, end1) overlaps any of the intervals (start2, end2), which is how ivs::iv_overlaps() works, then you can implement it with purrr::map2_lgl.

df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = any(
    purrr::map2_lgl(start1, end1,
                    ~ any(min(.x, .y) <= pmax(start2, end2) &
                          max(.x, .y) >= pmin(start2, end2)))
  ))

Beore answered 5/1, 2023 at 16:36 Comment(4)

Will this work even if the overlap is on different column? – Delay 5/1, 2023 at 16:37

I checked and it does not work on the specific case. This depends on what OPs means by any – Delay 5/1, 2023 at 16:54

@Benedikta Thanks. Yes, this depends on OP's request. If they want the effect of ivs::iv_overlaps(), I provide an alternative to achieve it. – Beore 5/1, 2023 at 19:25

Thank you so much Darren for having a look. I tested the code in real data and did not work for me. I posted some real data in the question. Could you also have a look – Stamina 5/1, 2023 at 20:59