Find which column ranges overlap after grouping in R

I have a huge data frame that looks like the one below.

I want to group_by(chr) and then, within each chr group, find out:

  • Does any range1 (start1, end1) overlap with any range2 (start2, end2)?

library(dplyr)

df1 <- tibble(chr=c(1,1,2,2),
               start1=c(100,200,100,200),
               end1=c(150,400,150,400),
       species=c("Penguin"), 
       start2=c(200,200,500,1000), 
       end2=c(250,240,1000,2000)
       )

df1
#> # A tibble: 4 × 6
#>     chr start1  end1 species start2  end2
#>   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl>
#> 1     1    100   150 Penguin    200   250
#> 2     1    200   400 Penguin    200   240
#> 3     2    100   150 Penguin    500  1000
#> 4     2    200   400 Penguin   1000  2000

Created on 2023-01-05 with reprex v2.0.2

I want my data to look like this. Essentially, I want to check whether each range2 overlaps with any range1 in the same chr group.

# A tibble: 4 × 7
        chr start1  end1 species start2  end2 OVERLAP
         1    100   150 Penguin    200   250    TRUE
         1    200   400 Penguin    200   240    TRUE
         2    100   150 Penguin    500  1000    FALSE
         2    200   400 Penguin   1000  2000    FALSE

I have fought a lot with the ivs package and iv_overlaps with no success in getting what I want.

Major EDIT:


When I apply any of the suggested code to real data, I do not get the results I want, and I am confused as to why. The new dataset does not change the question; it only serves to check the code.

data <- tibble::tribble(
  ~chr, ~start1, ~end1, ~strand, ~gene, ~start2, ~end2,
  "Chr2",   2739,   2840, "+", "A",    740,   1739,
  "Chr2",  12577,  12678, "+", "B",  10578,  11577,
  "Chr2",  22431,  22532, "+", "C",  20432,  21431,
  "Chr2",  32202,  32303, "+", "D",  30203,  31202,
  "Chr2",  42024,  42125, "+", "E",  40025,  41024,
  "Chr2",  51830,  51931, "+", "F",  49831,  50830,
  "Chr2",  82061,  84742, "+", "G",  80062,  81061,
  "Chr2",  84811,  86692, "+", "H",  82812,  83811,
  "Chr2",  86782,  88106, "-", "I",  88107,  89106,
  "Chr2", 139454, 139555, "+", "J", 137455, 138454,
  )

data %>% 
  group_by(chr) %>% 
  mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))

This gives the following output:

 chr   start1   end1 strand gene  start2   end2 overlap
   <chr>  <dbl>  <dbl> <chr>  <chr>  <dbl>  <dbl> <lgl>  
 1 Chr2    2739   2840 +      A        740   1739 TRUE   
 2 Chr2   12577  12678 +      B      10578  11577 TRUE   
 3 Chr2   22431  22532 +      C      20432  21431 TRUE   
 4 Chr2   32202  32303 +      D      30203  31202 TRUE   
 5 Chr2   42024  42125 +      E      40025  41024 TRUE   
 6 Chr2   51830  51931 +      F      49831  50830 TRUE   
 7 Chr2   82061  84742 +      G      80062  81061 TRUE   
 8 Chr2   84811  86692 +      H      82812  83811 TRUE   
 9 Chr2   86782  88106 -      I      88107  89106 TRUE   
10 Chr2  139454 139555 +      J     137455 138454 TRUE

This is wrong. There might be indirect matches, but there is no direct overlap.

Stamina answered 5/1, 2023 at 16:29 Comment(8)
I think it depends on what you mean by overlap. Check the type argument of iv_overlaps. Can you clarify what you mean by overlap? For instance, in your new data, row 7 overlaps with row 8, at least partially, which is why iv_overlaps returns TRUE. Will update my answer accordingly. - Delay
Indeed, only one row overlaps; the rest do not. That's a great catch, Maël. So ideally, I should have 9 rows with FALSE and one with TRUE. - Stamina
Right. Try my code without any() then. - Delay
@Stamina if you remove the outer any() from @Maël's or my answer, you get 9 rows with FALSE and one with TRUE on the new data. Is that what you want? But with that rule your first dataset will give (F, T, F, F). Obviously the two rules in the old and new datasets are not consistent. - Beore
You are right, Darren, but they should be consistent. Theoretically, it's something so easy: I want to check if any range2 overlaps with any range1. With any() I get overlaps that do not exist. It might give the expected outcome on the small dataset, but that does not mean that it is correct. The rule is the same in both datasets. The new dataset does not introduce a new question; it only checks the code. - Stamina
The main issue was that different interpretations of your question could lead to the same results, which is why it no longer worked when you tried a different scenario. But now I think we've understood. - Delay
You wanted to know if any range2 overlapped with any range1 by the chr column, so shouldn't your second example just be Chr2 -> TRUE? - Crosscountry
In the second example, all rows should be FALSE except for one that is TRUE. Good point. - Stamina

Your question can be interpreted in several ways, so here are three possible cases:

  1. Within a group, detect for each [start1, end1] whether it overlaps with any of the [start2, end2].
  2. Within a group, detect whether any of the [start1, end1] overlaps with any of the [start2, end2].
  3. Within a group, detect whether each [start1, end1] overlaps with its corresponding [start2, end2] (the one on the same row).

In all three cases, you can use ivs::iv_overlaps.


Case 1

iv_overlaps will detect, within each group, if the intervals defined in [start1, end1] overlap in any way with any of the intervals [start2, end2]. It'll return a logical vector of the length of [start1, end1].

library(ivs)
library(dplyr)
df1 %>% 
  group_by(chr) %>% 
  mutate(overlap = iv_overlaps(iv(start1, end1), iv(start2, end2)))

# A tibble: 4 × 7
# Groups:   chr [2]
    chr start1  end1 species start2  end2 overlap
  <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
1     1    100   150 Penguin    200   250 FALSE  
2     1    200   400 Penguin    160   170 TRUE   
3     2    100   150 Penguin    500  1000 FALSE  
4     2    200   400 Penguin   1000  2000 FALSE  
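
As a side note, the kind of overlap that iv_overlaps looks for can be narrowed with its type argument, and ivs treats intervals as half-open, [start, end), so two intervals that merely touch are not reported as overlapping. A minimal sketch, using toy vectors that mirror chr 1 of the data (the names needles, haystack and the result variables are only illustrative):

```r
library(ivs)

# Toy intervals mirroring chr 1 of the data used in this answer
needles  <- iv(c(100, 200), c(150, 400))   # [start1, end1)
haystack <- iv(c(200, 160), c(250, 170))   # [start2, end2)

# Default (type = "any"): any kind of overlap with at least one haystack interval
any_hit <- iv_overlaps(needles, haystack)

# Stricter: the needle must completely contain at least one haystack interval
contains_hit <- iv_overlaps(needles, haystack, type = "contains")

# Stricter: the needle must lie completely within at least one haystack interval
within_hit <- iv_overlaps(needles, haystack, type = "within")

# Half-open intervals: [100, 150) and [150, 200) touch but do not overlap
touching <- iv_overlaps(iv(100, 150), iv(150, 200))

print(list(any = any_hit, contains = contains_hit,
           within = within_hit, touching = touching))
```

If touching intervals should count as overlapping for your data, the end points would have to be adjusted before building the intervals.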

Case 2

If you want to know if any (not each) of the intervals 1 overlaps with any of the intervals 2 (so one unique value per group), you should add any:

df1 %>% 
  group_by(chr) %>% 
  mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))

# A tibble: 4 × 7
# Groups:   chr [2]
    chr start1  end1 species start2  end2 overlap
  <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
1     1    100   150 Penguin    200   250 TRUE   
2     1    200   400 Penguin    160   170 TRUE   
3     2    100   150 Penguin    500  1000 FALSE  
4     2    200   400 Penguin   1000  2000 FALSE  

Case 3

If you want row-wise overlap detection, use purrr::map2_lgl with iv_overlaps:

df1 %>% 
  group_by(chr) %>% 
  mutate(overlap = purrr::map2_lgl(iv(start1, end1), iv(start2, end2), iv_overlaps))

# A tibble: 4 × 7
# Groups:   chr [2]
    chr start1  end1 species start2  end2 overlap
  <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
1     1    100   150 Penguin    200   250 FALSE  
2     1    200   400 Penguin    160   170 FALSE  
3     2    100   150 Penguin    500  1000 FALSE  
4     2    200   400 Penguin   1000  2000 FALSE  

Order of the comparison

To test each [start2, end2] against all [start1, end1] in the group instead, swap the arguments, i.e. iv_overlaps(iv(start2, end2), iv(start1, end1)):

# A tibble: 4 × 7
# Groups:   chr [2]
    chr start1  end1 species start2  end2 overlap
  <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
1     1    100   150 Penguin    200   250 TRUE   
2     1    200   400 Penguin    160   170 FALSE  
3     2    100   150 Penguin    500  1000 FALSE  
4     2    200   400 Penguin   1000  2000 FALSE  
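
Put together with the real data from the question, the swapped call could look like the sketch below. It assumes that the goal is to flag each [start2, end2] that overlaps at least one [start1, end1] on the same chr, and it keeps the half-open interval convention of ivs. The variable name res is only illustrative.

```r
library(dplyr)
library(ivs)

# Sample of the real data posted in the question
data <- tibble::tribble(
  ~chr, ~start1, ~end1, ~strand, ~gene, ~start2, ~end2,
  "Chr2",   2739,   2840, "+", "A",    740,   1739,
  "Chr2",  12577,  12678, "+", "B",  10578,  11577,
  "Chr2",  22431,  22532, "+", "C",  20432,  21431,
  "Chr2",  32202,  32303, "+", "D",  30203,  31202,
  "Chr2",  42024,  42125, "+", "E",  40025,  41024,
  "Chr2",  51830,  51931, "+", "F",  49831,  50830,
  "Chr2",  82061,  84742, "+", "G",  80062,  81061,
  "Chr2",  84811,  86692, "+", "H",  82812,  83811,
  "Chr2",  86782,  88106, "-", "I",  88107,  89106,
  "Chr2", 139454, 139555, "+", "J", 137455, 138454
)

# Needles are the range2 intervals, the haystack is every range1 in the same chr;
# no outer any(), so each row gets its own flag
res <- data %>%
  group_by(chr) %>%
  mutate(overlap = iv_overlaps(iv(start2, end2), iv(start1, end1))) %>%
  ungroup()

print(res)
```

On this sample only gene H is flagged, because its range2 (82812 to 83811) falls inside the range1 of gene G (82061 to 84742). The same call on the question's first dataset gives TRUE, TRUE, FALSE, FALSE.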

Data

df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = c("Penguin"),
              start2 = c(200, 160, 500, 1000),
              end2 = c(250, 170, 1000, 2000))
Benedikta answered 5/1, 2023 at 16:45 Comment(10)
Thank you, Maël. I have added a piece of real data. It seems that the code does not work for me on my computer, and I don't know why. Clearly there are no overlaps, yet it reports an overlap. Thank you so much for taking the time. - Stamina
Thank you, Maël. I think this code still does not work universally. If you have a dataset like this: df1 <- tibble(chr=c(1,1,2,2), start1=c(100,200,100,200), end1=c(150,400,150,400), species=c("Penguin"), start2=c(200,160,500,1000), end2=c(250,170,1000,2000) ) it gives you an overlap when there is no overlap. - Stamina
You mean in row 2? There is an overlap between row 2 interval 1 and row 1 interval 2. You don't want to detect an overlap even in that case? So only row-wise? Check my last edit. - Delay
I am sorry that I was not clear. It should not be row-wise. We have range2, and we want to compare each range2 with each range1. If there is an overlap, then it's TRUE. We want to compare each row of range2 with all rows of range1 and then derive an outcome. - Stamina
Can you explain why you said 'it gives you an overlap when there is no overlap'? What would be the expected outcome here? - Delay
If I understand correctly now, you just need to swap the intervals in iv_overlaps: df1 %>% group_by(chr) %>% mutate(overlap = iv_overlaps(iv(start2, end2), iv(start1, end1))) - Delay
You are right. For example, in chr 1 you would expect the 200-250 (range2) interval to overlap with the 200-400 (range1) interval. It should be TRUE, FALSE, FALSE, FALSE. - Stamina
Great, then swapping the intervals is the solution. - Delay
And at the same time this needs to work for the real data as well. I really admire you, Maël, for keeping up with this. You are great. - Stamina
Wow, that's quite an investigation in here. - Baily

Scenario 1: Element-wise detection for overlapping

library(dplyr)

df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = any(start1 <= end2 & end1 >= start2)) %>%
  ungroup()

# # A tibble: 4 × 7
#     chr start1  end1 species start2  end2 OVERLAP
#   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
# 1     1    100   150 Penguin    200   250 TRUE   
# 2     1    200   400 Penguin    200   240 TRUE   
# 3     2    100   150 Penguin    500  1000 FALSE  
# 4     2    200   400 Penguin   1000  2000 FALSE

Scenario 2: Element-wise detection for overlapping with sorting

If the intervals are directed, i.e. end can be less than start, you need to sort the endpoints before determining overlaps.

df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = any(pmin(start1, end1) <= pmax(start2, end2) &
                       pmax(start1, end1) >= pmin(start2, end2)))

Scenario 3: Cross detection for overlapping with sorting

Furthermore, if you want to check whether an interval (start1, end1) overlaps any of the intervals (start2, end2), which is how ivs::iv_overlaps() works, you can implement it with purrr::map2_lgl.

df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = any(
    purrr::map2_lgl(start1, end1,
                    ~ any(min(.x, .y) <= pmax(start2, end2) &
                          max(.x, .y) >= pmin(start2, end2)))
  ))
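
If one value per row is wanted rather than one per group, the cross comparison can be kept and only the outer any() dropped. A minimal sketch without purrr, assuming closed intervals with start <= end and that each range2 should be tested against every range1 in its chr group; overlaps_any is a hypothetical helper name, not part of any package:

```r
library(dplyr)

# Hypothetical helper: for each interval [s2[i], e2[i]], is there at least one
# interval [s1[j], e1[j]] that it overlaps? Intervals are treated as closed.
overlaps_any <- function(s2, e2, s1, e1) {
  vapply(seq_along(s2),
         function(i) any(s2[i] <= e1 & e2[i] >= s1),
         logical(1))
}

# First dataset from the question
df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = "Penguin",
              start2 = c(200, 200, 500, 1000),
              end2 = c(250, 240, 1000, 2000))

res <- df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = overlaps_any(start2, end2, start1, end1)) %>%
  ungroup()

print(res)
```

Because the intervals are closed here, intervals that only share an endpoint count as overlapping, which differs from the half-open convention used by ivs.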
Beore answered 5/1, 2023 at 16:36 Comment(4)
Will this work even if the overlap is on a different column? - Delay
I checked and it does not work in that specific case. This depends on what the OP means by any. - Delay
@Benedikta Thanks. Yes, this depends on the OP's request. If they want the effect of ivs::iv_overlaps(), I have provided an alternative to achieve it. - Beore
Thank you so much, Darren, for having a look. I tested the code on real data and it did not work for me. I posted some real data in the question. Could you also have a look? - Stamina

If you want to check whether the overlap occurs in either direction, you need:

df1 %>%
  group_by(chr) %>%
  mutate(overlap = (max(end1) > min(start2) & min(start2) > min(start1))|
                   (max(end2) > min(start1) & min(start1) > min(start2))) 
#> # A tibble: 4 x 7
#> # Groups:   chr [2]
#>     chr start1  end1 species start2  end2 overlap
#>   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
#> 1     1    100   150 Penguin    200   250 TRUE   
#> 2     1    200   400 Penguin    200   240 TRUE   
#> 3     2    100   150 Penguin    500  1000 FALSE  
#> 4     2    200   400 Penguin   1000  2000 FALSE

Created on 2023-01-05 with reprex v2.0.2
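
One caveat with this approach, as far as I can tell: it compares only the outer bounds of all range1 and all range2 within a group, so it can report an overlap when a range2 sits in a gap between two range1 intervals. A small made-up example (the gap tibble is invented purely for illustration) contrasts it with a pairwise check:

```r
library(dplyr)

# Invented data: range2 (200-250) lies in the gap between the two range1
# intervals (100-150 and 300-400), so no pair of intervals actually overlaps
gap <- tibble(chr = c(1, 1),
              start1 = c(100, 300),
              end1 = c(150, 400),
              start2 = c(200, 200),
              end2 = c(250, 250))

res <- gap %>%
  group_by(chr) %>%
  mutate(
    # Bounds-based check from this answer
    envelope = (max(end1) > min(start2) & min(start2) > min(start1)) |
               (max(end2) > min(start1) & min(start1) > min(start2)),
    # Pairwise check: every range2 against every range1 (closed intervals)
    pairwise = any(outer(start2, end1, `<=`) & outer(end2, start1, `>=`))
  ) %>%
  ungroup()

print(res)
```

Here the bounds-based column is TRUE while the pairwise column is FALSE, so the former is only safe when the intervals within a group leave no gaps.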

Wally answered 5/1, 2023 at 16:41 Comment(2)
@DarrenTsai yes, you were right - updated now. - Wally
Thank you so much, Allan. I appreciate that you took the time. However, when I run the code on real data I do not get the results I want, and I am puzzled. Could you also have a look? A piece of real data is in the question. - Stamina

If your definition of overlap is not the one used in Darren's answer (https://mcmap.net/q/1636702/-find-which-column-ranges-overlap-after-grouping-in-r) but containment, i.e. (start1 >= start2 & end1 <= end2) | (start2 >= start1 & end2 <= end1), then take the code below and use whichever logic you want.

I use a self-join on chr (a cross join within each chr) to make sure every row is compared with every other row under the same chr.

Unfortunately, there IS a full containment in your test data:

 chr   start1   end1 strand gene  start2   end2 overlap
 7 Chr2   82061  84742 +      G      80062  81061 TRUE   
 8 Chr2   84811  86692 +      H      82812  83811 TRUE   

[start2, end2] for H is contained in [start1, end1] for G.

Code (note that performance will degrade rapidly if there are a lot of records under a single chr; over 200 is likely to be intolerable, and you'll want an implementation that doesn't involve a self-cross):

library(dplyr)

# TRUE for a chr if any [start1, end1] overlaps the [start2, end2] of another row
check_overlap <- function(df) {
  df %>%
    mutate(temp_id = seq_len(nrow(df))) %>%
    inner_join(., ., by = "chr") %>%        # self-join: all row pairs within a chr
    filter(temp_id.x != temp_id.y) %>%      # do not compare a row with itself
    mutate(overlaps = start1.x <= end2.y & end1.x >= start2.y) %>%
    group_by(chr) %>%
    summarise(OVERLAP = any(overlaps)) %>%
    inner_join(df, by = "chr")              # attach the per-chr flag to the rows
}

# Same structure, but TRUE only if one interval fully contains the other
check_containment <- function(df) {
  df %>%
    mutate(temp_id = seq_len(nrow(df))) %>%
    inner_join(., ., by = "chr") %>%
    filter(temp_id.x != temp_id.y) %>%
    mutate(overlaps = (start1.x >= start2.y & end1.x <= end2.y) |
                      (start2.y >= start1.x & end2.y <= end1.x)) %>%
    group_by(chr) %>%
    summarise(OVERLAP = any(overlaps)) %>%
    inner_join(df, by = "chr")
}
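
For larger groups, one way to avoid the full self-cross might be a non-equi join. The sketch below assumes dplyr >= 1.1.0, where join_by() and its overlaps() helper are available, treats the intervals as closed, and returns one flag per row instead of one per chr. The names flag_overlap and row_id are hypothetical and used only for this illustration.

```r
library(dplyr)

# Hypothetical helper: flag each row whose [start2, end2] overlaps at least
# one [start1, end1] on the same chr (closed intervals, dplyr >= 1.1.0)
flag_overlap <- function(df) {
  df <- df %>% mutate(row_id = row_number())

  # Keep only the (range2, range1) pairs that share a chr and overlap
  hits <- df %>%
    select(chr, row_id, start2, end2) %>%
    inner_join(
      df %>% select(chr, start1, end1),
      by = join_by(chr, overlaps(start2, end2, start1, end1))
    ) %>%
    distinct(row_id)

  df %>%
    mutate(OVERLAP = row_id %in% hits$row_id) %>%
    select(-row_id)
}

# First dataset from the question
df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = "Penguin",
              start2 = c(200, 200, 500, 1000),
              end2 = c(250, 240, 1000, 2000))

res <- flag_overlap(df1)
print(res)
```

Unlike check_overlap above, this sketch also compares a row's range2 with the range1 on the same row; filter on row identifiers after the join if that is not wanted.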
Crosscountry answered 6/1, 2023 at 1:3 Comment(0)
