I have a huge data frame that looks like the one below.
I want to group_by(chr) and then, for each chr, find out:
- Does any range1 (start1, end1) within the chr group overlap with any range2 (start2, end2)?
library(dplyr)
df1 <- tibble(
  chr = c(1, 1, 2, 2),
  start1 = c(100, 200, 100, 200),
  end1 = c(150, 400, 150, 400),
  species = c("Penguin"),
  start2 = c(200, 200, 500, 1000),
  end2 = c(250, 240, 1000, 2000)
)
df1
#> # A tibble: 4 × 6
#> chr start1 end1 species start2 end2
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 100 150 Penguin 200 250
#> 2 1 200 400 Penguin 200 240
#> 3 2 100 150 Penguin 500 1000
#> 4 2 200 400 Penguin 1000 2000
Created on 2023-01-05 with reprex v2.0.2
I want my data to look like this. Essentially, I want to check whether each range2 overlaps with any range1 in the same chr group. The new data does not change the question; it only serves to check the code.
# A tibble: 4 × 7
    chr start1  end1 species start2  end2 OVERLAP
      1    100   150 Penguin    200   250 TRUE
      1    200   400 Penguin    200   240 TRUE
      2    100   150 Penguin    500  1000 FALSE
      2    200   400 Penguin   1000  2000 FALSE
I have fought a lot with the ivs package and iv_overlaps, with no success in getting what I want.
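One reading of the expected output above is that a row is TRUE when its range2 overlaps at least one range1 anywhere in the same chr group. Below is a minimal sketch of that reading only. It assumes closed intervals (shared endpoints count as an overlap), and overlaps_any() is a hypothetical helper written for this illustration, not a function from dplyr or ivs.

```r
library(dplyr)

# Hypothetical helper (not from dplyr or ivs): for each query interval,
# is there at least one reference interval that overlaps it?
# Assumption: intervals are closed, so shared endpoints count as overlap.
overlaps_any <- function(q_start, q_end, r_start, r_end) {
  vapply(
    seq_along(q_start),
    function(i) any(q_start[i] <= r_end & q_end[i] >= r_start),
    logical(1)
  )
}

df1 <- tibble(
  chr = c(1, 1, 2, 2),
  start1 = c(100, 200, 100, 200),
  end1 = c(150, 400, 150, 400),
  species = "Penguin",
  start2 = c(200, 200, 500, 1000),
  end2 = c(250, 240, 1000, 2000)
)

# Within each chr group, test every range2 against all range1 of that group.
res <- df1 %>%
  group_by(chr) %>%
  mutate(OVERLAP = overlaps_any(start2, end2, start1, end1)) %>%
  ungroup()

res$OVERLAP
# expected: TRUE TRUE FALSE FALSE
```

The helper returns one value per row, so there is no any() around the whole call; any() is only used inside the helper, per query interval.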
Major EDIT:
When I apply any of the suggested code to my real data, I do not get the results I want, and I am confused as to why. The new dataset does not change the question; it only serves to check the code.
data <- tibble::tribble(
~chr, ~start1, ~end1, ~strand, ~gene, ~start2, ~end2,
"Chr2", 2739, 2840, "+", "A", 740, 1739,
"Chr2", 12577, 12678, "+", "B", 10578, 11577,
"Chr2", 22431, 22532, "+", "C", 20432, 21431,
"Chr2", 32202, 32303, "+", "D", 30203, 31202,
"Chr2", 42024, 42125, "+", "E", 40025, 41024,
"Chr2", 51830, 51931, "+", "F", 49831, 50830,
"Chr2", 82061, 84742, "+", "G", 80062, 81061,
"Chr2", 84811, 86692, "+", "H", 82812, 83811,
"Chr2", 86782, 88106, "-", "I", 88107, 89106,
"Chr2", 139454, 139555, "+", "J", 137455, 138454,
)
library(ivs)  # provides iv() and iv_overlaps()

data %>%
  group_by(chr) %>%
  mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))
This gives the following output:
chr start1 end1 strand gene start2 end2 overlap
<chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <lgl>
1 Chr2 2739 2840 + A 740 1739 TRUE
2 Chr2 12577 12678 + B 10578 11577 TRUE
3 Chr2 22431 22532 + C 20432 21431 TRUE
4 Chr2 32202 32303 + D 30203 31202 TRUE
5 Chr2 42024 42125 + E 40025 41024 TRUE
6 Chr2 51830 51931 + F 49831 50830 TRUE
7 Chr2 82061 84742 + G 80062 81061 TRUE
8 Chr2 84811 86692 + H 82812 83811 TRUE
9 Chr2 86782 88106 - I 88107 89106 TRUE
10 Chr2 139454 139555 + J 137455 138454 TRUE
This is wrong. These might be indirect matches, but there is no direct overlap.
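A likely reason every row comes out TRUE: any() collapses the per-row results of a group into a single value, and that one value is then recycled to every row of the group. The base R sketch below reproduces this with the coordinates from the data above. It is an illustration, not the ivs implementation: the strict inequalities are an assumption meant to mirror the half-open [start, end) intervals that ivs uses.

```r
start1 <- c(2739, 12577, 22431, 32202, 42024, 51830, 82061, 84811, 86782, 139454)
end1   <- c(2840, 12678, 22532, 32303, 42125, 51931, 84742, 86692, 88106, 139555)
start2 <- c(740, 10578, 20432, 30203, 40025, 49831, 80062, 82812, 88107, 137455)
end2   <- c(1739, 11577, 21431, 31202, 41024, 50830, 81061, 83811, 89106, 138454)

# For each range1, does it overlap at least one range2 in the group?
# Assumption: half-open intervals, hence the strict comparisons.
per_row <- vapply(
  seq_along(start1),
  function(i) any(start1[i] < end2 & end1[i] > start2),
  logical(1)
)

which(per_row)
# expected: 7 (gene G's range1 contains gene H's range2)

# any() reduces the ten per-row values to a single TRUE ...
collapsed <- any(per_row)

# ... and a length-one value is recycled to all rows of the group.
recycled <- data.frame(gene = LETTERS[1:10], overlap = collapsed)
all(recycled$overlap)
# expected: TRUE
```

Dropping the outer any() keeps the per-row vector, which gives one TRUE and nine FALSE for this data.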
If you remove any() from @Maël's or my answer, you get 9 rows with FALSE and one with TRUE for the new data. Is that what you want? With that rule, though, your first dataset gives (F, T, F, F). The rules implied by the old and new datasets are clearly not consistent. – Beoreany
I get overlaps that do not exist. It might give a direct outcome for the small dataset, but that does not mean that it is correct. The rule is the same in both datasets. The new dataset does not introduce a new question. It only checks the code – Stamina
chr column, so shouldn't your second example just be Chr2 -> TRUE? – Crosscountry