Padding or filling a dataframe in R if I know the range

Asked 23/2, 2023 at 21:29 Answered 24/2, 2023 at 22:21

I'm looking for something similar to bedtools subtract but with dataframes.

For example, say I have the range as a dataframe here:

Start End Value
0 100 P

And I have another dataframe, which is sorted:

Start End Value
10 25 A
50 63 B

Would there be a way to fill this like so:

Start End Value
 0   9 P1
10  25 A
26  49 P2
50  63 B
64 100 P3

Here P1, P2 and P3 are labels filled in to pad the second dataframe so that the entire range is covered.

I tried using dplyr's lag() function and adding the padding values manually, but the range can change depending on the length of the genomic feature (including the start and end coordinates), so I want the range filling to be automatic.

Thank you!

For example, this is a small subset of the data:

data_range<- data.frame(start=0, end=100, value="P")

tofill_range<- data.frame(start=c(15, 51, 70),end = c(39, 62, 79), value = c("A","B","C"))

Heeler answered 23/2, 2023 at 21:29 Comment(2)

Any reason for not using bedtools without R? There are also R wrapper packages for bedtools (bedr, bedtoolsr) and the native R package bedtorch. – Socket 23/2, 2023 at 21:45

The data isn't really chromosomal coordinate data, since the value column needs to be filled with metadata. I used that as an example because it was the closest one I could think of! – Heeler 23/2, 2023 at 21:54

In base R:

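all_ranges <- function(df1, df2){
  # Collect every boundary: the outer range, the known intervals, and the
  # positions just outside each known interval (start - 1, end + 1)
  a <- sort(c(t(df1[-3]), t(df2[-3]), t(df2[-3]) + c(-1,1)))
  # Consecutive pairs of sorted boundaries are the (start, end) of each interval
  b <- data.frame(t(matrix(a,2)))
  # Outer join: intervals not present in df2 get NA in the value column
  d <- merge(df2, setNames(b, names(df1)[-3]), all = TRUE)
  # Label the gaps with the value from df1 plus a running number (P1, P2, ...)
  replace(d, is.na(d), paste0(df1[,3], seq(sum(is.na(d)))))
}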

data_range<- data.frame(start=0, end=100, value="P")

tofill_range<- data.frame(start=c(15, 51, 70),end = c(39, 62, 79), value = c("A","B","C"))

all_ranges(data_range, tofill_range)
#>   start end value
#> 1     0  14    P1
#> 2    15  39     A
#> 3    40  50    P2
#> 4    51  62     B
#> 5    63  69    P3
#> 6    70  79     C
#> 7    80 100    P4

Created on 2023-02-23 with reprex v2.0.2
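As far as I can tell, the function above relies on every interval in df2 sitting strictly inside the range and not directly next to another interval; an interval starting at 0, for instance, would add a boundary at -1. If that can happen in your data, a more defensive formulation might look like the sketch below. It is a different approach from all_ranges(), the name fill_gaps is hypothetical, and it assumes integer coordinates, columns ordered start / end / label, and non-overlapping intervals that lie inside the range.

```r
# Hypothetical helper (not the all_ranges() function above).
# Assumes integer coordinates, columns ordered start / end / label, and
# non-overlapping intervals in `fill` that lie inside `range`.
fill_gaps <- function(range, fill) {
  fill <- fill[order(fill[[1]]), ]
  # A gap can open at the range start or one past the end of each interval ...
  gap_start <- c(range[[1]], fill[[2]] + 1)
  # ... and closes one before the next interval starts, or at the range end.
  gap_end <- c(fill[[1]] - 1, range[[2]])
  keep <- gap_start <= gap_end   # drop empty gaps (touching intervals)
  gaps <- data.frame(
    gap_start[keep],
    gap_end[keep],
    sprintf("%s%d", as.character(range[[3]]), seq_len(sum(keep))),
    stringsAsFactors = FALSE
  )
  names(gaps) <- names(fill)
  out <- rbind(fill, gaps)
  out <- out[order(out[[1]]), ]
  rownames(out) <- NULL
  out
}

data_range   <- data.frame(start = 0, end = 100, value = "P")
tofill_range <- data.frame(start = c(15, 51, 70), end = c(39, 62, 79),
                           value = c("A", "B", "C"))

filled <- fill_gaps(data_range, tofill_range)
```

With the sample data this should give the same seven rows as the output above. When an interval touches the edge of the range, the corresponding padding row is simply omitted.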

Giule answered 24/2, 2023 at 1:53 Comment(0)

Here is one way to fill the range using only dplyr. For your second example I renamed the columns to Start, End and Value so that they match the first. With some more work the function could be made to handle any column names.

library(dplyr)

calc_range <- function(df1, df2) {
  df3 <- df2 %>% 
    transmute(start = End + 1,
              End = Start - 1) %>% 
    rename(Start = start)
  
  start_df <- bind_rows(df1, df2, df3)
  
  start_df %>% 
    select(!Value) %>% 
    unlist %>% 
    sort %>% 
    matrix(ncol = 2, byrow = TRUE) %>% 
    data.frame() %>% 
    rename(Start = X1, End = X2) %>% 
    left_join(start_df, by = c("Start", "End")) %>% 
    mutate(Value = ifelse(is.na(Value) | Value == "P",
                          paste0("P", cumsum(is.na(Value) | Value == "P")),
                          Value)) %>% 
    arrange(Start)
}

# Test 1

dfa <- tribble(
  ~Start, ~End, ~Value,
  0, 100, "P"
)

dfb <- tribble(~Start, ~End, ~Value,
               10, 25, "A",
               50, 63, "B")

calc_range(dfa, dfb)
#>   Start End Value
#> 1     0   9    P1
#> 2    10  25     A
#> 3    26  49    P2
#> 4    50  63     B
#> 5    64 100    P3

# Test 2 
data_range <- data.frame(Start=0, End=100, Value="P")

tofill_range <- data.frame(Start=c(15, 51, 70),
                          End = c(39, 62, 79),
                          Value = c("A","B","C"))

calc_range(data_range, tofill_range)
#>   Start End Value
#> 1     0  14    P1
#> 2    15  39     A
#> 3    40  50    P2
#> 4    51  62     B
#> 5    63  69    P3
#> 6    70  79     C
#> 7    80 100    P4

Created on 2023-02-23 with reprex v2.0.2
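A quick way to sanity-check a padded result is to confirm that consecutive rows tile the range exactly, with no gaps or overlaps. The sketch below is plain base R. The name covers_range is hypothetical (it is not a dplyr function), and it assumes integer coordinates with columns ordered start, end, label.

```r
# Hypothetical helper: TRUE if the rows of `filled` tile `range` exactly.
# Assumes integer coordinates and columns ordered start, end, label.
covers_range <- function(filled, range) {
  filled <- filled[order(filled[[1]]), ]
  n <- nrow(filled)
  n > 0 &&
    filled[[1]][1] == range[[1]] &&               # starts at the range start
    filled[[2]][n] == range[[2]] &&               # ends at the range end
    all(filled[[1]][-1] == filled[[2]][-n] + 1)   # each row follows the last
}

full_range <- data.frame(Start = 0, End = 100, Value = "P")

# The padded result from Test 1, typed in by hand
padded <- data.frame(Start = c(0, 10, 26, 50, 64),
                     End   = c(9, 25, 49, 63, 100),
                     Value = c("P1", "A", "P2", "B", "P3"))

# The unpadded input from Test 1
unpadded <- data.frame(Start = c(10, 50), End = c(25, 63),
                       Value = c("A", "B"))

ok_padded   <- covers_range(padded, full_range)    # TRUE
ok_unpadded <- covers_range(unpadded, full_range)  # FALSE
```

Because it only indexes columns by position, the same check can be applied to the output of any of the approaches on this page.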

Elanaeland answered 23/2, 2023 at 22:12 Comment(0)

Using dplyr (>= 1.1.0, required for consecutive_id()), you can find the missing ranges with between(). This uses data_range and tofill_range as defined in the question:

library(dplyr)

ranges <- rowSums(apply(tofill_range[,1:2], 1, function(x) 
  between(seq(data_range$start, data_range$end), x[1], x[2])))

as_tibble(cbind(ranges, grp = consecutive_id(ranges), 
            val = seq(data_range[,1], data_range[,2]))) %>% 
  group_by(grp) %>% 
  filter(ranges == 0) %>% 
  summarize(start = first(val), 
            end = last(val), 
            value = paste0(data_range$value, cur_group_id())) %>% 
  select(-grp) %>% 
  bind_rows(., tofill_range) %>% 
  arrange(start)
#> # A tibble: 7 × 3
#>   start   end value
#>   <dbl> <dbl> <chr>
#> 1     0    14 P1
#> 2    15    39 A
#> 3    40    50 P2
#> 4    51    62 B
#> 5    63    69 P3
#> 6    70    79 C
#> 7    80   100 P4

Sauer answered 23/2, 2023 at 23:52 Comment(0)

The Bioconductor package IRanges is well suited to this task:

library(IRanges)

r1 = IRanges(start = 0, end = 100, names = "P")
r2 = IRanges(start = c(10, 50), end = c(25, 63), names = c("A", "B"))

# find gaps
dif = setdiff(r1, r2)
names(dif) = sprintf("%s%d", names(r1), seq_len(length(dif)))

# merge and sort
ans = sort(c(r2, dif))

as.data.frame(ans)
#  start end width names
#1     0   9    10    P1
#2    10  25    16     A
#3    26  49    24    P2
#4    50  63    14     B
#5    64 100    37    P3

Abernathy answered 24/2, 2023 at 22:21 Comment(0)
