Processing the input file based on range overlap
Asked Answered
C

1

0

I have a huge input file (a representative sample of which is shown below as input):

> input
           CT1           CT2           CT3
1 chr1:200-400  chr1:250-450  chr1:400-800
2 chr1:800-970  chr2:200-500  chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400

I want to process it by following some rules (described below) so that I get an output like:

 > output
              CT1 CT2 CT3
chr1:200-400    1   1   0
chr1:800-970    1   0   0
chr2:300-700    1   1   0
chr1:250-450    1   1   0
chr2:200-500    1   1   0
chr2:600-1000   0   1   1
chr1:400-800    0   0   1
chr1:700-870    0   1   1
chr2:700-1400   0   1   1

Rules: Take every index (the first in this case is chr1:200-400), see if it overlap significantly with values in other column. From significantly I mean atleast 50% overlap of the range. If yes, write 1 below that column in which it exists, if not write 0.

Now I explain how I got the output table. From our input we take 1st index input[1,1] which is chr1:200-400. As it exists in column 1 we will write 1 below it. Now we will check if this or an overlapping range exists in any other column. This value overlaps only with the first value (chr1:250-450) of the second column (CT2). Now we check if this overlap is significant or not. We know that range is 200 (chr1:200-400), the overlap with 2nd column value (which is chr1:250-450) is 150 (250-400). As this overlap of 150 is greater than the half (50% of original range = 100) of original range (200-400 = 200) OR overlapping range (250-450 = 200). We consider it an overlap and assign 1 below the column CT2 as well. As this range does not overlap with any value in CT3, we write 0 below CT3. Similarly for row 9 of output. chr2:700-1400 doesn't exist in CT1 so write 0 below it. For CT2 it overlaps with chr2:600-1000. The original range here is 700 (chr2:700-1400), half of it is 350. The overlap with chr2:700-1000 of CT2 is 300 (out of actual range chr2:600-1000). Now this overlap of 300 is not greater than the half of actual range 700 (chr2:700-1400 of CT3) but it is greater than the half of overlapping range 400 (chr2:600-1000 of CT2). Therefore, we consider it an overlap and write 1 below CT2. As this range actually exists in CT3, we write 1 below it as well.

Here are the dput of input and output:

> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400", 
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450", 
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800", 
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1", 
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L, 
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L, 
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400", 
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500", 
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))
Corium answered 10/1, 2018 at 12:42 Comment(1)
Any hint about how to compare one value of a dataframe df[1,1] against the rest of the dataframe..???Corium
B
1

To do this requires many steps, and a number of concepts from the data.table package, most notably non-equi joins. I've commented the code throughout, and recommend running it step by step if you want more insight:

library(data.table)

input <- structure(list(CT1 = structure(1:3, .Label = 
  c("chr1:200-400", "chr1:800-970", "chr2:300-700"), class = 
  "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450", 
  "chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = 
  structure(1:3, .Label = c("chr1:400-800", "chr1:700-870", 
  "chr2:700-1400"), class = "factor")), .Names = c("CT1", 
  "CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))

output <- structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L),
  CT2 = c(1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 
  0L, 0L, 1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = 
  "data.frame", row.names = c("chr1:200-400", "chr1:800-970", 
  "chr2:300-700", "chr1:250-450", "chr2:200-500", "chr2:600-1000",
  "chr1:400-800", "chr1:700-870", "chr2:700-1400"))

# Builds a data.table by breaking a string like "chr1:300-700" into 
# three columns: chr, start, and end.
split_genomic_range <- function(str) {
  chr <- gsub(":.*", "", str)
  start <- gsub("-.*", "", gsub(".*:", "", str))
  end <- gsub(".*-", "", str)

  start <- as.numeric(start)
  end <- as.numeric(end)

  return(data.table(chr=chr, start=start, end=end))
}

# First break the input data.table into three new tables - we will need
# to perform non-equi joins of the index table (column CT1 in input) to
# the tables built from the other two columns.
ct1 <- split_genomic_range(input$CT1)
ct2 <- split_genomic_range(input$CT2)
ct3 <- split_genomic_range(input$CT3)

# Create an index table with all genomic ranges, then check for 
# overlaps in each of the three tables created from the input
# columns:
index_table <- unique(rbind(ct1, ct2, ct3))

# Returns entries from the index_table if they overlap > 50% any 
# entries in the lookup table or vice-versa
get_overlapping_ranges <- function(index_table, lookup_table) {
  # This function does two non-equi joins. First, it checks whether 
  # any entries in the index_table have a 50% overlap with any 
  # entries in  the lookup table. Second, it does the reverse, and  
  # checks whether any entries in the lookup_table have a 50% overlap 
  # with any entries in the index_table. This is required due to 
  # differing window sizes:
  # e.g. 0-20 significantly overlaps 10-100, but 10-100 does not 
  # significantly overlap 0-20.

  # We will need to create a "middle" column for each genomic range.
  # We will need to create copies of each table to do this, otherwise
  # they will end up with this new column as a side effect of the 
  # function call.
  index_copy <- copy(index_table)
  lookup_copy <- copy(lookup_table)

  index_copy[, middle := start + (end - start) * 0.5]
  lookup_copy[, middle := start + (end - start) * 0.5]

  # In the index_copy we will also need to create dummy columns for
  # the check. We need to do this so we can find the appropriate 
  # genomic window from the index table when we do the second  
  # non-equi join, otherwise the start and end columns will be 
  # clobbered. 
  index_copy[, start_index := start]
  index_copy[, end_index := end]

  # If the middle of a genomic range in the index table falls within 
  # a genomic range in the lookup table, then that genomic entry from 
  # the index table has a significant overlap with the corresponding 
  # in the lookup table
  index_overlaps <- index_copy[lookup_table, 
    on=.(chr, middle >= start, middle <= end),
    nomatch=0]

  # Test the reverse: does any entry in the lookup table 
  # significantly  overlap with any of the genomic windows in the 
  # index table?
  lookup_overlaps <- index_copy[lookup_copy,
    on=.(chr, start_index <= middle, end_index >= middle),
    nomatch=0]

  # Remove extra columns created by the non-equi join:
  index_overlaps <- index_overlaps[,.(chr, start, end)]
  lookup_overlaps <- lookup_overlaps[,.(chr, start, end)]

  # Combine results and remove any duplicates that arise, e.g. 
  # because a genomic window in the index_table significantly 
  # overlaps with multiple genomic windows in the lookup table, or 
  # because the overlap is significant in both comparisons (i.e. 
  # where both windows are the same size).
  overlaps <- unique(rbind(index_overlaps, lookup_overlaps))

  return(overlaps)
}

ranges_in_ct1 <- get_overlapping_ranges(index_table, ct1)
ranges_in_ct2 <- get_overlapping_ranges(index_table, ct2)
ranges_in_ct3 <- get_overlapping_ranges(index_table, ct3)

# Combine results, indicating which column each genomic range was 
# found to overlap:
overlaps <- rbind(
  CT1=ranges_in_ct1, CT2=ranges_in_ct2, CT3=ranges_in_ct3,
  idcol="input_column"
) 

# Recombine the chr, start, and end columns to the original format:
overlaps[, genomic_window := paste0(chr, ":", start, "-", end)]
overlaps[, c("chr", "start", "end") := NULL]

# Convert to the wide format, so that each input column either has a  
# 1 or 0 if the genomic window overlaps with 50% any other found in 
# that column
overlaps <- dcast(overlaps, genomic_window ~ input_column, 
                  fun.aggregate = length)

# Convert back to a data.frame:
overlaps <- as.data.frame(overlaps)
rownames(overlaps) <- overlaps$genomic_window
overlaps <- overlaps[,-1]

# Reorder so we can double check against the desired output:
overlaps <- overlaps[rownames(output),]
print(overlaps)

This will generate (almost) the same output you've provided:

              CT1 CT2 CT3
chr1:200-400    1   1   0
chr1:800-970    1   0   0
chr2:300-700    1   1   0
chr1:250-450    1   1   0
chr2:200-500    1   1   0
chr2:600-1000   0   1   1
chr1:400-800    0   0   1
chr1:700-870    0   0   1
chr2:700-1400   0   1   1

The only difference is that chr1:700-870 has a 0 in the CT2 column: this is because it actually doesn't overlap any of the genomic windows in CT2, the only other window on chromosome 1 was chr1:250-450.

Bauske answered 12/1, 2018 at 0:51 Comment(1)
Many thanks for such a detailed answer. Can you please help me making it generalized to an input column, rather than explicitly mentioning the column names. For example for the first part where we apply split_genomic_range can be modified as for (i in 1:length(colnames(input))) { assign(paste(colnames(input[i])), split_genomic_range(input[,i])) } But I am unable to do the same for get_overlapping_ranges function.Corium

© 2022 - 2024 — McMap. All rights reserved.