R - comparing two rows by columns and writing the result in a table

Asked 14/5, 2016 at 14:42 Answered 14/5, 2016 at 18:7

I'm an R newbie and probably the solution for my problem is very simple but it's out of my reach for now... I would like to compare rows in a data frame by columns. The data in each column is a letter (nucleotide base):

seq1 A C T G T
seq2 A C G G G
seq3 A G G C A
...

I'd like to compare every row in the data set with every other row, column by column. The result I want is simply 1 or 0 for TRUE or FALSE in each comparison, also written as a table. It would look like this:

seq1_seq2 1 1 0 1 0
seq1_seq3 1 0 0 0 0
seq2_seq3 1 0 1 0 0
...

My skills in R are too low to write something useful. However, I managed to find out that

ifelse(data[1,]==data[2,], 1, 0)

returns almost what I need, although without showing which rows are compared (no seq1_seq2 column). I would appreciate any help with this problem. A complete example solution would be ideal, but I will also be grateful for any suggestions on how to solve it.

Thank you in advance!

Harim answered 14/5, 2016 at 14:42 Comment(0)

Storing sequences in a data frame by rows is a bad idea, because a data frame is a list of columns and row-wise access is slow. You should store sequences by columns or, if you store them by rows, at least do it in a matrix rather than a data frame. Below I assume you use a matrix. You can convert a data frame to a matrix with the as.matrix function.
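As a minimal sketch of that conversion, assuming a data frame shaped like the one in the question (the name df and the column names V1 to V5 are made up here for illustration):

```r
# Hypothetical data frame matching the question's example
df <- data.frame(V1 = c("A", "A", "A"),
                 V2 = c("C", "C", "G"),
                 V3 = c("T", "G", "G"),
                 V4 = c("G", "G", "C"),
                 V5 = c("T", "G", "A"),
                 row.names = c("seq1", "seq2", "seq3"),
                 stringsAsFactors = FALSE)

# All columns are character, so the result is a character matrix;
# row names carry over from the data frame
a <- as.matrix(df)

# Comparing two matrix rows gives a plain logical vector, one entry per column
a["seq1", ] == a["seq2", ]
```

If the data frame mixed column types, as.matrix would coerce everything to a common type, so this only fits data that is uniformly character, as it is here.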

If you want to avoid loops, you should use combn for such tasks:

> a
     [,1] [,2] [,3] [,4] [,5]
seq1 "A"  "C"  "T"  "G"  "T" 
seq2 "A"  "C"  "G"  "G"  "G" 
seq3 "A"  "G"  "G"  "C"  "A" 

> compare = t(combn(nrow(a),2,FUN=function(x)a[x[1],]==a[x[2],]))
> rownames(compare) = combn(nrow(a),2,FUN=function(x)paste0("seq",x[1],"_seq",x[2]))

> compare
          [,1]  [,2]  [,3]  [,4]  [,5]
seq1_seq2 TRUE  TRUE FALSE  TRUE FALSE
seq1_seq3 TRUE FALSE FALSE FALSE FALSE
seq2_seq3 TRUE FALSE  TRUE FALSE FALSE

To transform booleans to integers (if you really need it):

storage.mode(compare) = "integer"
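The labels above are rebuilt from row indices, which assumes the rows are named seq1, seq2, and so on. A possible variant, sketched here and not part of the answer's own code, takes the labels from rownames(a) and does the comparison with one vectorized index instead of a function call per pair. The name pairs is introduced only for this example.

```r
a <- matrix(c("A", "C", "T", "G", "T",
              "A", "C", "G", "G", "G",
              "A", "G", "G", "C", "A"),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("seq1", "seq2", "seq3"), NULL))

# 2 x choose(n, 2) matrix; each column is one pair of row indices
pairs <- combn(nrow(a), 2)

# Stack the first and second member of every pair and compare elementwise
compare <- a[pairs[1, ], , drop = FALSE] == a[pairs[2, ], , drop = FALSE]

# Build labels from the actual row names rather than a hard-coded prefix
rownames(compare) <- paste(rownames(a)[pairs[1, ]],
                           rownames(a)[pairs[2, ]], sep = "_")

# Optional: store 1/0 instead of TRUE/FALSE
storage.mode(compare) <- "integer"
compare
```

This builds two temporary matrices with one row per pair, so it trades memory for fewer function calls; whether that is a good trade for very many sequences is something to measure.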

Technetium answered 14/5, 2016 at 15:36 Comment(4)

This solution is considerably faster even as it does the n-squared computation. So, I would go with this one instead of the double loop. – Laurinelaurita 14/5, 2016 at 16:3

@Laurinelaurita in terms of asymptotic speed, all three solutions are the same. If n is number of elements in a sequence (5 in the example), the computation is O(n). – Technetium 14/5, 2016 at 16:43

Yes, I understand that....it is n-squared operations. However, combn is executing the loop much more efficiently is the point I was trying to make. – Laurinelaurita 14/5, 2016 at 16:54

I would like to thank everyone for their responses! Guys, you are awesome! I have tested all solutions and marked user31264's response as the one solving the problem, since it was the fastest. Dominic's answer worked great on a small test data set. However, when I tried this on approx. 1800 sequences, Gopala's script consumed all of my 16 GB of RAM within 20 minutes, crashing the R session. User31264's solution was, as Gopala noticed, considerably faster. On the 1800-sequence data set it generated a result file of over 500 MB within 5 minutes. Again, big thanks to all who responded! – Harim 15/5, 2016 at 12:45

In this case, since you want all n-squared comparisons done, looping this way is one option:

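result <- list()
# seq_len() avoids the 1:0 trap when df has only one row
for (i in seq_len(nrow(df) - 1)) {
    for (j in (i + 1):nrow(df)) {
      # name each entry after the pair of rows it compares
      result[[paste(row.names(df)[i], row.names(df)[j], sep = '_')]] <- as.integer(df[i, ] == df[j, ])
    }
}
# stack the per-pair vectors into one table
as.data.frame(do.call(rbind, result))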

Resulting output will be as follows:

          V1 V2 V3 V4 V5
seq1_seq2  1  1  0  1  0
seq1_seq3  1  0  0  0  0
seq2_seq3  1  0  1  0  0

Of course, this will be very slow for larger data sets.
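Two likely contributors to the slowness are row indexing on a data frame and growing a list one element at a time. A rough sketch of the same double loop that converts to a matrix once and fills a preallocated result follows; the names m, n_pairs, labels and k are introduced only for this example, and no timings are claimed.

```r
df <- data.frame(V1 = c("A", "A", "A"), V2 = c("C", "C", "G"),
                 V3 = c("T", "G", "G"), V4 = c("G", "G", "C"),
                 V5 = c("T", "G", "A"),
                 row.names = c("seq1", "seq2", "seq3"),
                 stringsAsFactors = FALSE)

m <- as.matrix(df)               # convert once; matrix rows are cheap to index
n <- nrow(m)
n_pairs <- n * (n - 1) / 2       # number of unordered row pairs

# Preallocate the integer result and the row labels
result <- matrix(NA_integer_, nrow = n_pairs, ncol = ncol(m))
colnames(result) <- colnames(m)
labels <- character(n_pairs)

k <- 0L
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    k <- k + 1L
    result[k, ] <- m[i, ] == m[j, ]   # logical values are stored as 1L/0L
    labels[k] <- paste(rownames(m)[i], rownames(m)[j], sep = "_")
  }
}
rownames(result) <- labels
result
```

The result is still one row per pair, so its size grows quadratically with the number of sequences regardless of how the loop is written.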

Laurinelaurita answered 14/5, 2016 at 15:17 Comment(2)

for i in 1:(nrow(df)-1); for j in (i+1):nrow(df) – Technetium 14/5, 2016 at 15:23

I was just literally making that edit noticing my error in avoiding extra computations. – Laurinelaurita 14/5, 2016 at 15:25

A somewhat different approach than Gopala's... There's probably a simpler way to get there, but here it is:

options(stringsAsFactors = FALSE)
myData <- data.frame(n1=c("A","A","A"),n2=c("C","C","G"),
                     n3=c("T","G","G"),n4=c("G","G","C"),n5=c("T","G","A"))
rownames(myData) <- paste0("seq",1:3)

# Generate all combinations for comparisons
compar <- apply(combn(rownames(myData),2),2,paste0)

# Create a temporary list having pairs of rows
myList <- apply(compar, 2, function(r) myData[r,])
names(myList) <- apply(combn(rownames(myData),2),2,paste0,collapse="_")

# Compare the two rows for each element in the list
results <- t(sapply(myList, function(x) as.numeric(x[1,]==x[2,])))
colnames(results) <- colnames(myData)

results

          n1 n2 n3 n4 n5
seq1_seq2  1  1  0  1  0
seq1_seq3  1  0  0  0  0
seq2_seq3  1  0  1  0  0

Soddy answered 14/5, 2016 at 15:26 Comment(0)

You can use this code (it uses myData from @Dominic Comtois's answer):

m <- combn(nrow(myData),2)

result <- sapply(myData,function(C) {z=C[m];z[c(TRUE,FALSE)]==z[c(FALSE,TRUE)]})
#       n1    n2    n3    n4    n5
#[1,] TRUE  TRUE FALSE  TRUE FALSE
#[2,] TRUE FALSE FALSE FALSE FALSE
#[3,] TRUE FALSE  TRUE FALSE FALSE

How it works:

combn generates all possible pairs of row indices.
sapply loops over each column of myData.
For each column, indexing with m yields a vector laid out like the matrix m, with each row index replaced by that row's value in the column.
Odd elements of this vector hold the first row of each pair and even elements hold the second, so the recycled logical masks c(TRUE,FALSE) and c(FALSE,TRUE) pick out the two sides for comparison.
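To make these steps concrete, here is a small walk-through on a single column (n3), followed by one way to attach row labels to the full result. The names col_values, first and second are introduced only for this illustration.

```r
myData <- data.frame(n1 = c("A", "A", "A"), n2 = c("C", "C", "G"),
                     n3 = c("T", "G", "G"), n4 = c("G", "G", "C"),
                     n5 = c("T", "G", "A"),
                     row.names = paste0("seq", 1:3),
                     stringsAsFactors = FALSE)

m <- combn(nrow(myData), 2)    # columns are the pairs (1,2), (1,3), (2,3)

col_values <- myData$n3        # "T" "G" "G"
z <- col_values[m]             # m is read column by column: "T" "G" "T" "G" "G" "G"
first  <- z[c(TRUE, FALSE)]    # positions 1, 3, 5: first member of each pair
second <- z[c(FALSE, TRUE)]    # positions 2, 4, 6: second member of each pair
first == second                # FALSE FALSE TRUE

# Same idea applied to every column, with labels taken from the row names
result <- sapply(myData, function(col_values) {
  z <- col_values[m]
  z[c(TRUE, FALSE)] == z[c(FALSE, TRUE)]
})
rownames(result) <- paste(rownames(myData)[m[1, ]],
                          rownames(myData)[m[2, ]], sep = "_")
result
```

Note that with only two rows there is a single pair, and sapply would then return a vector instead of a matrix, so the rownames step would need adjusting in that case.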

Syncretism answered 14/5, 2016 at 18:7 Comment(0)
