Distance matrix to pairwise distance list in R
Is there an R package that produces a pairwise distance list when the input is a distance matrix? For example, if my input is a data.frame like this:

        A1      B1      C1      D1
 A1     0      0.85    0.45    0.96 
 B1            0       0.85    0.56
 C1                    0       0.45
 D1                            0

I want the output as:

A1  B1  0.85
A1  C1  0.45
A1  D1  0.96
B1  C1  0.85
B1  D1  0.56
C1  D1  0.45

I found a question that does the opposite using the 'reshape' package, but I could not adapt it to get what I wanted.

Lacuna answered 11/1, 2015 at 21:6 Comment(2)
Please post the output of dput(your-distance-object) so we are not guessing whether you are actually dealing with a data.frame, a matrix, a table, an actual distance matrix, or something else entirely. This would definitely influence the applicability of the answers presented so far. I ask this because your title says "distance matrix" (which is generally created using the dist function), but your question description says you're dealing with a data.frame. These are quite different. – Hampton
I'm also suspicious about this... distance matrices generated with dist print the lower triangle by default, not the upper triangle. And are your blank cells NA, or simply hidden (as with the print method for dist objects)? – Aventine

A couple of other options:

  1. Generate some data

    D <- dist(cbind(runif(4), runif(4)), diag=TRUE, upper=TRUE) # generate dummy data
    m <- as.matrix(D) # coerce dist object to a matrix
    dimnames(m) <- list(LETTERS[1:4], LETTERS[1:4]) # label rows and columns A-D
    
  2. Assuming you just want the distances for pairs defined by the upper triangle of the distance matrix, you can do:

    xy <- t(combn(colnames(m), 2))
    data.frame(xy, dist=m[xy])
    
    #  X1 X2      dist
    # 1 A  B 0.3157942
    # 2 A  C 0.5022090
    # 3 A  D 0.3139995
    # 4 B  C 0.1865181
    # 5 B  D 0.6297772
    # 6 C  D 0.8162084
    
  3. Alternatively, if you want distances for all pairs (in both directions):

    data.frame(col=colnames(m)[col(m)], row=rownames(m)[row(m)], dist=c(m))
    
    #    col row      dist
    # 1    A   A 0.0000000
    # 2    A   B 0.3157942
    # 3    A   C 0.5022090
    # 4    A   D 0.3139995
    # 5    B   A 0.3157942
    # 6    B   B 0.0000000
    # 7    B   C 0.1865181
    # 8    B   D 0.6297772
    # 9    C   A 0.5022090
    # 10   C   B 0.1865181
    # 11   C   C 0.0000000
    # 12   C   D 0.8162084
    # 13   D   A 0.3139995
    # 14   D   B 0.6297772
    # 15   D   C 0.8162084
    # 16   D   D 0.0000000
    

    Or use the following, which excludes any NA distances but doesn't keep the column/row names (though this would be easy to rectify, since we have the column/row indices):

    idx <- which(!is.na(m), arr.ind=TRUE, useNames=FALSE) # row/col indices of non-NA cells
    data.frame(idx, dist=m[idx])
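Step 2 relies on a base R feature that is easy to miss: indexing a matrix with a two-column character matrix treats each row as a (row name, column name) pair. A minimal sketch on a toy matrix (the values are illustrative only, not the answer's data):

```r
# Toy 3 x 3 matrix with row and column names (illustrative values)
toy <- matrix(1:9, nrow = 3,
              dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Each row of xy names one cell: ("A","B"), ("A","C"), ("B","C")
xy <- t(combn(colnames(toy), 2))

# Character-matrix indexing returns toy["A","B"], toy["A","C"], toy["B","C"]
vals <- toy[xy]
vals
```

This lookup depends on the matrix having dimnames that match the strings in xy, so it is worth checking rownames() and colnames() before applying it to other data.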
    
Aventine answered 12/1, 2015 at 4:54 Comment(5)
I get the following error message. Any idea why? Error in m[xy] : subscript out of bounds – Lacuna
@AnuragMishra When you run my code? Or when you apply it to your data? – Aventine
When I apply it to my data, which is a data frame. – Lacuna
@AnuragMishra Please edit your question and add the output of dput(d), where d is your data frame. If d is too large to include in this way, then provide a small subset of it for us to work with. – Aventine
I am using two columns from a data frame as the X and Y coordinates to find distances. dput() gives me the following: Size = 121L, Diag = TRUE, Upper = TRUE, method = "euclidean", call = dist(x = cbind(x$da1, x$da2), diag = TRUE, upper = TRUE), class = "dist"). Here x$da1 and x$da2 are my two columns from the data frame 'x'. Is this what you wanted? – Lacuna
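Since the last comment describes an object of class "dist" rather than a data frame, one possible approach is to read the pairs straight off the dist object. A dist object stores the lower triangle column by column, which is the same order in which combn() enumerates pairs. The sketch below uses random data as a hypothetical stand-in for the asker's x$da1 and x$da2, and the output column names are made up for illustration:

```r
# Hypothetical stand-in for the asker's data frame 'x'
set.seed(42)
x <- data.frame(da1 = runif(5), da2 = runif(5))
d <- dist(cbind(x$da1, x$da2))            # object of class "dist"

# Use the stored labels if present, otherwise fall back to 1..Size
labs <- attr(d, "Labels")
if (is.null(labs)) labs <- seq_len(attr(d, "Size"))

# combn() lists pairs in the same order as the values stored in d
pr <- t(combn(labs, 2))
pair_list <- data.frame(item1 = pr[, 1], item2 = pr[, 2], dist = as.vector(d))
head(pair_list)
```

This avoids building the full square matrix, which may matter for larger inputs.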

If you have a data.frame you could do something like:

df <- structure(list(A1 = c(0, 0, 0, 0), B1 = c(0.85, 0, 0, 0), C1 = c(0.45, 
0.85, 0, 0), D1 = c(0.96, 0.56, 0.45, 0)), .Names = c("A1", "B1", 
"C1", "D1"), row.names = c(NA, -4L), class = "data.frame")

data.frame( t(combn(names(df),2)), dist=t(df)[lower.tri(df)] )
  X1 X2 dist
1 A1 B1 0.85
2 A1 C1 0.45
3 A1 D1 0.96
4 B1 C1 0.85
5 B1 D1 0.56
6 C1 D1 0.45

Another approach, if you have it as a matrix with row and column names, is to use reshape2 directly:

mat <- structure(c(0, 0, 0, 0, 0.85, 0, 0, 0, 0.45, 0.85, 0, 0, 0.96, 
0.56, 0.45, 0), .Dim = c(4L, 4L), .Dimnames = list(c("A1", "B1", 
"C1", "D1"), c("A1", "B1", "C1", "D1")))

library(reshape2)
subset(melt(mat), value!=0)

   Var1 Var2 value
5    A1   B1  0.85
9    A1   C1  0.45
10   B1   C1  0.85
13   A1   D1  0.96
14   B1   D1  0.56
15   C1   D1  0.45
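Note that filtering on value != 0 would also drop any pair of distinct items whose distance happens to be exactly zero. A sketch of an alternative that selects cells by position with upper.tri() instead, using the same matrix as above (the output column names are chosen here to mirror melt()):

```r
mat <- structure(c(0, 0, 0, 0, 0.85, 0, 0, 0, 0.45, 0.85, 0, 0, 0.96,
0.56, 0.45, 0), .Dim = c(4L, 4L), .Dimnames = list(c("A1", "B1",
"C1", "D1"), c("A1", "B1", "C1", "D1")))

# Logical mask of the cells strictly above the diagonal
keep <- upper.tri(mat)

# Look up the row and column name of every kept cell
upper_pairs <- data.frame(
  Var1  = rownames(mat)[row(mat)[keep]],
  Var2  = colnames(mat)[col(mat)[keep]],
  value = mat[keep]
)
upper_pairs
```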
Kirt answered 11/1, 2015 at 21:43 Comment(0)

I suppose you have a contingency table or a matrix defined as follows:

mat = matrix(c(0, 0.85, 0.45, 0.96, NA, 0, 0.85, 0.56, NA, NA, 0, 0.45, NA,NA,NA,0), ncol=4)
cont = as.table(t(mat))

#     A    B    C    D
#A 0.00 0.85 0.45 0.96
#B      0.00 0.85 0.56
#C           0.00 0.45
#D                0.00

Then you simply need to convert it to a data.frame and remove the NAs and 0s:

df = as.data.frame(cont)
df = df[complete.cases(df),]
df[df[,3]!=0,]

#   Var1 Var2 Freq
#5     A    B 0.85
#9     A    C 0.45
#10    B    C 0.85
#13    A    D 0.96
#14    B    D 0.56
#15    C    D 0.45
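A variation on the same idea, sketched under the assumption that the diagonal is the only part you want to discard: drop rows where the two labels are equal rather than rows where the value is 0, so genuine zero distances between different items survive. The small matrix and the "dist" column name below are illustrative:

```r
# Illustrative 3 x 3 upper-triangular matrix with NA below the diagonal
m3 <- matrix(c(0, NA, NA, 0.85, 0, NA, 0.45, 0.85, 0), nrow = 3,
             dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# as.data.frame() on a table gives one row per cell; responseName sets the value column
long <- as.data.frame(as.table(m3), responseName = "dist")

# Keep non-missing cells that are off the diagonal
long <- long[!is.na(long$dist) &
             as.character(long$Var1) != as.character(long$Var2), ]
long
```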
Granule answered 11/1, 2015 at 21:49 Comment(0)

Tidymodels Answer

This is exactly the type of thing that the broom package excels at. It is a tidymodels package.

Borrowing the dummy data from jbaums' answer:

D <- dist(cbind(runif(4), runif(4))) # generate dummy data

This is a single function call.

library(broom)
tidy(D)

Which returns

# A tibble: 6 x 3
  item1 item2 distance
  <fct> <fct>    <dbl>
1 1     2        0.702
2 1     3        0.270
3 1     4        0.292
4 2     3        0.960
5 2     4        0.660
6 3     4        0.510

Note that it also works for different values of diag and upper:

tidy(dist(cbind(runif(4), runif(4)), diag=TRUE, upper=TRUE))
tidy(dist(cbind(runif(4), runif(4)), diag=FALSE, upper=TRUE))
tidy(dist(cbind(runif(4), runif(4)), diag=TRUE, upper=FALSE))
Umbles answered 10/5, 2022 at 22:5 Comment(0)

Here is an example using the spaa package.

exampleInput <- structure(list(A1 = c(0, 0, 0, 0), B1 = c(0.85, 0, 0, 0), 
C1 = c(0.45, 0.85, 0, 0), D1 = c(0.96, 0.56, 0.45, 0)), 
.Names = c("A1", "B1", "C1", "D1"), row.names = c(NA, -4L), class = "data.frame")

library(spaa)
pairlist <- dist2list(as.dist(t(exampleInput)))
pairlist[as.numeric(pairlist$col) > as.numeric(pairlist$row),]

Output:

   col row value
2   B1  A1  0.85
3   C1  A1  0.45
4   D1  A1  0.96
7   C1  B1  0.85
8   D1  B1  0.56
12  D1  C1  0.45
Cartagena answered 11/1, 2015 at 21:35 Comment(0)
