R Interclass distance matrix
Asked Answered
V

2

2

This question is sort of a follow-up to how to extract intragroup and intergroup distances from a distance matrix? in R. In that question, they first computed the distance matrix for all points, and then simply extracted the inter-class distance matrix. I have a situation where I'd like to bypass the initial computation and skip right to extraction, i.e. I want to directly compute the inter-class distance matrix. Drawing from the linked example, with tweaks, let's say I have some data in a dataframe called df:

values<-c(0.002,0.3,0.4,0.005,0.6,0.2,0.001,0.002,0.3,0.01)
class<-c("A","A","A","B","B","B","B","A","B","A")
df<-data.frame(values, class)

What I'd like is a distance matrix:

    1    2    3    8   10
4 .003 .295 .395 .003 .005
5 .598 .300 .200 .598 .590
6 .198 .100 .200 .198 .190
7 .001 .299 .399 .001 .009
9 .298 .000 .100 .298 .290

Does there already exist in R an elegant and fast way to do this?

EDIT After receiving a good solution for the 1D case above, I thought of a bonus question: what about a higher-dimensional case, say if instead df looks like this:

values1<-c(0.002,0.3,0.4,0.005,0.6,0.2,0.001,0.002,0.3,0.01)
values2<-c(0.001,0.1,0.1,0.001,0.1,0.1,0.001,0.001,0.1,0.01)
class<-c("A","A","A","B","B","B","B","A","B","A")
df<-data.frame(values1, values2, class)

And I'm interested in again getting a matrix of the Euclidean distance between points in class B with points in class A.

Viscera answered 22/8, 2016 at 23:30 Comment(0)
S
3

For general n-dimensional Euclidean distance, we can exploit the equation (not R, but algebra):

square_dist(b,a) = sum_i(b[i]*b[i]) + sum_i(a[i]*a[i]) - 2*inner_prod(b,a)

where the sums are over the dimensions of vectors a and b for i=[1,n]. Here, a and b are one pair from A and B. The key here is that this equation can be written as a matrix equation for all pairs in A and B.

In code:

## First split the data with respect to the class
n <- 2   ## the number of dimensions, for this example is 2
tmp <- split(df[,1:n], df$class)

d <- sqrt(matrix(rowSums(expand.grid(rowSums(tmp$B*tmp$B),rowSums(tmp$A*tmp$A))),
                 nrow=nrow(tmp$B)) - 
          2. * as.matrix(tmp$B) %*% t(as.matrix(tmp$A)))

Notes:

  1. The inner rowSums compute sum_i(b[i]*b[i]) and sum_i(a[i]*a[i]) for each b in B and a in A, respectively.
  2. expand.grid then generates all pairs between B and A.
  3. The outer rowSums computes the sum_i(b[i]*b[i]) + sum_i(a[i]*a[i]) for all these pairs.
  4. This result is then reshaped into a matrix. Note that the number of rows of this matrix is the number of points of class B as you requested.
  5. Then subtract two times the inner product of all pairs. This inner product can be written as a matrix multiply tmp$B %*% t(tmp$A) where I left out the coercion to matrix for clarity.
  6. Finally, take the square root.

Using this code with your data:

print(d)
##          1         2         3         8         10
##4 0.0030000 0.3111688 0.4072174 0.0030000 0.01029563
##5 0.6061394 0.3000000 0.2000000 0.6061394 0.59682493
##6 0.2213707 0.1000000 0.2000000 0.2213707 0.21023796
##7 0.0010000 0.3149635 0.4110985 0.0010000 0.01272792
##9 0.3140143 0.0000000 0.1000000 0.3140143 0.30364453

Note that this code will work for any n > 1. We can recover your previous 1-d result by setting n to 1 and not perform the inner rowSums (because there is now only one column in tmp$A and tmp$B):

n <- 1   ## the number of dimensions, set this now to 1
tmp <- split(df[,1:n], df$class)

d <- sqrt(matrix(rowSums(expand.grid(tmp$B*tmp$B,tmp$A*tmp$A)),
                 nrow=length(tmp$B)) - 
          2. * as.matrix(tmp$B) %*% t(as.matrix(tmp$A)))
print(d)
##      [,1]  [,2]  [,3]  [,4]  [,5]
##[1,] 0.003 0.295 0.395 0.003 0.005
##[2,] 0.598 0.300 0.200 0.598 0.590
##[3,] 0.198 0.100 0.200 0.198 0.190
##[4,] 0.001 0.299 0.399 0.001 0.009
##[5,] 0.298 0.000 0.100 0.298 0.290
Swabber answered 23/8, 2016 at 3:7 Comment(3)
Thank you for your incredibly detailed response! This works perfectly.Viscera
This works, but I'm finding a problem: with a much larger dataset of 1000 points in each class, I'm getting NaN's in d. I'm checking the entries of tmp$A and tmp$B using a <- which(is.nan(d), arr.ind=T), and then doing, for instance tmp$A[a[1],] and tmp$A[a[nrow(a) + 1],] and I find that the NaN's seem to occur when the vectors from both classes are identical. Is this a precision thing? Sorry I can't provide a concrete example since my dataset is too big.Viscera
It could be. The only cause for NaNs in the computation is if the square distance turns out to be negative, which can be due to precision. To check, remove the square root and see if the resulting matrix have negative numbers. These should be small. If that is the problem, then simply threshold to zero before taking the square root. If that is not the problem, let me know.Swabber
A
2

Here's an attempt via generating each combination and then simply taking the difference from each value:

abs(matrix(Reduce(`-`, expand.grid(split(df$values, df$class))), nrow=5, byrow=TRUE))
#      [,1]  [,2]  [,3]  [,4]  [,5]
#[1,] 0.003 0.295 0.395 0.003 0.005
#[2,] 0.598 0.300 0.200 0.598 0.590
#[3,] 0.198 0.100 0.200 0.198 0.190
#[4,] 0.001 0.299 0.399 0.001 0.009
#[5,] 0.298 0.000 0.100 0.298 0.290
Area answered 22/8, 2016 at 23:46 Comment(3)
This is perfect! A simple question: what are the possible types of functions that Reduce() can take? For instance, is it possible to take an n-dimensional Euclidian distance, not just 1-D distance (and here that's kind of hacked by taking the absolute value of an absolute difference), in the example above there's only one column of values, but what if I were to have three different values columns, say value1, value2 and value3?Viscera
@Viscera - I don't think this is really extensible to >1D distances unfortunately. I'd have to give it a lot more thought.Area
No problem, it's my fault for not having posted a question about higher-dimensional distance. I'll edit my question accordingly, and see if more solutions come up. Thanks for your 1D solution!Viscera

© 2022 - 2024 — McMap. All rights reserved.