How do I manipulate/access elements of an instance of "dist" class using core R?

A basic/common class in R is called "dist", and is a relatively efficient representation of a symmetric distance matrix. Unlike a "matrix" object, however, there does not seem to be support for manipulating a "dist" instance by index pairs using the "[" operator.

For example, the following code returns nothing, NULL, or an error:

# First, create an example dist object from a matrix
mat1  <- matrix(1:100, 10, 10)
rownames(mat1) <- 1:10
colnames(mat1) <- 1:10
dist1 <- as.dist(mat1)
# Now try to access index features, or index values
names(dist1)
rownames(dist1)
row.names(dist1)
colnames(dist1)
col.names(dist1)
dist1[1, 2]

Meanwhile, the following commands do work, in some sense, but do not make it any easier to access/manipulate particular index-pair values:

dist1[1] # R thinks of it as a vector, not a matrix?
attributes(dist1)
attributes(dist1)$Diag <- FALSE
mat2 <- as(dist1, "matrix")
mat2[1, 2] <- 0

A workaround -- that I want to avoid -- is to first convert the "dist" object to a "matrix", manipulate that matrix, and then convert it back to "dist". In other words, this is not a question about how to convert a "dist" instance into a "matrix", or some other class where common matrix-indexing tools are already defined; that has been answered in several ways in a different SO question.

Are there tools in the stats package (or perhaps some other core R package) dedicated to indexing/accessing elements of an instance of "dist"?

Limousin answered 26/3, 2012 at 20:48 Comment(1)
Good Q. Don't have an answer for you, but note that in R a matrix is just a vector with dimensions. So it's unsurprising that dist1[1:20] and dist1[5] <- 100 and so forth work properly. With a little trouble, you could probably write a two-dimensional version, although my familiarity with atomics is limited. - Calipash
B
7

I don't have a straight answer to your question, but if you are using the Euclidean distance, have a look at the rdist function from the fields package. Its implementation (in Fortran) is faster than dist, and the output is of class matrix. At the very least, it shows that some developers have chosen to move away from this dist class, maybe for the exact reason you are mentioning. If you are concerned that using a full matrix for storing a symmetric matrix is an inefficient use of memory, you could convert it to a triangular matrix.

library("fields")
points <- matrix(runif(1000*100), nrow=1000, ncol=100)

system.time(dist1 <- dist(points))
#    user  system elapsed 
#   7.277   0.000   7.338 

system.time(dist2 <- rdist(points))
#   user  system elapsed 
#  2.756   0.060   2.851 

class(dist2)
# [1] "matrix"
dim(dist2)
# [1] 1000 1000
dist2[1:3, 1:3]
#              [,1]         [,2]         [,3]
# [1,] 0.0000000001 3.9529674733 3.8051198575
# [2,] 3.9529674733 0.0000000001 3.6552146293
# [3,] 3.8051198575 3.6552146293 0.0000000001
Bellman answered 27/3, 2012 at 0:2 Comment(1)
Thanks! This is useful to know about. And it's helpful to know that the basic "dist"-handling tools in R are rather spartan. - Thereupon
S
10

There aren't standard ways of doing this, unfortunately. Here are two functions that convert between the 1D index and the 2D matrix coordinates. They aren't pretty, but they work, and at least you can use the code to make something nicer if you need it. I'm posting it just because the equations aren't obvious.

distdex<-function(i,j,n) #given row, column, and n, return index
    n*(i-1) - i*(i-1)/2 + j-i

rowcol<-function(ix,n) { #given index, return row and column
    nr=ceiling(n-(1+sqrt(1+4*(n^2-n-2*ix)))/2)
    nc=n-(2*n-nr+1)*nr/2+ix+nr
    cbind(nr,nc)
}

A little test harness to show it works:

dist(rnorm(20))->testd
as.matrix(testd)[7,13]   #row<col
distdex(7,13,20) # =105
testd[105]   #same as above

testd[c(42,119)]
rowcol(c(42,119),20)  # = (3,8) and (8,15)
as.matrix(testd)[3,8]
as.matrix(testd)[8,15]
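
As a hedged sanity check on the two helpers above (this is not part of the original answer), the sketch below repeats them so that it runs on its own, then verifies every position of a 20-point dist object against as.matrix(). The names n, m, ix, rc, roundtrip_ok and values_ok are my own, and the full matrix is built only as a reference.

```r
# Same helpers as in the answer, repeated so the block is self-contained.
distdex <- function(i, j, n) n*(i-1) - i*(i-1)/2 + j-i   # valid for i < j only
rowcol <- function(ix, n) {
    nr <- ceiling(n - (1 + sqrt(1 + 4*(n^2 - n - 2*ix)))/2)
    nc <- n - (2*n - nr + 1)*nr/2 + ix + nr
    cbind(nr, nc)
}

set.seed(1)                  # arbitrary seed, only for reproducibility
n     <- 20
testd <- dist(rnorm(n))
m     <- as.matrix(testd)    # full matrix, used only as a reference

ix <- seq_along(testd)       # every position 1 .. n*(n-1)/2
rc <- rowcol(ix, n)          # two-column matrix of (row, col) pairs

# index -> (row, col) -> index should give back the same positions
roundtrip_ok <- all(distdex(rc[, 1], rc[, 2], n) == ix)
# the value stored at each position should match the matrix entry
values_ok    <- all(as.vector(testd) == m[rc])
```

Both flags should come out TRUE if the formulas hold for every pair with row < col.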
Sorus answered 28/9, 2012 at 16:5 Comment(2)
This is a mostly useful answer, but given the intended application it requires some clarification. It only works if i < j. For i = j and i > j it returns the wrong answer. Modifying the distdex function to return 0 when i == j and to transpose i and j when i > j solves the problem. I put the code in my response below so others could just copy and paste. To be clear, this is only an issue with non-standard queries of the distance matrix, so it is not a dig on Christian A's answer, just a clarification. - Hobson
Here is a much better implementation: R - How to get row & column subscripts of matched elements from a distance matrix. - System
S
8

as.matrix(d) will turn the dist object d into a matrix, while as.dist(m) will turn the matrix m back into a dist object. Note that the latter doesn't actually check that m is a valid distance matrix; it just extracts the lower triangular part.
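
To illustrate the caveat about as.dist not validating its input, here is a small sketch with a deliberately asymmetric matrix. The object names m, d and back are only for illustration.

```r
m <- matrix(1:9, nrow = 3)   # asymmetric on purpose: m[1, 2] is 4, m[2, 1] is 2
d <- as.dist(m)              # keeps only the lower triangle: 2, 3, 6
back <- as.matrix(d)         # symmetric again, with zeros on the diagonal

back[1, 2]                   # taken from the lower triangle, so not the original 4
```

The upper triangle and the diagonal of m are silently dropped, so a round trip through as.dist is only lossless for a symmetric matrix with a zero diagonal.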

Symphony answered 27/3, 2012 at 1:10 Comment(3)
This answer was already mentioned at the end of my question (last paragraph, and as(dist1, "matrix") in the example code). I am wondering if there is an "in place" solution -- separate from this workaround -- supported by the "dist" class in R. Thanks for the comment about as.dist not checking on the validity of the dist instance. - Thereupon
I'm not seeing the distinction. The "in place" solution for matrix-style indexing is to use as.matrix on a dist object: the as.matrix generic calls stats:::as.matrix.dist, which is the method for the dist class. - Symphony
I attempted to clarify the end of the question. I think the confusion stems from the ambiguity of the phrase "in place", which I suppose could refer to either (1) any tools already in R, including approaches that are undesirable for some reason, like the class flip-flop that I want to avoid; or (2) a tool dedicated for this purpose and class that will not change the class of my object to complete the task, even temporarily. - Thereupon
T
4

You can access the attributes of any object with str().

For a "dist" object of some of my data (dist1), it looks like this:

> str(dist1)
Class 'dist'  atomic [1:4560] 7.3 7.43 7.97 7.74 7.55 ...
  ..- attr(*, "Size")= int 96
  ..- attr(*, "Labels")= chr [1:96] "1" "2" "3" "4" ...
  ..- attr(*, "Diag")= logi FALSE
  ..- attr(*, "Upper")= logi FALSE
  ..- attr(*, "method")= chr "euclidean"
  ..- attr(*, "call")= language dist(x = dist1) 

You can see that for this particular data set, the "Labels" attribute is a character vector of length 96 holding the numbers 1 to 96 as strings.

You could change that character vector directly by doing:

> attr(dist1,"Labels") <- your.labels

"your.labels" should be some id or factor vector, presumably in the original data from which the "dist" object was made.
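
A minimal sketch of that relabelling, assuming a small made-up data set; new_labels is a hypothetical vector with one entry per observation.

```r
d <- dist(matrix(1:12, nrow = 4))      # 4 observations, no row names
new_labels <- c("a", "b", "c", "d")    # hypothetical labels, one per observation
attr(d, "Labels") <- new_labels        # change the attribute in place

dm <- as.matrix(d)                     # the labels become the dimnames
rownames(dm)
```

The class and the stored distances are untouched; only the "Labels" attribute changes.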

Twohanded answered 9/6, 2015 at 4:13 Comment(0)
N
1

It seems dist objects are treated pretty much the same way as simple vector objects. As far as I can see it's a vector with attributes. So to get the values out:

x = as.vector(distobject)

See ?dist for a formula to extract the distance between a specific pair of objects using their indices.

Noami answered 13/5, 2013 at 14:29 Comment(2)
This (coercion) was already described in other answers and was noted in the question as something I'm trying to avoid. Would also prefer some code that attempts at least some type of extraction or assignment in matrix style "[" notation to be a valuable answer. - Thereupon
as.vector is about 30x faster than as.matrix for me. - Correll
L
1

You may find this useful [from ?dist]:

The lower triangle of the distance matrix stored by columns in a vector, say ‘do’. If ‘n’ is the number of observations, i.e., ‘n <- attr(do, "Size")’, then for i < j <= n, the dissimilarity between (row) i and j is ‘do[n*(i-1) - i*(i-1)/2 + j-i]’. The length of the vector is n*(n-1)/2, i.e., of order n^2.
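
To make the storage order concrete, here is a small sketch that builds a 4-point "dist" object by hand, with each value equal to its own position in the vector. Constructing the object with structure() and only a "Size" attribute is my assumption of a minimal valid object, not something taken from the documentation quote.

```r
n   <- 4L
pos <- structure(1:6, Size = n, class = "dist")   # values 1..6 are their own positions
layout <- as.matrix(pos)                          # shows which pair each position maps to

i <- 2; j <- 4
k <- n*(i-1) - i*(i-1)/2 + j-i                    # formula from ?dist, requires i < j
```

Printing layout shows the lower triangle filled column by column, and layout[j, i] agrees with k.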

Landmeier answered 1/5, 2014 at 14:25 Comment(1)
My partial answer has included that formula for a long time. Checking the doc at ?dist was the first thing I did, long before posting this question to SO. - Thereupon
H
1

This response is really just an extended follow-on to Christian A's earlier response. It is warranted because some readers of the question (myself included) may query the dist object as if it were symmetric, asking not just for (7,13) as below but also for (13,7). I don't have edit privileges, and the earlier answer was correct as long as the user was treating the dist object as a dist object and not a sparse matrix, which is why I have a separate response rather than an edit. Vote up Christian A for doing the heavy lifting if this answer is useful. The original answer with my edits pasted in:

distdex<-function(i,j,n) #given row, column, and n, return index
    n*(i-1) - i*(i-1)/2 + j-i

rowcol<-function(ix,n) { #given index, return row and column
    nr=ceiling(n-(1+sqrt(1+4*(n^2-n-2*ix)))/2)
    nc=n-(2*n-nr+1)*nr/2+ix+nr
    cbind(nr,nc)
}
#A little test harness to show it works:

dist(rnorm(20))->testd
as.matrix(testd)[7,13]   #row<col
distdex(7,13,20) # =105
testd[105]   #same as above

But...

distdex(13,7,20) # =156
testd[156]   #the wrong answer

Christian A's function only works if i < j. For i = j and i > j it returns the wrong answer. Modifying the distdex function to return 0 when i == j and to transpose i and j when i > j solves the problem so:

distdex2<-function(i,j,n){ #given row, column, and n, return index
  if(i==j){0
  }else if(i > j){
    n*(j-1) - j*(j-1)/2 + i-j
  }else{
    n*(i-1) - i*(i-1)/2 + j-i  
  }
}

as.matrix(testd)[7,13]   #row<col
distdex2(7,13,20) # =105
testd[105]   #same as above
distdex2(13,7,20) # =105
testd[105]   #the same answer
Hobson answered 4/6, 2015 at 15:41 Comment(1)
Here is a much better implementation: R - How to get row & column subscripts of matched elements from a distance matrix. - System
X
0

You could do this:

d <- function(distance, selection){
  eval(parse(text = paste("as.matrix(distance)[",
               selection, "]")))
}

`d<-` <- function(distance, selection, value){
  m <- as.matrix(distance)  # work on a full-matrix copy
  eval(parse(text = paste("m[", selection, "] <- value")))
  as.dist(m)                # only the lower triangle is kept
}

Which would allow you to do this:

 mat <- matrix(1:12, nrow=4)
 mat.d <- dist(mat)
 mat.d
        1   2   3
    2 1.7        
    3 3.5 1.7    
    4 5.2 3.5 1.7

 d(mat.d, "3, 2")
    [1] 1.7
 d(mat.d, "3, 2") <- 200
 mat.d
          1     2     3
    2   1.7            
    3   3.5 200.0      
    4   5.2   3.5   1.7

However, any changes you make to the diagonal or upper triangle are ignored. That may or may not be the right thing to do. If it isn't, you'll need to add some kind of sanity check or appropriate handling for those cases. And probably others.

Xuthus answered 28/3, 2012 at 18:41 Comment(2)
Thanks Tyler. This does still seem to be a class flip-flop (albeit clever and useful), which may also have the potential to kill some of the potentially useful attributes in your original "dist" instance, like $call. I'm curious what you think of the answer I also just posted below, which includes a working accessor that doesn't modify the class, as well as a non-working replacement function that I haven't solved yet. - Thereupon
@Paul, your solution looks good, although it returns the wrong value for diagonals for some reason. I don't know why the replacement function doesn't work. - Xuthus
L
0

There do not seem to be tools in the stats package for this. Thanks to @flodel for an alternative implementation in a non-core package.

I dug into the definition of the "dist" class in the core R source, which is old-school S3 with no tools in the dist.R source file like what I'm asking about in this question.

The documentation of the dist() function does point out, usefully, that (and I quote):

The lower triangle of the distance matrix stored by columns in a vector, say do. If n is the number of observations, i.e., n <- attr(do, "Size"), then for i < j ≤ n, the dissimilarity between (row) i and j is:

do[n*(i-1) - i*(i-1)/2 + j-i]

The length of the vector is n*(n-1)/2, i.e., of order n^2.

(end quote)

I took advantage of this in the following example code for a define-yourself "dist" accessor. Note that this example can only return one value at a time.

################################################################################
# Define dist accessor
################################################################################
setOldClass("dist")
getDistIndex <- function(x, i, j){
    n <- attr(x, "Size")
    if( is.character(i) ){ i <- which(i[1] == attr(x, "Labels")) }
    if( is.character(j) ){ j <- which(j[1] == attr(x, "Labels")) }
    # switch indices (symmetric) if i is bigger than j
    if( i > j ){
        i0 <- i
        i  <- j
        j  <- i0
    }
    # for i < j <= n
    return( n*(i-1) - i*(i-1)/2 + j-i )
}
# Define the accessor
"[.dist" <- function(x, i, j, ...){
    x[[getDistIndex(x, i, j)]]
}
################################################################################

And this seems to work fine, as expected. However, I'm having trouble getting the replacement function to work.

################################################################################
# Define the replacement function
################################################################################
"[.dist<-" <- function(x, i, j, value){
    x[[get.dist.index(x, i, j)]] <- value
    return(x)
}
################################################################################

A test-run of this new assignment operator

dist1["5", "3"] <- 7000

Returns:

"R> Error in dist1["5", "3"] <- 7000 : incorrect number of subscripts on matrix"
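
Two things in the attempt above look like likely causes, though I have not checked this against the author's session: S3 replacement methods for "[" are named `[<-.dist` rather than `[.dist<-`, and the body calls get.dist.index while the helper defined earlier is getDistIndex. A minimal sketch under those assumptions follows; dist_index is a hypothetical stand-in for the helper, and it refuses diagonal pairs because they are not stored.

```r
# Hypothetical helper: position of the pair (i, j) in the underlying vector.
dist_index <- function(x, i, j) {
    n <- attr(x, "Size")
    if (is.character(i)) i <- match(i, attr(x, "Labels"))
    if (is.character(j)) j <- match(j, attr(x, "Labels"))
    stopifnot(length(i) == 1, length(j) == 1, i != j)   # diagonal is not stored
    lo <- min(i, j); hi <- max(i, j)                    # symmetric, so order the pair
    n*(lo-1) - lo*(lo-1)/2 + hi - lo
}

# S3 replacement methods for `[` are named `[<-.<class>`.
`[<-.dist` <- function(x, i, j, value) {
    k  <- if (missing(j)) i else dist_index(x, i, j)    # one index: plain vector position
    cl <- oldClass(x)
    x  <- unclass(x)                                    # avoid dispatching on "dist" again
    x[k] <- value
    class(x) <- cl
    x
}

mat1 <- matrix(1:100, 10, 10,
               dimnames = list(as.character(1:10), as.character(1:10)))
dist1 <- as.dist(mat1)
dist1["5", "3"] <- 7000
as.matrix(dist1)["3", "5"]
```

This handles one pair at a time, like the accessor above, and a call with both indices empty (dist1[] <- value) is not covered by this sketch.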

As asked, I think @flodel answered the question better, but I still thought this "answer" might also be useful.

I also found some nice S4 examples of square-bracket accessor and replacement definitions in the Matrix package, which could be adapted from this current example pretty easily.

Limousin answered 28/3, 2012 at 18:43 Comment(0)
S
0

Converting to a matrix was also out of the question for me, because the resulting matrix would be 35K by 35K, so I left it as a vector (result of dist) and wrote a function to find the place in the vector where the distance should be:

distXY <- function(X,Y,n){
  A=min(X,Y)
  B=max(X,Y)

  # (A-1)*n - (1 + 2 + ... + (A-1)) + B-A, with the sum written in closed form
  d=(A-1)*n - A*(A-1)/2 + B-A

  return(d)

}

Here X and Y are the row numbers of the two elements in the original matrix from which you calculated dist, and n is the total number of rows (elements) in that matrix. The result is the position in the dist vector where the distance is stored. I hope it makes sense.

Simarouba answered 29/5, 2015 at 13:38 Comment(0)
C
0

The disto package provides a class that wraps distance matrices in R (in-memory and out-of-core) and provides much more than convenience operators like [. Please check the vignette here.

PS: I am the author of the package.

Cornstalk answered 2/3, 2019 at 19:51 Comment(0)
A
0

Here's my practical solution for getting values out of a dist object by name. Want to get item 9 as a vector of values?

as.matrix(mat1)[grepl("9", labels(mat1))]
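
If building the full matrix is too costly, the same row can be read straight from the dist vector using the index formula from ?dist. The sketch below is only an illustration: dist_row is a hypothetical helper, and it selects by position rather than by a pattern, since a pattern like "9" would also match labels such as "19".

```r
# Hypothetical helper: all distances from observation k to the others,
# read from the dist vector without building an n-by-n matrix.
dist_row <- function(d, k) {
    n      <- attr(d, "Size")
    others <- setdiff(seq_len(n), k)
    lo <- pmin(k, others); hi <- pmax(k, others)   # order each pair so lo < hi
    out <- as.vector(d)[n*(lo-1) - lo*(lo-1)/2 + hi - lo]
    lab <- attr(d, "Labels")
    if (!is.null(lab)) names(out) <- lab[others]
    out
}

set.seed(1)                                # arbitrary seed, for reproducibility
d  <- dist(matrix(rnorm(30), nrow = 10))   # 10 observations, no labels
r9 <- dist_row(d, 9)                       # distances from item 9 to items 1..8 and 10
```

The result should match as.matrix(d)[9, -9] while touching only n - 1 elements of the vector.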
Albatross answered 26/10, 2019 at 17:41 Comment(0)
M
0

If you want to change only the distance values (not the attributes) in a dist object, you can replace all the values at once by running

odo[]<-ndo[]

where odo is the original dist object, and ndo is the new dist object, created by coercing a (square) matrix into a dist object using as.dist.
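
A small sketch of this, assuming made-up data; it also shows one reason to prefer it over a plain as.dist round trip, namely that the attributes of odo (for example "method") survive.

```r
odo <- dist(matrix(1:12, nrow = 4))   # original object, carries a "method" attribute
m   <- as.matrix(odo)
m[3, 2] <- 200                        # edit a lower-triangle entry
ndo <- as.dist(m)                     # new dist object built from the matrix

odo[] <- ndo[]                        # overwrite the values, keep odo's attributes
```

After the assignment, odo holds the edited distances but still reports attr(odo, "method") as "euclidean".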

Milena answered 20/9, 2022 at 19:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.