Parallel distance matrix in R
Currently I'm using the built-in function dist to calculate my distance matrix in R.

dist(featureVector,method="manhattan")

This is currently the bottleneck of the application, and therefore the idea is to parallelize this task (conceptually this should be possible).

Searching Google and this forum did not turn up anything.

Does anybody have an idea?

Sixgun answered 16/6, 2013 at 22:8 Comment(7)
could you provide an example featureVector?Adis
The feature vector is simply a data.frame with 100 columns and about 2000 rows. The columns are the values of the individual dimensions.Sixgun
It takes ~0.05 sec on my machine; how about you? I'm thinking anything parallel might have higher overhead. And just to make sure, your expected output is a 100-by-100 matrix, right?Groomsman
ok, I now realize you are looking for a 2000-by-2000 matrix. That one is taking 0.6 sec on my machine, so parallelization might be an option.Groomsman
My expected output is a 2000x2000 matrix, so it will need to do 2000*2000*100 = 400,000,000 operations, right? If there is such a discrepancy in speed, there is probably an error in my code, although I cannot see it at the moment.Sixgun
You should provide complete R code to create your input data and the output of a system.time call, to make it easier to reproduce and understand your problem.Binette
reproducible example with benchmark please...Binette
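For reference, a minimal reproducible setup along the lines the commenters ask for might look like this (the 2000 x 100 shape comes from the comments above; the seed, the random data, and the name featureVector are assumptions):

# 2000 rows x 100 numeric columns, matching the description in the comments
set.seed(42)
featureVector <- as.data.frame(matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100))
system.time(d <- dist(featureVector, method = "manhattan"))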

Here's the structure for one route you could go. It is not faster than just using the dist() function; in fact, it takes many times longer. It does process in parallel, but even if the computation time were reduced to zero, the time to start up the cluster and export the variables to it would probably exceed that of just calling dist().

library(parallel)

# 2000 observations with 100 features each, as in the question
vec.array <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)

# Manhattan (taxicab) distance from one observation to every row of the matrix
TaxiDistFun <- function(one.vec, whole.matrix) {
    diff.matrix <- t(t(whole.matrix) - one.vec)
    this.row <- apply(diff.matrix, 1, function(x) sum(abs(x)))
    return(this.row)
}

# one worker per core; the data and the function must be exported to the workers
cl <- makeCluster(detectCores())
clusterExport(cl, list("vec.array", "TaxiDistFun"))

# parRapply distributes the rows of vec.array across the workers
system.time(dist.array <- parRapply(cl, vec.array,
                        function(x) TaxiDistFun(x, vec.array)))

stopCluster(cl)

# the result comes back as one long vector; reshape it into the full matrix
dim(dist.array) <- c(2000, 2000)
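
If you need the result in the same form that dist() returns, the full symmetric matrix can be converted afterwards (a small addition to the answer above, not part of the original code):

# keep only the lower triangle as a "dist" object, matching the shape
# of dist(vec.array, method = "manhattan")
dist.obj <- as.dist(dist.array)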
Leonoreleonsis answered 1/8, 2013 at 22:0 Comment(0)

The R package amap provides robust and parallelized functions for clustering and principal component analysis. Among these functions, Dist offers what you are looking for: it computes and returns the distance matrix, doing the work in parallel.

Dist(x, method = "euclidean", nbproc = 8)

The code above computes the Euclidean distance using 8 threads.
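
For the Manhattan distance from the question, only the method argument changes (a sketch; featureVector stands for the 2000 x 100 data frame described in the question):

library(amap)

# Manhattan distance matrix on 8 threads; Dist() accepts a matrix or
# data frame and returns a "dist" object, like stats::dist()
d <- Dist(featureVector, method = "manhattan", nbproc = 8)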

Sylviasylviculture answered 10/9, 2014 at 14:8 Comment(5)
The amap::Dist function is a multi-threaded (parallelized) version of dist. I believe it's the best answer! Ref: inside-r.org/packages/cran/amap/docs/DistSylviasylviculture
I totally agree, this is the best answer!Courtmartial
Thank you for this answer. I can't, however, figure out whether amap's hcluster() will work with a distance matrix, or whether it absolutely requires raw data?Hoitytoity
@Hoitytoity From the manual, it only accepts the raw data, not a dist object; the distance matrix is calculated in parallel internally once you set the nbproc and method parameters (see the sketch after these comments). Ref inside-r.org/packages/cran/amap/docs/hclusterSylviasylviculture
Note that per the documentation, this package doesn't parallelize on WindowsSuggestion
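
A minimal sketch of the hcluster usage discussed in the comments above, assuming the same hypothetical featureVector as elsewhere in this thread:

library(amap)

# hcluster takes the raw data, computes the distance matrix internally
# (parallelized across nbproc threads), and returns an hclust object
hc <- hcluster(featureVector, method = "manhattan", nbproc = 8)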

You can also use the parDist function of the parallelDist package, which is specifically built for parallelized distance matrix computations. Its advantages are that the package is available on Mac OS, Windows and Linux and already supports 39 different distance measures (see the parDist documentation).

Performance comparison for Manhattan distance (system spec: Mac OS; Intel Core i7 with 4 cores @ 2.5 GHz and hyperthreading enabled):

library(parallelDist)
library(amap)
library(wordspace)
library(microbenchmark)

set.seed(123)
x <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)

microbenchmark(parDist(x, method = "manhattan"),
               Dist(x, method = "manhattan", nbproc = 8),
               dist.matrix(x, method = "manhattan"),
               times = 10)

Unit: milliseconds
                                      expr      min       lq     mean   median       uq      max neval
          parDist(x, method = "manhattan") 210.9478 214.3557 225.5894 221.3705 237.9829 247.0844    10
 Dist(x, method = "manhattan", nbproc = 8) 749.9397 755.7351 797.6349 812.6109 824.4075 844.1090    10
      dist.matrix(x, method = "manhattan") 256.0831 263.3273 279.0864 275.1882 296.3256 311.3821    10

With a larger matrix:

x <- matrix(rnorm(10000 * 100), nrow = 10000, ncol = 100)
microbenchmark(parDist(x, method = "manhattan"),
               Dist(x, method = "manhattan", nbproc = 8),
               dist.matrix(x, method = "manhattan"),
               times = 10)
Unit: seconds
                                      expr       min        lq      mean    median        uq       max neval
          parDist(x, method = "manhattan")  6.298234  6.388501  6.737168  6.894203  6.947981  7.221661    10
 Dist(x, method = "manhattan", nbproc = 8) 22.722947 24.113681 24.326157 24.477034 24.658145 25.301353    10
      dist.matrix(x, method = "manhattan")  7.156861  7.505229  7.544352  7.567980  7.655624  7.800530    10

Further performance comparisons can be found in parallelDist's vignette.
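
Applied to the question's setting, the basic call is short (a sketch; featureVector is the hypothetical data frame from the question):

library(parallelDist)

# parDist expects a numeric matrix and returns a "dist" object;
# by default it uses all available cores (cap with threads = n)
d <- parDist(as.matrix(featureVector), method = "manhattan")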

Dolley answered 28/6, 2017 at 22:20 Comment(0)

I am a Windows user looking for an efficient way to compute the distance matrix so I can use it in hierarchical clustering (with the hclust function from the stats package, for example). The Dist function doesn't work in parallel on Windows, so I had to look for something different, and I found the wordspace package by Stefan Evert, which contains the dist.matrix function. You can try this code:

X <- data.frame(replicate(1000,sample(0:1,5000,rep=TRUE)))
system.time(d <- dist(X, method = "manhattan"))
system.time(d2 <- as.dist( dist.matrix(as.matrix(X), method="manhattan") ))

As you can see, computing the distance matrix for a data frame with 1000 binary features and 5000 instances is much faster with dist.matrix.

These are the results in my laptop (i7-6500U):

> system.time(d <- dist(X, method = "manhattan"))
   user  system elapsed 
 151.79    0.04  152.59 
> system.time(d2 <- as.dist( dist.matrix(as.matrix(X), method="manhattan") ))
   user  system elapsed 
  19.19    0.22   19.56 

This solved my problem. Here you can check the original thread where I found it: http://r.789695.n4.nabble.com/Efficient-distance-calculation-on-big-matrix-td4633598.html

It doesn't solve it in parallel, but it is enough on many occasions.
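
Since the motivation was hierarchical clustering, the converted object plugs straight into hclust (a sketch following the variable names above; the linkage method is an arbitrary choice):

# d2 is already a "dist" object thanks to as.dist(), so it can be
# passed directly to stats::hclust
hc <- hclust(d2, method = "average")
plot(hc)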

Suffragist answered 4/10, 2016 at 15:19 Comment(1)
dist.matrix is indeed very fast, but does not work with NAs. Is there a solution which also handles NAs?Zymogenic

I am also working with somewhat large distance matrices and trying to speed up the computation. Will Benson above is likely to be correct when he says that "the time to start up the function and export the variables to the cluster would probably be longer than just using dist()".

However, I think this applies only to distance matrices of small to moderate size. See the example below, which uses the function Dist from the amap package with 10 processors, dist from the stats package, and rdist from the fields package, which calls a Fortran function. The first example creates a 400 x 400 distance matrix; the second creates a 3103 x 3103 distance matrix.

require(sp)
require(fields)
require(amap)
data(meuse.grid)
meuse.gridA <- meuse.grid[1:400, 1:2]
meuse.gridB <- meuse.grid[, 1:2]

# small distance matrix
a <- Sys.time()
invisible(dist(meuse.gridA, diag = TRUE, upper = TRUE))
Sys.time() - a
Time difference of 0.002138376 secs
a <- Sys.time()
invisible(Dist(meuse.gridA, nbproc = 10, diag = TRUE, upper = TRUE))
Sys.time() - a
Time difference of 0.005409241 secs
a <- Sys.time()
invisible(rdist(meuse.gridA))
Sys.time() - a
Time difference of 0.02312016 secs

# large distance matrix
a <- Sys.time()
invisible(dist(meuse.gridB, diag = TRUE, upper = TRUE))
Sys.time() - a
Time difference of 0.09845328 secs
a <- Sys.time()
invisible(Dist(meuse.gridB, nbproc = 10, diag = TRUE, upper = TRUE))
Sys.time() - a
Time difference of 0.05900002 secs
a <- Sys.time()
invisible(rdist(meuse.gridB))
Sys.time() - a
Time difference of 0.8928168 secs

Note how the computation time dropped from 0.09845328 secs to 0.05900002 secs using Dist compared to dist when the distance matrix was large (3103 x 3103). As such, I would suggest using the Dist function from the amap package, provided you have several processors available.

Scrapbook answered 23/11, 2014 at 12:58 Comment(0)

I've found parallelDist to be orders of magnitude faster than dist, and it chews up much less virtual memory in the process, on my Mac under Microsoft R Open 3.4.0. A word of warning, though: I've had no luck compiling it on R 3.3.3. It doesn't list the version of R as a dependency, but I suspect it is.

Eloyelreath answered 25/8, 2017 at 23:3 Comment(0)
