Memory-efficient method to create dist object from distance matrix
Asked Answered
N

2

4

I'm trying to create dist objects from large distance matrices. I'm running out of memory using stats::as.dist. For example, I have around 128 Gb available on current machine but as.dist runs out of memory when processing a 73,000 x 73,000 matrix (approx 42Gb). Given that the final dist object should be less than half the size of the matrix (i.e. it is the lower triangle, stored as a vector) it seems to me that it should be possible to do this calculation in a more memory-efficient way - provided we are careful not to create large intermediate objects and just copy the relevant elements of the input directly to the output.

Looking at the code for getS3method('as.dist', 'default'), I see that it does the calculation using ans <- m[row(m) > col(m)] which requires the creation of row and col matrices of the same dimensionality as the input.

I thought I might be able to improve on this using the algorithm from here to generate the indexes of the lower triangle. Here is my attempt using this method.

as.dist.new = function(dm, diag = FALSE, upper = FALSE) {
  n = dim(dm)[1]
  stopifnot(is.matrix(dm))
  stopifnot(dim(dm)[2] == n)
  k = 1:((n^2 - n)/2)
  j <- floor(((2 * n + 1) - sqrt((2 * n - 1) ^ 2 - 8 * (k - 1))) / 2)
  i <- j + k - (2 * n - j) * (j - 1) / 2
  idx = cbind(i,j)
  remove(i,j,k)
  gc()
  d = dm[idx]

  class(d) <- "dist"
  attr(d, "Size") <- n
  attr(d, "call") <- match.call()
  attr(d, "Diag") <- diag
  attr(d, "Upper") <- upper
  d
}

This works fine on smaller matrices. Here's a simple example:

N = 10
dm = matrix(runif(N*N), N, N) 
diag(dm) = 0

x = as.dist(dm)
y = as.dist.new(dm)

However, if we create a larger distance matrix it runs into the same memory issues as as.dist.

E.g. this version crashes on my system:

N = 73000
dm = matrix(runif(N*N), N, N)
gc() 
diag(dm) = 0
gc()

as.dist.new(dm)

Does anyone have a suggestion how to perform this operation more efficiently? R or Rcpp solutions welcome. NB looking at this answer to a related problem (generating the full distance matrix from 2-column location data) it seems that it may be possible to do this using RcppArmadillo, but I've got no experience of using that.

Nellanellda answered 25/1, 2020 at 6:52 Comment(6)
Please try @mharinga's solution as suggested here: #37407565Haunted
Thanks for the link @Haunted - that package in that answer could be useful to generate distance matrices more quickly. But what I am doing is different. It is converting a distance matrix into a dist object.Nellanellda
There is no surprise here. You have 42GB matrix + 21GB vector, that is already a 63GB. You can try to have your matrix on disk using e.g. FBMs from package {bigstatsr} (disclaimer: I'm the author).Oira
Thanks for the suggestion @F.Privé - but my problem is not lack of disk space (I could of course just increase the swap size), but rather that I should have already have enough memory available for the matrix plus vector. So looking for more efficient solutions that limit the amount of disk swapping required.Nellanellda
You do NOT have enough RAM for the 2 objects + other stuff. This is why I recommend you to have the matrix on disk instead.Oira
I appreciate that your package could help others so the comments are welcome. However for me "recommend you to have the matrix on disk instead" is exactly what the swap space does on Linux. Effectively I have 128 Gb working memory available. The operation does not fail when it crosses the 64 Gb barrier. It fails when it goes over 128. In any case, I have a more efficient solution now (as per my answer below) so there is no need to put the matrix on disk at all.Nellanellda
N
2

My current solution is to calculate the dist object directly from lat and lon vectors, without generating the intermediate distance matrix at all. On large matrices, this is several hundred times faster than the "conventional" pipeline of geosphere::mdist() followed by stats::as.dist() and uses only as much memory as required to store the final dist object.

The following Rcpp source is based on using the functions from here to calculate haversine distance in c++, together with an adaptation of @Alexis method to iterate through the lower triangle elements in c++.

#include <Rcpp.h>
using namespace Rcpp;

double distanceHaversine(double latf, double lonf, double latt, double lont, double tolerance){
  double d;
  double dlat = latt - latf;
  double dlon =  lont - lonf;
  d = (sin(dlat * 0.5) * sin(dlat * 0.5)) + (cos(latf) * cos(latt)) * (sin(dlon * 0.5) * sin(dlon * 0.5));
  if(d > 1 && d <= tolerance){
    d = 1;
  }
  return 2 * atan2(sqrt(d), sqrt(1 - d)) * 6378137.0;
}

double toRadians(double deg){
  return deg * 0.01745329251;  // PI / 180;
}

//-----------------------------------------------------------
// [[Rcpp::export]]
NumericVector calc_dist(Rcpp::NumericVector lat, 
                        Rcpp::NumericVector lon, 
                        double tolerance = 10000000000.0) {
  std::size_t nlat = lat.size();
  std::size_t nlon = lon.size();
  if (nlat != nlon) throw std::range_error("lat and lon different lengths");
  if (nlat < 2) throw std::range_error("Need at least 2 points");
  std::size_t size = nlat * (nlat - 1) / 2;
  NumericVector ans(size);
  std::size_t k = 0;
  double latf;
  double latt;
  double lonf;
  double lont;
  
  for (std::size_t j = 0; j < (nlat-1); j++) {
    for (std::size_t i = j + 1; i < nlat; i++) {
      latf = toRadians(lat[i]);
      lonf = toRadians(lon[i]);
      latt = toRadians(lat[j]);
      lont = toRadians(lon[j]);
      ans[k++] = distanceHaversine(latf, lonf, latt, lont, tolerance);
    }
  }
  
  return ans;
}

/*** R
as_dist = function(lat, lon, tolerance  = 10000000000.0) {
  dd = calc_dist(lat, lon, tolerance)
  attr(dd, "class") = "dist"
  attr(dd, "Size") = length(lat)
  attr(dd, "call") = match.call()
  attr(dd, "Diag") = FALSE
  attr(dd, "Upper") = FALSE
  return(dd)
}
*/
Nellanellda answered 26/1, 2020 at 3:50 Comment(0)
C
2

Well, the logic to traverse the lower triangular is relatively simple, and if you do it in C++ then it can also be fast:

library(Rcpp)

sourceCpp(code='
// [[Rcpp::plugins(cpp11)]]

#include <cstddef> // size_t

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector as_dist(const NumericMatrix& mat) {
  std::size_t nrow = mat.nrow();
  std::size_t ncol = mat.ncol();
  std::size_t size = nrow * (nrow - 1) / 2;
  NumericVector ans(size);

  if (nrow > 1) {
    std::size_t k = 0;
    for (std::size_t j = 0; j < ncol; j++) {
      for (std::size_t i = j + 1; i < nrow; i++) {
        ans[k++] = mat(i,j);
      }
    }
  }

  ans.attr("class") = "dist";
  ans.attr("Size") = nrow;
  ans.attr("Diag") = false;
  ans.attr("Upper") = false;
  return ans;
}
')

as_dist(matrix(1:100, 10, 10))
    1  2  3  4  5  6  7  8  9
2   2                        
3   3 13                     
4   4 14 24                  
5   5 15 25 35               
6   6 16 26 36 46            
7   7 17 27 37 47 57         
8   8 18 28 38 48 58 68      
9   9 19 29 39 49 59 69 79   
10 10 20 30 40 50 60 70 80 90
Cantal answered 25/1, 2020 at 12:15 Comment(5)
Thanks. Works well on smaller matrices. But runs into the same memory issues on larger ones. Although I confess I don;t know why - it looks as though it should work. I'm guessing it has something to do with R creating copies of objects that are passed or returned. But I thought that only pointers are passed between R and C++, so I'm confused why this should be the case.Nellanellda
@Nellanellda weird, what's the exact error message you're getting? And which OS are you using?Cantal
I see the memory (both RAM and swap) fill up gradually using system monitor until R session crashes once both are at 100%. Using Linux (Pop!OS) on 64Gb Ram+ 64Gb Swap.Nellanellda
@Nellanellda not sure what could be wrong, maybe try adding some good old Rcout << "foo" debugging statements to see if it actually runs the code or breaks as soon as ans is allocated?Cantal
I've changed approach a bit and am now using a function (based loosely on your answer) that calculates the dist object directly from lat and lon vectors, without generating the intermediate distance matrix at all. This seems the best way forwards. I'll try your debugging suggestion later, as it is also interesting to understand why this version failed. One possibility - shouldn't j iterate from 0 to (ncol-1)? running the loop up to ncol could go out of bounds. Although I'd expect this to throw an out of bounds error rather than just run out of memory.Nellanellda
N
2

My current solution is to calculate the dist object directly from lat and lon vectors, without generating the intermediate distance matrix at all. On large matrices, this is several hundred times faster than the "conventional" pipeline of geosphere::mdist() followed by stats::as.dist() and uses only as much memory as required to store the final dist object.

The following Rcpp source is based on using the functions from here to calculate haversine distance in c++, together with an adaptation of @Alexis method to iterate through the lower triangle elements in c++.

#include <Rcpp.h>
using namespace Rcpp;

double distanceHaversine(double latf, double lonf, double latt, double lont, double tolerance){
  double d;
  double dlat = latt - latf;
  double dlon =  lont - lonf;
  d = (sin(dlat * 0.5) * sin(dlat * 0.5)) + (cos(latf) * cos(latt)) * (sin(dlon * 0.5) * sin(dlon * 0.5));
  if(d > 1 && d <= tolerance){
    d = 1;
  }
  return 2 * atan2(sqrt(d), sqrt(1 - d)) * 6378137.0;
}

double toRadians(double deg){
  return deg * 0.01745329251;  // PI / 180;
}

//-----------------------------------------------------------
// [[Rcpp::export]]
NumericVector calc_dist(Rcpp::NumericVector lat, 
                        Rcpp::NumericVector lon, 
                        double tolerance = 10000000000.0) {
  std::size_t nlat = lat.size();
  std::size_t nlon = lon.size();
  if (nlat != nlon) throw std::range_error("lat and lon different lengths");
  if (nlat < 2) throw std::range_error("Need at least 2 points");
  std::size_t size = nlat * (nlat - 1) / 2;
  NumericVector ans(size);
  std::size_t k = 0;
  double latf;
  double latt;
  double lonf;
  double lont;
  
  for (std::size_t j = 0; j < (nlat-1); j++) {
    for (std::size_t i = j + 1; i < nlat; i++) {
      latf = toRadians(lat[i]);
      lonf = toRadians(lon[i]);
      latt = toRadians(lat[j]);
      lont = toRadians(lon[j]);
      ans[k++] = distanceHaversine(latf, lonf, latt, lont, tolerance);
    }
  }
  
  return ans;
}

/*** R
as_dist = function(lat, lon, tolerance  = 10000000000.0) {
  dd = calc_dist(lat, lon, tolerance)
  attr(dd, "class") = "dist"
  attr(dd, "Size") = length(lat)
  attr(dd, "call") = match.call()
  attr(dd, "Diag") = FALSE
  attr(dd, "Upper") = FALSE
  return(dd)
}
*/
Nellanellda answered 26/1, 2020 at 3:50 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.