Big data ways to calculate sets of distances in R?

Problem: We need a big-data method for calculating distances between points. We outline below, with a five-observation dataframe, what we'd like to do. However, this particular method is infeasible as the number of rows gets large (> 1 million). In the past we've used SAS for this kind of analysis, but we'd prefer R if possible. (Note: beyond a toy-scale sketch, I'm not going to show code, because the method outlined below is essentially impossible to use with data at our scale.)

We start with a dataframe of stores, each of which has a latitude and longitude (though this is not a spatial file, nor do we want to use a spatial file).

# you can think of x and y in this example as Cartesian coordinates
stores <- data.frame(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))

stores
  id x y
1  1 1 1
2  2 0 2
3  3 1 0
4  4 2 2
5  5 0 0

For each store, we want to know the number of other stores within some cutoff distance. In a small dataframe this is straightforward: create another dataframe of all coordinates, merge it back in, calculate the distances, create an indicator for whether each distance is less than the cutoff, and add up the indicators (minus one for the store itself, which is at distance 0). This would result in a dataset that looks like this (a toy-scale sketch of the same logic appears after the tables below):

   id x y  s1.dist  s2.dist  s3.dist  s4.dist  s5.dist
1:  1 1 1 0.000000 1.414214 1.000000 1.414214 1.414214
2:  2 0 2 1.414214 0.000000 2.236068 2.000000 2.000000
3:  3 1 0 1.000000 2.236068 0.000000 2.236068 1.000000
4:  4 2 2 1.414214 2.000000 2.236068 0.000000 2.828427
5:  5 0 0 1.414214 2.000000 1.000000 2.828427 0.000000

When you (arbitrarily) count distances under 1.45 as "close," you end up with indicators that look like this:

# don't include the store itself in the total
   id x y s1.close s2.close s3.close s4.close s5.close total.close
1:  1 1 1        1        1        1        1        1           4
2:  2 0 2        1        1        0        0        0           1
3:  3 1 0        1        0        1        0        1           2
4:  4 2 2        1        0        0        1        0           1
5:  5 0 0        1        0        1        0        1           2

The final product should look like this:

   id total.close
1:  1           4
2:  2           1
3:  3           2
4:  4           1
5:  5           2
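
To make that logic concrete, here is a rough toy-scale sketch in base R (again, this is exactly the approach that will not scale; the 1.45 cutoff is the arbitrary one used above, and the object names are just for illustration):

# cross join every store with every other store
pairs <- merge(stores, setNames(stores, c("id2", "x2", "y2")), by = NULL)

# Euclidean distance for each pair, flagged if "close"
pairs$dist  <- sqrt((pairs$x - pairs$x2)^2 + (pairs$y - pairs$y2)^2)
pairs$close <- pairs$dist < 1.45

# add up the indicators per store, minus one for the store itself
totals <- aggregate(close ~ id, data = pairs, FUN = function(z) sum(z) - 1L)
names(totals)[2] <- "total.close"
totals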

All advice appreciated.

Thank you very much

Cuffs asked 17/12, 2021 at 16:31 Comment(5)
Exactly how large is the data set? Is the data set too big to bring into R? There are options for data stores with Hadoop and other distributed storage systems. If the data can be fully imported into R, there are many options. You can read about some of these options here. – Swisher
The data are already on an HPCC. The issue is that to create the kind of matrix I describe above, it'd be something like a 1,000,000 x 1,000,000 dataframe, which even with parallelization and HPCs isn't ideal. Correct me if I've misunderstood what you're suggesting, though. – Cuffs
I should also add that we're using confidential data and so are restricted in what packages we can use or add. Anything connecting to the internet is not allowed, which would seem to preclude Hadoop, if I'm understanding the documentation correctly. – Cuffs
R can handle about 2M rows (or columns), so you will have to employ methods like clustering from the HPC. However, since the actions you're trying to take aren't particularly complicated, you may find that the data.table package is your best bet. Since I'm not sure what metric you're looking for between coordinates (e.g., haversine, Vincenty, Euclidean) or the scale (e.g., miles, kilometers), I can't offer much more than a package name! (A rough sketch of that suggestion follows these comments.) – Swisher
1 million points? That's too much, given that you would have to compute n(n-1)/2 distances, i.e. ~500 billion distances. – Flamenco
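
For reference, a rough sketch of that data.table suggestion, assuming Euclidean distance and the toy example's 1.45 cutoff (a non-equi self-join on an eps-sized bounding box keeps the candidate pairs much smaller than the full n x n grid when points are spread out; the object and column names are just for the sketch, and nothing here is benchmarked at the 1M-row scale):

library(data.table)
eps <- 1.45

pts <- as.data.table(stores)
setnames(pts, c("x", "y"), c("sx", "sy"))   # avoid clashes with the join column names

# one "query box" of half-width eps per store
boxes <- pts[, .(qid = id, qx = sx, qy = sy,
                 xlo = sx - eps, xhi = sx + eps,
                 ylo = sy - eps, yhi = sy + eps)]

# non-equi self-join: candidate neighbours are the stores inside each box
cand <- pts[boxes,
            on = .(sx >= xlo, sx <= xhi, sy >= ylo, sy <= yhi),
            .(qid, qx, qy, nid = x.id, nx = x.sx, ny = x.sy),
            allow.cartesian = TRUE]

# exact Euclidean test inside the box; subtract 1 for the store itself
cand[sqrt((nx - qx)^2 + (ny - qy)^2) < eps,
     .(total.close = .N - 1L), keyby = .(id = qid)]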

Any reason you can't loop instead of making it one big calculation?

stores <- data.frame(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))

# Here's a Euclidean distance metric, but you can drop anything you want in here
distfun <- function(x0, y0, x1, y1){
  sqrt((x1-x0)^2+(y1-y0)^2)
}

# Loop over each store
t(sapply(seq_len(nrow(stores)), function(i){
  distances <- distfun(x0 = stores$x[i], x1 = stores$x,
                       y0 = stores$y[i], y1 = stores$y)
  # Calculate number less than arbitrary cutoff, subtract one for self
  num_within <- sum(distances<1.45)-1
  c(stores$id[i], num_within)
}))

Produces:

     [,1] [,2]
[1,]    1    4
[2,]    2    1
[3,]    3    2
[4,]    4    1
[5,]    5    2

This will work with a data set of any size that you can bring into R, but it'll just get slower as the size increases. Here's a test on 10,000 entries that runs in a couple seconds on my machine:

stores <- data.frame(id=1:10000, 
                     x=runif(10000, max = 10), 
                     y=runif(10000, max = 10))
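
# then rerun the t(sapply(...)) block from above on this larger `stores`;
# the first and last rows of the result look like this: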
          [,1] [,2]
    [1,]     1  679
    [2,]     2  698
    [3,]     3  618
    [4,]     4  434
    [5,]     5  402
...
 [9995,]  9995  529
 [9996,]  9996  626
 [9997,]  9997  649
 [9998,]  9998  514
 [9999,]  9999  667
[10000,] 10000  603

It gets slower with more calculations (because it has to run over every pair of points, this will always be O(n^2)), but without knowing the actual distance metric you'd like to calculate, we can't optimize the slow part any further.
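
If the single-threaded loop becomes the bottleneck, one rough sketch (not benchmarked here: it relies on parallel::mclapply forking, so it assumes a Unix-like system, and the core count is arbitrary) is to split the outer loop across cores:

library(parallel)

counts <- mclapply(seq_len(nrow(stores)), function(i){
  distances <- distfun(x0 = stores$x[i], x1 = stores$x,
                       y0 = stores$y[i], y1 = stores$y)
  sum(distances < 1.45) - 1
}, mc.cores = 4)   # forked workers; pick a count that suits the machine

data.frame(id = stores$id, total.close = unlist(counts))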

Periwinkle answered 17/12, 2021 at 18:8 Comment(3)
This is the same as doing the whole computation in a vectorized format. You are still repeating computations: e.g., once you have computed the distance between 1 and 2, you compute the distance between 2 and 1 again, which keeps the time complexity of this function at O(n^2). And that, my friend, will not work on 1 million+ rows. – Flamenco
@Flamenco yep, agreed - but at least with a time complexity of O(n^2) it's doable (maybe once, to create a database, rather than something interactive?), while a memory complexity of O(n^2) will require hardware that simply doesn't exist yet - see my comment on jay's answer for an estimate of ~4TB of RAM required for 1M rows. – Periwinkle
Also, the distance matrix isn't guaranteed to be symmetrical - here in Euclidean space it is, but in many areas of research the distance between A and B isn't always the same as the distance between B and A, and there's no way to avoid "repeating" calculations if that's the case. – Periwinkle

Have you already tried the classic dist() function? The core is implemented in C and should thus be fast.

The coercion to a matrix (which takes place in dist anyway) probably already costs a lot of time; maybe the data could be read in as a matrix right away rather than first as a data frame.

M <- as.matrix(stores[-1])

dist(M, diag=TRUE, upper=TRUE)
#          1        2        3        4        5
# 1 0.000000 1.414214 1.000000 1.414214 1.414214
# 2 1.414214 0.000000 2.236068 2.000000 2.000000
# 3 1.000000 2.236068 0.000000 2.236068 1.000000
# 4 1.414214 2.000000 2.236068 0.000000 2.828427
# 5 1.414214 2.000000 1.000000 2.828427 0.000000
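
From there, the counts the question asks for could be pulled out with something like the following (using the question's 1.45 cutoff; note that as.matrix materializes the full n x n matrix, so this part is only for illustration on small data):

D <- as.matrix(dist(M))
# the diagonal (distance 0) counts as "close", hence the -1 for the store itself
data.frame(id = stores$id, total.close = rowSums(D < 1.45) - 1L)
# total.close: 4, 1, 2, 1, 2 (matching the expected result above)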

Otherwise you could try this C++ implementation, which is basically a copy of @coatless's code; however, I used the Rcpp package so that it can be called from an R script.

library(Rcpp)
cppFunction('Rcpp::NumericMatrix calcPWD1 (const Rcpp::NumericMatrix & x){
  unsigned int outrows = x.nrow(), i = 0, j = 0;
  double d;
  Rcpp::NumericMatrix out(outrows, outrows);

  // the matrix is symmetric with zeros on the diagonal, so compute each
  // pairwise Euclidean distance once and mirror it into both triangles
  for (i = 0; i < outrows - 1; i++){
    Rcpp::NumericVector v1 = x.row(i);
    for (j = i + 1; j < outrows; j++){
      d = sqrt(sum(pow(v1 - x.row(j), 2.0)));
      out(j, i) = d;
      out(i, j) = d;
    }
  }

  return out;
}')

calcPWD1(M)
#          [,1]     [,2]     [,3]     [,4]     [,5]
# [1,] 0.000000 1.414214 1.000000 1.414214 1.414214
# [2,] 1.414214 0.000000 2.236068 2.000000 2.000000
# [3,] 1.000000 2.236068 0.000000 2.236068 1.000000
# [4,] 1.414214 2.000000 2.236068 0.000000 2.828427
# [5,] 1.414214 2.000000 1.000000 2.828427 0.000000

However, the benchmark is still clearly in favor of dist, so you should give that a try first:

M_big <- M[sample(nrow(M), 1e4, replace=TRUE), ]  ## inflate to 10k rows
microbenchmark::microbenchmark(
  dist=dist(M_big, diag=TRUE, upper=TRUE),
  calcPWD1=calcPWD1(M_big),
  control=list(warmup=10L),
  times=3L
)
# Unit: milliseconds
#     expr       min        lq     mean   median        uq       max neval cld
#     dist  640.1861  660.1396  765.881  680.093  828.7284  977.3638     3  a 
# calcPWD1 1419.4106 1439.1353 1505.253 1458.860 1548.1736 1637.4873     3   b

Be sure to read @coatless's and Dirk Eddelbuettel's answers, where they write some more about C, C++, and R and provide other versions of the function.

Powdery answered 17/12, 2021 at 19:8 Comment(2)
dist will definitely break with a million entries! Running it in your example with 10k entries already occupies ~400MB in memory, with an expected increase to 40GB at 100k rows and 4TB of memory required at OP's 1M rows (arithmetic sketched after these comments). – Periwinkle
I think the question is rather whether dist itself would break or whether the RAM would simply be insufficient. – Powdery
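
For reference, the back-of-the-envelope arithmetic behind those memory estimates (dist stores the lower triangle of the distance matrix, i.e. n(n-1)/2 doubles at 8 bytes each; the helper name here is just a throwaway):

# bytes needed for the lower triangle of an n x n double-precision distance matrix
tri_bytes <- function(n) n * (n - 1) / 2 * 8
tri_bytes(1e4) / 2^20   # ~381 MB for 10k rows
tri_bytes(1e5) / 2^30   # ~37  GB for 100k rows
tri_bytes(1e6) / 2^40   # ~3.6 TB for 1M rows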
