Closest point to a path

I have two sets of points, called path and centers. For each point in path, I would like an efficient method for finding the ID of the closest point in centers. I would like to do this in R. Below is a simple reproducible example.

set.seed(1)
n <- 10000
x <- 100*cumprod(1 + rnorm(n, 0.0001, 0.002))
y <- 50*cumprod(1 + rnorm(n, 0.0001, 0.002))

path <- data.frame(cbind(x=x, y=y))

centers <- expand.grid(x=seq(0, 500,by=0.5) + rnorm(1001), 
                       y=seq(0, 500, by=0.2) + rnorm(2501))

centers$id <- seq(nrow(centers))

x and y are coordinates. I would like to add a column to the path data.frame containing the id of the closest center for each x and y coordinate, and then get all of the unique ids.

My solution at the moment does work, but is very slow when the scale of the problem increases. I would like something much more efficient.

path$closest.id <- sapply(seq(nrow(path)), function(z){
   tmp <- ((centers$x - path[z, 'x'])^2) + ((centers$y - path[z, 'y'])^2)
   as.numeric(centers[tmp == min(tmp), 'id'])
})

output <- unique(path$closest.id)
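
As suggested in the comments below, replacing the data.frame subsetting with which.min() roughly halves the runtime (the result is the same here, since id is just the row number), but it is still a brute-force scan over every center:

path$closest.id <- sapply(seq(nrow(path)), function(z){
   tmp <- ((centers$x - path[z, 'x'])^2) + ((centers$y - path[z, 'y'])^2)
   which.min(tmp)
})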

Any help on speeding this up would be greatly appreciated.

I think data.table might help, but ideally I am looking for an algorithm that is a bit smarter about the search, i.e. one that does not have to compute the distance to every single center and then pick the minimum just to get the id.

I am also happy to use Rcpp/Rcpp11 if that would help improve performance.

The slowest acceptable time for this kind of calculation would be 10 seconds, but obviously faster would be better.

Spherical asked 5/12, 2014 at 17:52 Comment(11)
What is the scale of your actual problem? What is n and nrow(centers)? – Nuclease
n is 10000, nrow(centers) is 1000000 – Spherical
What resources do you have available? OS, cores, etc.? Also, the sample data only ever gives a single 'id'; in the real data do you have multiple 'id's getting returned for a single iteration? Lastly, your comment indicates nrow(centers) at 1M but the sample data has it at roughly 2.5M; which is it, and what is your timing currently (on the sample data)? – Jadwiga
I have Windows 7 Enterprise Service Pack 1, running on an Intel Core i5-3320M CPU @ 2.60GHz, 4GB RAM, 32-bit OS. I would expect only a single id for each iteration, never more than that... and yes, the timing is on the sample data... as I would like the solution to be much more efficient than 10 seconds... – Spherical
What is your current timing on the sample data? – Jadwiga
Using sapply(seq(100), function(z){ (i.e. only the first 100 path points) it takes 9.11 seconds, so linearly extrapolating to the sample data, which has 10000 points, it should take 911 seconds, which is about 15 minutes 11 seconds... – Spherical
Doing the full set just now took 952.15 seconds, which is 15 minutes 52 seconds – Spherical
I get a speedup of about a factor of 2 by replacing the line as.numeric(centers[tmp == min(tmp), 'id']) with which.min(tmp), achieving the same result. But I gather you need a speedup of 100, so I still have 50 to go... – Claiborn
In your example, is centers representative of your true centers object? i.e. can we assume that for each y, a point exists at each x? – Copyedit
Yes, each center will have an x and y value... – Spherical
You misunderstand me - does centers include all xy combinations of your set of unique x and your set of unique y? i.e., like the output of expand.grid. – Copyedit

You can do this with nn2 from the RANN package. On my system, this computes the nearest center to each of your path points in under 2 seconds.

library(RANN)
system.time(closest <- nn2(centers[, 1:2], path, 1))

#   user  system elapsed 
#   1.41    0.14    1.55 



sapply(closest, head)

#      nn.idx   nn.dists
# [1,] 247451 0.20334929
# [2,] 250454 0.12326323
# [3,] 250454 0.28540127
# [4,] 253457 0.05178687
# [5,] 253457 0.13324137
# [6,] 253457 0.09009626

Here's another example with 2.5 million candidate points that all fall within the extent of the path points (in your example, the centers have a much larger x and y range than do the path points). It's a little slower in this case.

set.seed(1)
centers2 <- cbind(runif(2.5e6, min(x), max(x)), runif(2.5e6, min(y), max(y)))
system.time(closest2 <- nn2(centers2, path, 1))

#   user  system elapsed 
#   2.96    0.11    3.07 

sapply(closest2, head)

#       nn.idx    nn.dists
# [1,]  730127 0.025803703
# [2,]  375514 0.025999069
# [3,] 2443707 0.047259283
# [4,]   62780 0.022747930
# [5,] 1431847 0.002482623
# [6,] 2199405 0.028815865

This can be compared to the output using sp::spDistsN1 (which is much slower for this problem):

library(sp)
apply(head(path), 1, function(x) which.min(spDistsN1(centers2, x)))

#       1       2       3       4       5       6 
#  730127  375514 2443707   62780 1431847 2199405 

Adding the point id to the path data.frame and reducing to unique values is trivial:

path$closest.id <- closest$nn.idx
output <- unique(path$closest.id)
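
If the id column were not simply the row number (as it is here), the nn2 row index could be mapped through it explicitly:

path$closest.id <- centers$id[closest$nn.idx]
output <- unique(path$closest.id)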
Copyedit answered 8/12, 2014 at 14:47 Comment(5)
Thanks for this one! Reading up on RANN led to studying how the approximate kd-tree search works, which in turn led to the FLANN library, then to nanoflann (a header-only, trimmed-down FLANN), and to using Rcpp with nanoflann, which trimmed a further ~10% off the nn2 time. It used Eigen mapped matrices over the input data.frames with exact search. – Jadwiga
Perhaps I am misunderstanding the result... but what do nn.idx and nn.dists mean? i.e. how does 44 relate to the relevant center id? – Spherical
Actually figured it out... the first part should say nn2(centers[,1:2], path, 1) – Spherical
@Spherical - ah, yes, you're right - otherwise it treats it as a 3-dimensional problem. I've corrected the answer now. Note that the addition of an integer sequence id column is redundant, since nn2 returns, for each point in path, the index of the row of the closest centers point. – Copyedit
@Spherical - to answer your previous question, nn.idx is an index vector indicating which centers point is closest to each path point; nn.dists gives the distance between these closest points. – Copyedit

Here is an Rcpp11 solution. Something similar might work with Rcpp with a few changes.

#define RCPP11_PARALLEL_MINIMUM_SIZE 1000
#include <Rcpp11>

inline double square(double x){
    return x*x ;
}

// [[Rcpp::export]]
IntegerVector closest( DataFrame path, DataFrame centers ){

    NumericVector path_x = path["x"], path_y = path["y"] ;
    NumericVector centers_x = centers["x"], centers_y = centers["y"] ;

    int n_paths = path_x.size(), n_centers = centers_x.size() ; 


    IntegerVector ids = sapply( seq_len(n_paths), [&](int i){
            // seq_len() yields 1-based indices, so shift to 0-based for C++ access
            double px = path_x[i-1], py = path_y[i-1] ;

            auto get_distance = [&](int j){
                return  square(px - centers_x[j]) + square(py-centers_y[j]) ;
            } ;

            double distance = get_distance(0) ;
            int res=0;

            for( int j=1; j<n_centers; j++){
                double d = get_distance(j)  ;
                if(d < distance){
                    distance = d ;
                    res = j ;
                }
            }

            return res + 1 ;
    }) ;

    return unique(ids) ;

}

I get:

> set.seed(1)

> n <- 10000

> x <- 100 * cumprod(1 + rnorm(n, 1e-04, 0.002))

> y <- 50 * cumprod(1 + rnorm(n, 1e-04, 0.002))

> path <- data.frame(cbind(x = x, y = y))

> centers <- expand.grid(x = seq(0, 500, by = 0.5) +
+     rnorm(1001), y = seq(0, 500, by = 0.2) + rnorm(2501))

> system.time(closest(path, centers))
   user  system elapsed
 84.740   0.141  21.392

This takes advantage of the automatic parallelization of sugar, i.e. sapply is run in parallel. The #define RCPP11_PARALLEL_MINIMUM_SIZE 1000 part forces parallelization, which by default only kicks in from 10000 elements. In this case, since the inner computation is time consuming, it is worth it.

Note that you need a development version of Rcpp11 because unique is broken in the released version.
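
For reference, here is a plain Rcpp sketch of the same search, with the sugar sapply replaced by an explicit loop (single-threaded, so without the automatic parallelism; an untimed adaptation for illustration, not the code benchmarked above):

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector closest_rcpp(DataFrame path, DataFrame centers) {

    NumericVector path_x = path["x"], path_y = path["y"];
    NumericVector centers_x = centers["x"], centers_y = centers["y"];

    int n_paths = path_x.size(), n_centers = centers_x.size();
    IntegerVector ids(n_paths);

    for (int i = 0; i < n_paths; i++) {
        double px = path_x[i], py = path_y[i];
        double best = R_PosInf;
        int res = 0;

        // Brute-force scan over all centers, keeping the smallest squared distance.
        for (int j = 0; j < n_centers; j++) {
            double dx = px - centers_x[j], dy = py - centers_y[j];
            double d = dx * dx + dy * dy;
            if (d < best) {
                best = d;
                res = j;
            }
        }
        ids[i] = res + 1;   // convert to the 1-based id used in centers$id
    }

    return unique(ids);
}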

Selfpity answered 8/12, 2014 at 16:50 Comment(4)
I like the sound of "something similar might work with Rcpp" and would be up for that. Do you reckon it needs to go into Rcpp proper (for sapply) rather than RcppParallel (for a different top-level function)? – Howarth
Rcpp cannot sapply lambda functions. – Selfpity
Did you check the resulting indexes against the given sample? They aren't identical...? – Jadwiga
Ah, will do; are they by any chance just not in the same order? – Selfpity

This solution reduces the processing time for the sample dataset to roughly half of that achieved by the RANN solution.

It can be installed using devtools::install_github("thell/Rcppnanoflann")

The Rcppnanoflann solution takes advantage of Rcpp, RcppEigen and the nanoflann EigenMatrixAdaptor, along with C++11, to yield unique indexes identical to those from the original question.

library(Rcppnanoflann)
system.time(o.nano<-nnIndex(centers,path))

##    user  system elapsed 
##    0.62    0.05    0.67

* using path and centers values as defined in the original question

To achieve results identical to the original sample, the RANN solution needs a slight modification, which we time here...

library(RANN)
system.time(o.flann<-unique(as.numeric(nn2(centers[,1:2],path,1)$nn.idx)))

##    user  system elapsed 
##    1.24    0.07    1.30

identical(o.flann,o.nano)

## [1] TRUE

The working function of Rcppnanoflann takes advantage of Eigen's Map capabilities to build a fixed-type Eigen matrix from the columns of the given P data.frame.

Testing was done with the RcppParallel package, but the kd_tree object does not have a copy constructor, so the tree had to be created for each thread, which ate up any gains from parallel query processing.

RcppEigen and Rcpp11 currently don't play together, so the idea of using Rcpp11's parallel sapply for the query isn't easily tested.


// [[Rcpp::export]]
std::vector<double> nnIndex(const Rcpp::DataFrame & P, const Rcpp::DataFrame & Q )
{
  using namespace Eigen;
  using namespace Rcpp;
  using namespace nanoflann;

  // Matrix of points to be queried against.
  const NumericVector & Px(P[0]);
  const NumericVector & Py(P[1]);
  MatrixX2d M(Px.size(), 2);
  M.col(0) = VectorXd::Map(&Px[0],Px.size());
  M.col(1) = VectorXd::Map(&Py[0],Py.size());

  // The points to query.
  const NumericVector & Qx(Q[0]);
  const NumericVector & Qy(Q[1]);
  double query_pt[2];
  size_t query_count(Qx.size());

  // Populate a 2d tree.
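  // Note: KD_Tree is assumed to be a typedef (defined in the package headers)
  // for nanoflann's KDTreeEigenMatrixAdaptor over the 2-column matrix M;
  // the RcppEigen and nanoflann headers are likewise included by the package.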
  KD_Tree kd_tree( 2, M, 10 );
  kd_tree.index->buildIndex();

  std::set<size_t> nn;
  std::vector<double> out;
  out.reserve(query_count);

  size_t index(0);
  double quadrance;
  for( size_t i=0 ; i < query_count; ++i ) {
    query_pt[0] = Qx[i];
    query_pt[1] = Qy[i];
    kd_tree.index->knnSearch( &query_pt[0],1, &index, &quadrance);
    if( nn.emplace(index).second ) out.emplace_back(index+1);
  }

  return out;
}
Jadwiga answered 8/12, 2014 at 1:20 Comment(0)
