Speeding up the L1 distance between all pairs in a ground set
I have an NxM matrix (usually 10k x 10k) describing a ground set. Each line represents an object and each column a specific feature. For example, in the matrix

   f1 f2 f3
x1 0  4  -1
x2 1  0  5
x3 4  0  0
x4 0  1  0

the object x1 has value 0 in feature f1, value 4 in feature f2, and value -1 in feature f3. The values are general real numbers (doubles).

I have to compute several custom distances/dissimilarities between all pairs of objects (all pairs of lines). For comparison, I want to compute the L1 (Manhattan) and L2 (Euclidean) distances.

I use the Eigen library to perform most of my computations. To compute the L2 (Euclidean) distance, I use the following observation: for two vectors a and b of size n, we have:

||a - b||^2 = (a_1 - b_1)^2 + (a_2 - b_2)^2 + ... + (a_n - b_n)^2
            = a_1^2 + b_1^2 - 2 a_1 b_1 + a_2^2 + b_2^2 - 2 a_2 b_2 + ... + a_n^2 + b_n^2 - 2 a_n b_n
            = a . a + b . b - 2 a . b

In other words, we rewrite the squared norm using the dot products of the vectors with themselves, minus twice the dot product between them. From that, we just take the square root and we are done. By the way, I found this trick a long time ago and unfortunately lost the reference to its author.

Anyway, this enables writing fancy code using Eigen (in C++):

#include <Eigen/Dense>

Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> matrix, XX, D;

// Load matrix here, for example
// matrix << 0, 4, -1,
//           1, 0,  5,
//           4, 0,  0,
//           0, 1,  0;

const auto N = matrix.rows();

XX.resize(N, 1);
D.resize(N, N);

// XX(i) = x_i . x_i (the squared norm of each row)
XX = matrix.array().square().rowwise().sum().matrix();

// D(i, j) = x_i . x_i + x_j . x_j
D.noalias() = XX * Eigen::MatrixXd::Ones(1, N) +
              Eigen::MatrixXd::Ones(N, 1) * XX.transpose();

// D(i, j) -= 2 x_i . x_j
D -= 2 * matrix * matrix.transpose();

// clamp tiny negatives caused by rounding before taking the square root
D = D.cwiseMax(0.0).cwiseSqrt();

For a 10k x 10k matrix, we are able to compute the L2 distance for all pairs of objects/lines in less than 1 min (2 cores / 4 threads), which I personally consider good performance for my purposes. Eigen combines the operations and uses several low- and high-level optimizations to perform these computations. In this case, Eigen uses all available cores (and, of course, we can tune that).
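For instance, a minimal sketch of pinning Eigen's thread count from code (the value 2 is just an example for my machine; Eigen::setNbThreads takes effect when Eigen is built with OpenMP, and setting OMP_NUM_THREADS has a similar effect):

#include <Eigen/Core>

// restrict Eigen's OpenMP-backed parallelism, e.g. to the
// number of physical cores
Eigen::setNbThreads(2);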

However, I still need to compute the L1 distance, but I couldn't figure out a good algebraic form to use with Eigen to obtain nice performance. Up to now I have the following:

const auto N = matrix.rows();
// distance is assumed to be a preallocated N x N matrix
for(long i = 0; i < N - 1; ++i) {
    const auto &row = matrix.row(i);

    #ifdef _OPENMP
    #pragma omp parallel for shared(row)
    #endif
    for(long j = i + 1; j < N; ++j) {
        distance(i, j) = (row - matrix.row(j)).lpNorm<1>();
    }
}

But this takes much longer: for the same 10k x 10k matrix, this code takes 3.5 min, which is much worse considering that L1 and L2 are very close in their original forms:

L1(a, b) = sum_i |a_i - b_i|
L2(a, b) = sqrt(sum_i |a_i - b_i|^2)

Any idea how to change the L1 computation to use fast operations with Eigen? Or is there a better form to do that which I just didn't figure out?

Thank you very much for your help!

Carlos

Prehistory answered 25/5, 2015 at 22:17 Comment(11)
This does not answer your question, but note that if you have only 2 physical cores, then you should enable only 2 threads as hyperthreading slows down CPU intensive operations. You can also initialize D using replicate: D = XX.replicate(1,n) + XX.transpose().replicate(n,1);Aubine
@Aubine Indeed, I always use only the number of physical cores and, when possible, I turn the hyperthreading off in the machines. BTW, thanks for the tip.Prehistory
L2 can be done in O(N^2.81) using the Strassen algorithm for fast matrix multiplication, which the library might already be using. But L1 is so straightforward that it takes O(N^3) to complete. That might be the reason L1 is slower.Closestool
Can you map (a-b) to a sign vector of +1/-1 entries, depending on the sign of a_i - b_i, and then do |a-b| = (a-b) . sign?Sweven
I'm going to go out on a limb here... Notice that you're manipulating rows. However, by default, Eigen matrices are in column-major order (eigen.tuxfamily.org/dox-devel/group__QuickRefPage.html). That means that whenever you're calling row(), Eigen has to read from a lot of non-contiguous areas of memory. Could it be that you will get better performance/lower number of cache misses if you switch to row-major order? Note that matrix multiplications used by the L2 norm aren't as affected by this, since the underlying operations are optimized for both orders via the 'T' parameters in dgemmThrelkeld
@PatrickMineault Yes, you are right. Indeed, I have changed the matrix order to speed things up a little. It makes a good difference, but not the one I am looking for. Anyway, thanks for the notice.Prehistory
If it were possible to use 8-bit values, you could use the instruction or intrinsic _mm_mpsadbw_epu8. In this way, you can do 8 8-byte sums of absolute differences in 9 clock cycles. software.intel.com/sites/default/files/m/a/9/b/7/b/1000-SSE.pdf.Fury
Googling brings up a paper talking about using GPUs, as opposed to algorithmic improvements, so there might not be a known clever way to reduce it below O(n^3) work. There's a similar question on the Math Stack Exchange, but it also has no answers.Unprintable
@JensMunk I will check that. I have mixed cases. Sometimes I have just binary vectors; other times I have vectors with small or large "double" values. I have to check if I can downsize these values without losing precision, but my guess is that it is not possible. Thanks for the link.Prehistory
@Prehistory By the way, if you have sum-of-absolute differences of 8-bit values, you can easily obtain the same for 16-bit values by combining two 8-bit sum of absolute differences.Fury
I can't help you mathematically, but I can assure you that there are faster ways to do it programmatically. For a 10k by 10k matrix it may be worth considering the GPU. Also, my own experience shows that using SIMD vector instructions is much faster than just using OpenMP for parallelization. So, regardless of whether you use OpenMP or not, you should write your code to use SIMD.Damara
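As a side note to the storage-order comment above, switching the matrix to row-major storage is a one-line change. A minimal sketch (the alias name is just illustrative):

#include <Eigen/Dense>

// row-major storage keeps each object's features contiguous in memory,
// so matrix.row(i) reads a contiguous block instead of strided elements
using RowMajorMatrixXd =
    Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

RowMajorMatrixXd matrix; // drop-in replacement for the declaration above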
Let's just calculate both distances at the same time. They only really share the row difference (while both could use the absolute difference, the Euclidean distance squares it, which isn't really equivalent). So now we only loop over the n^2/2 row pairs once instead of once per distance.

const auto N = matrix.rows();
for(long i = 0; i < N - 1; ++i) {
    const auto &row = matrix.row(i);

    #ifdef _OPENMP
    #pragma omp parallel for shared(row)
    #endif
    for(long j = i + 1; j < N; ++j) {
        // evaluate the difference once; a lazy expression (auto) would
        // recompute it for each of the two norms below
        const Eigen::RowVectorXd rowDiff = row - matrix.row(j);
        distanceL1(i, j) = rowDiff.cwiseAbs().sum(); // or .lpNorm<1>(), if it's faster
        distanceL2(i, j) = rowDiff.norm();
    }
}

EDIT: another, more memory-intensive/untested method could be to calculate a whole distance row each iteration (I don't know whether this would be an improvement or not):

const auto N = matrix.rows();
#ifdef _OPENMP
#pragma omp parallel for shared(matrix)
#endif
for(long i = 0; i < N - 1; ++i) {
    const auto &row = matrix.row(i);
    // you could use matrix.block(i,j,k,l) to cut down on the number of unnecessary operations
    // evaluate the broadcast difference into a concrete matrix, since it is used twice
    const Eigen::MatrixXd mat = matrix.rowwise() - row;

    distanceL1.row(i) = mat.cwiseAbs().rowwise().sum().transpose();
    distanceL2.row(i) = mat.rowwise().norm().transpose();
}
// row N-1 is covered by symmetry: distance(N-1, j) == distance(j, N-1)
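Note that both snippets assume distanceL1 and distanceL2 are preallocated, and the pairwise version fills only the upper triangle. A minimal setup and mirroring step might look like this (assuming the names above):

Eigen::MatrixXd distanceL1 = Eigen::MatrixXd::Zero(N, N);
Eigen::MatrixXd distanceL2 = Eigen::MatrixXd::Zero(N, N);

// ... run one of the loops above ...

// mirror the upper triangle when the full symmetric matrix is needed
distanceL1.triangularView<Eigen::StrictlyLower>() = distanceL1.transpose();
distanceL2.triangularView<Eigen::StrictlyLower>() = distanceL2.transpose();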
Sperm answered 3/8, 2015 at 18:39 Comment(0)

These are two very common operations in image processing. The first is the Sum of Squared Differences (SSD); the second is the Sum of Absolute Differences (SAD).

You've correctly determined that SSD just requires one to compute the cross-correlation between the two series as the main computation. However, you may want to consider using the FFT to compute these a.b terms; it will cut down the number of operations significantly for the L2 case (by how much I do not know; it depends on what matrix-matrix multiplication algorithm Eigen uses). If you need me to explain this I can, but I figure you can also look it up, as it's a standard use of FFTs. OpenCV has a (pretty bad/buggy) implementation of template matching, which is what you want when using CV_TM_SQDIFF.

The L1 case is trickier. It can't be decomposed as nicely, but it is also one of the simplest operations you can do (just additions and absolute values). As such, many computing architectures have parallelized implementations of it as instructions or hardware-implemented functions; on others, researchers are still experimenting with the best way to compute it.

You may also want to look into the Intel Integrated Performance Primitives (IPP) for cross-correlation, and fast FFT libraries such as FFTW and cuFFT. If you can't afford IPP, you can use SSE instructions to greatly speed up your processing to almost the same effect.
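For illustration, a sum-of-absolute-differences kernel for doubles using AVX intrinsics could look like the sketch below. It assumes AVX support; the function name and signature are illustrative, not from any library:

#include <immintrin.h>
#include <cmath>

double l1_distance_avx(const double* a, const double* b, long n) {
    const __m256d sign_mask = _mm256_set1_pd(-0.0); // only the sign bit set
    __m256d acc = _mm256_setzero_pd();
    long i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d d = _mm256_sub_pd(_mm256_loadu_pd(a + i),
                                  _mm256_loadu_pd(b + i));
        // andnot clears the sign bit, i.e. computes |d| in all 4 lanes
        acc = _mm256_add_pd(acc, _mm256_andnot_pd(sign_mask, d));
    }
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    double sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i) sum += std::fabs(a[i] - b[i]); // scalar tail
    return sum;
}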

Torre answered 4/12, 2015 at 0:8 Comment(0)
