Optimizing execution of a CUDA kernel for Triangular Matrix calculation

I am developing my first Cuda application, and I have a kernel with "below-expected throughput", which seems to be the biggest bottleneck at the moment.

The task of the kernel is to compute an N by N sized matrix (DD) containing squared distances between all elements on a data matrix. The data matrix (Y) is size N by D (to support multi dimensional data) and stored as row-major.

Source:

__global__ void computeSquaredEuclideanDistance(const float * __restrict__ Y, float * __restrict__ DD, const int N, const int D) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < N * N; i += stride) {
        const int m = i / N;
        const int n = i % N;
        float tmp = 0;
        for (int d = 0; d < D; ++d) {
            const float Ynd = Y[d + D * n];
            const float Ymd = Y[d + D * m];
            const float Ydiff = Ynd - Ymd;
            tmp += Ydiff * Ydiff;
        }
        DD[n + N * m] = tmp;
    }
}

This is being called with size_t blockSize = 256 and size_t numBlocks = (N*N + blockSize - 1)/blockSize.

How can I optimize this kernel? My initial thought is that the time-consuming part is reading data without exploiting some sort of shared memory, but can anyone give me pointers on how to approach this?

Remarks from the nvvc profiling tool:

Latency analysis:
- Compute utilization at around 40%
- Memory (L2 cache) utilization at around 35%
Occupancy is not an issue
- Active Warps at 57.59 of a theoretical 64
- Occupancy at 90% of a theoretical 100

For my application, typical values are:

5k < N < 30k
D is either 2 or 3

I typically disregard these types of optimization questions because they are on the verge of off-topic, in my opinion. Worst still, you provide no MCVE so anyone trying to answer would have to write all their own support code to compile and benchmark your kernel. And this sort of work does require benchmarking and code analysis. But because your problem is basically a linear algebra problem (and I like linear algebra), I answered it rather than close voting it as too broad......

With that off my chest. there are a couple of things which immediately jump out in the code which could be improved and which would probably have a material affect on the run time.

The first is that the trip count of the inner loop is known a priori. Anytime you have a situation like that, let the compiler know. Loop unrolling and code reordering is a very powerful compiler optimization and the NVIDIA compiler is extremely good at it. If you move D into a template parameter, you can do something like this:

template<int D>
__device__ float esum(const float *x, const float *y)
{
    float val = 0.f;
#pragma unroll
    for(int i=0; i<D; i++) { 
        float diff = x[i] - y[i];
        val += diff * diff;
    }
    return val;
}

template<int D>
__global__ 
void vdistance0(const float * __restrict__ Y, float * __restrict__ DD, const int N)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < N * N; i += stride) {
        const int m = i / N;
        const int n = i % N;
        DD[n + N * m] = esum<D>(Y + D * n, Y + D * m);
    }
}

template __global__ void vdistance0<2>(const float *, float *, const int);
template __global__ void vdistance0<3>(const float *, float *, const int);

The compiler will inline esum and unroll the inner loop and it can then use its reordering heuristics to better interleave loads and flops to improve throughput. The resulting code has a lower register footprint too. When I run this for N=10000 and D=2, I get about 35% speed up (7.1ms versus 4.5ms on a GTX 970 with CUDA 9.1).

But there is an even more glaringly obvious optimization than this. The calculation you are performing will produce a symmetric output matrix. You only need to do (N*N)/2 operations to compute the full matrix, rather than the N*N you are doing in your code [technically N(N/2 -1) because the diagonal entries are zero, but lets forget the diagonal for the purposes of this discussion].

So taking a different approach and using one block to calculate each row of the upper triangular output matrix, then you can do something like this:

struct udiag
{
    float *p;
    int m;

    __device__ __host__ udiag(float *_p, int _m) : p(_p), m(_m) {};
    __device__ __host__ float* get_row(int i) { return p + (i * (i + 1)) / 2; };
};


template<int D>
__global__ 
void vdistance2(const float * __restrict__ Y, float * __restrict__ DD, const int N)
{
     int rowid = blockIdx.x;
     int colid = threadIdx.x;
     udiag m(DD, N);

     for(; rowid < N; rowid += gridDim.x) {
         float* p = m.get_row(rowid);
         const float* y = Y + D * rowid;
         for(int i=colid; i < (N-rowid); i += blockDim.x) {
             p[i] = esum<D>(y, y + D * i);
         }
    }
}
template __global__ void vdistance2<2>(const float *, float *, const int);
template __global__ void vdistance2<3>(const float *, float *, const int);

This uses a little helper class to encapsulate the triangle numbers needed for the addressing scheme for the upper triangular output matrix. Doing this saves an enormous amount of memory and memory bandwidth as well as reducing the total FLOP count for the calculation. If you need to do other things afterwards BLAS (and CUBLAS) supports computations on upper or lower triangular matrices. Use them. When I run this I get about 75% speedup (7.1ms versus 1.6ms on the same GTX 970).

Huge disclaimer: All the code you see here was written during a 45 minute lunch break and as been very lightly tested. I make absolutely no claims that anything in this answer is actually correct. I have confirmed that it compiles and doesn't produce a runtime error when I run it to get profiling data. That is it. Cavaet Emptor and all that.

Recommended topics

Hot tags