Cache friendly method to multiply two matrices

Asked 9/11, 2012 at 17:4 Answered 21/10, 2015 at 23:41

I want to multiply two matrices using a cache-friendly method (one that leads to fewer cache misses).

I found out that this can be done with a cache friendly transpose function.

But I have not been able to find this algorithm. How can I achieve this?

Bronwynbronx answered 9/11, 2012 at 17:4 Comment(0)

The word you are looking for is thrashing. Searching Google for "thrashing matrix multiplication" yields more results.

A standard multiplication algorithm for c = a*b would look like

void multiply(double[,] a, double[,] b, double[,] c)
{
    int n = a.GetLength(0); // square n x n matrices assumed; c must start zeroed
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i, j] += a[i, k] * b[k, j];
}

Basically, striding through memory in large steps is detrimental to performance. The access b[k, j] with k in the innermost loop does exactly that: each step of k jumps a whole row ahead. So instead of jumping around in memory, we can rearrange the operations so that the innermost loop varies only the second index of the matrices:

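void multiply(double[,] a, double[,] b, double[,] c)
{
   int n = a.GetLength(0); // square n x n matrices assumed
   for (int i = 0; i < n; i++)
   {
      // k = 0 initializes row i of c, so the caller need not zero c
      double t = a[i, 0];
      for (int j = 0; j < n; j++)
         c[i, j] = t * b[0, j];

      // remaining k: add a[i, k] * (row k of b) to row i of c
      for (int k = 1; k < n; k++)
      {
         t = a[i, k];
         for (int j = 0; j < n; j++)
            c[i, j] += t * b[k, j];
      }
   }
}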

This was the example given on that page. However, another option is to copy column j of b, that is b[*, j], into a temporary array beforehand and use that array in the inner loop. Even though it involves copying data around, this approach is often much faster than the naive loop. It may seem counter-intuitive, so feel free to measure it yourself.

void multiply(double[,] a, double[,] b, double[,] c)
{
    int n = a.GetLength(0); // square n x n matrices assumed
    double[] Bcolj = new double[n];
    for (int j = 0; j < n; j++)
    {
        // copy column j of b into a contiguous buffer
        for (int k = 0; k < n; k++)
            Bcolj[k] = b[k, j];

        for (int i = 0; i < n; i++)
        {
            double s = 0;
            for (int k = 0; k < n; k++)
                s += a[i, k] * Bcolj[k];
            c[i, j] = s; // row i, column j
        }
    }
}
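
The column copy can also be done for all of b at once, which is the transpose-based method the question asks about. Here is a minimal C++ sketch, assuming square n x n matrices stored row-major in flat std::vector<double> buffers. The name multiply_transposed and the flat layout are illustrative assumptions, not something from the answer above:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns c = a * b for n x n row-major matrices,
// where element (i, j) is stored at index i * n + j.
std::vector<double> multiply_transposed(const std::vector<double>& a,
                                        const std::vector<double>& b,
                                        std::size_t n)
{
    // bt(j, k) = b(k, j): each column of b becomes a contiguous row of bt.
    std::vector<double> bt(n * n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            bt[j * n + k] = b[k * n + j];

    std::vector<double> c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double s = 0.0;
            // Both operands are read with stride 1 as k advances.
            for (std::size_t k = 0; k < n; ++k)
                s += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = s;
        }
    return c;
}
```

The transpose adds O(n^2) work and memory on top of the O(n^3) multiply, so whether it pays off for a given n is something to measure on the target machine.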

Unsaid answered 26/1, 2013 at 17:56 Comment(4)

in your second code block, c[i, j] = s;, but it seems that j is not declared in that scope. – Laborer 18/4, 2017 at 19:52

I'm wondering why this is the accepted answer, the inner loop over k is accessing a by column, totally wrong from performance point of view. – Electrocute 9/2, 2018 at 18:18

The code is assuming a C-like language, where matrices are row-major. When accessing a matrix stored in row-major order using a[i,j] you should always make sure that j always changes faster than i if you want to maximize performance. – Unsaid 13/2, 2018 at 19:9

the second code snippet is wrong – Carbonado 10/11, 2021 at 12:49

@Cesar's answer is not correct. For example, the inner loop

for (int k = 0; k < n; k++)
   s += a[i,k] * Bcolj[k];

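walks along row i of a (k, the second index, changes fastest). That is a strided access only if the matrices are stored column-major.

The following code keeps the first index, i, in the innermost loop, which is the contiguous direction under column-major storage.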

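#include <cstddef>

// I, K and J are deduced from the array arguments.
template <std::size_t I, std::size_t K, std::size_t J>
void multiply(const double (&a)[I][K], 
              const double (&b)[K][J], 
              double (&c)[I][J]) 
{
    for (std::size_t j = 0; j < J; ++j) {
       // reset c[0..I-1][j]; can be skipped if the caller already zeroed c
       for (std::size_t i = 0; i < I; ++i) {
         c[i][j] = 0;
       } 

       for (std::size_t k = 0; k < K; ++k) {
          double t = b[k][j];  // constant over the inner loop
          // accumulate a[i][k] * b[k][j] into c[i][j] for every i
          for (std::size_t i = 0; i < I; ++i) {
            c[i][j] += a[i][k] * t;
          } 
       }
    }
}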
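
For comparison, built-in C++ arrays are row-major, so the loop order that keeps the innermost loop on the last index is i, k, j. Here is a minimal sketch under that assumption. The name multiply_ikj is hypothetical, and the function zeroes c itself rather than relying on the caller:

```cpp
#include <cstddef>

// Hypothetical i-k-j variant for row-major built-in arrays.
template <std::size_t I, std::size_t K, std::size_t J>
void multiply_ikj(const double (&a)[I][K],
                  const double (&b)[K][J],
                  double (&c)[I][J])
{
    for (std::size_t i = 0; i < I; ++i) {
        // Zero row i of c before accumulating into it.
        for (std::size_t j = 0; j < J; ++j)
            c[i][j] = 0.0;

        for (std::size_t k = 0; k < K; ++k) {
            const double t = a[i][k];  // constant over the inner loop
            // Row k of b and row i of c are both walked with stride 1.
            for (std::size_t j = 0; j < J; ++j)
                c[i][j] += t * b[k][j];
        }
    }
}
```

The result is the same as with the other loop orders. Only the order in which memory is touched changes.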

Samal answered 21/10, 2015 at 23:41 Comment(3)

Your code is wrong too. The reset of c[i][j] could be totally optional (it depends if the caller reset the matrix to zero). In addition the loop over k starts from 1 but it should starts from zero. – Electrocute 9/2, 2018 at 18:16

@Electrocute c[i][j] needs to reset, because the accumulation of "c[i][j] += a[i][k] * t;" needs an initial value. "k starts from 0" is correct. fixed. – Samal 9/2, 2018 at 22:23

Yes, I know but if the caller did a memset to zero for example, the loop is not needed. Add a comment in your code to clarify. – Electrocute 10/2, 2018 at 6:33
