I'm working on some code where I have several matrices and want to subtract a vector $v$ from each row of each matrix (and then do some other things with the result). As I'm using NumPy and want to 'vectorise' as much as possible, I thought I could reduce my running time by storing all the matrices as one large ('concatenated') matrix and subtracting $v$ from that. The issue is that my code runs slower after this supposed optimisation. In fact, in some scenarios breaking up the matrices and subtracting them separately is significantly faster (see the code example below).
Can you tell me what is causing this? Naively, I would assume that both approaches require the same number of elementary subtraction operations, and that the large matrix approach should be faster because it avoids looping over the separate matrices in pure Python.
Initially, I thought the slow-down might be due to allocating a larger matrix to store the result of the subtraction. To test this, I initialised a suitably sized output matrix outside my test function and passed it to np.subtract via its out argument. Then I thought that broadcasting might be causing the slow performance, so I manually broadcast the vector to the same shape as the large matrix and subtracted the resulting broadcast matrix. Neither attempt has made the large matrix approach competitive.
I've made the following MWE to showcase the issue.
Import NumPy and a timer:
import numpy as np
from timeit import default_timer as timer
Then I have some parameters that control the size and number of matrices.
n = 100 # width of matrix
m = 500 # height of matrix
k = 100 # number of matrices
M = 100 # upper bound on entries
reps = 100 # repetitions for timings
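To put these parameters in perspective, here is a quick back-of-the-envelope calculation (assuming the default 8-byte integer dtype on my machine): each matrix has m·n entries and the large matrix has k·m·n, so both approaches perform the same total number of element-wise subtractions.
entries_per_matrix = m * n    # 50,000 entries, about 0.4 MB at 8 bytes each
total_entries = k * m * n     # 5,000,000 entries, about 40 MB in total
print(entries_per_matrix, total_entries)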
We can generate a list of test matrices as follows. The large matrix is just the concatenation of all matrices in the list. The vector we subtract from the matrices is randomly generated.
list_of_matrices = [np.random.randint(0, M+1, size=(m,n)) for _ in range(k)]
large_matrix = np.row_stack(list_of_matrices)
vector = np.random.randint(0, M+1, size=n)
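As a quick sanity check (not part of any timing), the two approaches can be confirmed to produce exactly the same numbers:
# Stacking the per-matrix results should match the result on the large matrix.
separate = np.row_stack([A - vector for A in list_of_matrices])
combined = large_matrix - vector
assert np.array_equal(separate, combined)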
Here are the three functions I use to evaluate the speed of the subtraction. The first subtracts the vector from each matrix in the list, the second subtracts the vector from the (concatenated) large matrix, and the third is an attempt to speed up the latter approach by pre-initialising an output matrix and broadcasting the vector.
def list_compute(list_of_matrices, vector):
    # subtract the vector from each matrix separately, in a pure Python loop
    for j in range(k):
        np.subtract(list_of_matrices[j], vector)

def array_compute(large_matrix, vector):
    # subtract the vector from the concatenated matrix in a single call
    np.subtract(large_matrix, vector)

# pre-allocate the output and broadcast the vector once, outside the timed function
pre_allocated = np.empty(shape=large_matrix.shape)
vector_matrix = np.broadcast_to(vector, shape=large_matrix.shape)

def faster_array_compute(large_matrix, vector_matrix, out_matrix):
    # write the result into the pre-allocated output matrix
    np.subtract(large_matrix, vector_matrix, out=out_matrix)
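If I understand np.broadcast_to correctly, it returns a read-only view of the vector rather than a full copy, so the broadcast itself shouldn't cost any extra memory; this quick check seems to confirm it:
print(np.shares_memory(vector_matrix, vector))  # True: the view reuses the vector's memory
print(vector_matrix.strides)                    # first stride is 0, e.g. (0, 8) for 8-byte entries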
I benchmark the three functions by running
start = timer()
for _ in range(reps):
    list_compute(list_of_matrices, vector)
print(timer() - start)

start = timer()
for _ in range(reps):
    array_compute(large_matrix, vector)
print(timer() - start)

start = timer()
for _ in range(reps):
    faster_array_compute(large_matrix, vector_matrix, pre_allocated)
print(timer() - start)
For the above parameters, I get timings of
0.539432048798
1.12959504128
1.10976290703
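For completeness, the same three calls can also be timed with timeit.repeat (just an alternative way of measuring that takes the best of several runs; reps is the same as above):
import timeit

calls = ["list_compute(list_of_matrices, vector)",
         "array_compute(large_matrix, vector)",
         "faster_array_compute(large_matrix, vector_matrix, pre_allocated)"]
for call in calls:
    # best of 5 runs of `reps` calls each
    print(call, min(timeit.repeat(call, globals=globals(), number=reps, repeat=5)))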
Naively, I would expect the large matrix approach to be faster than, or at least competitive with, the several-matrices approach. I hope someone can give me some insight into why this is not the case and how I can speed up my code!