I'm working on some code where I have several matrices and want to subtract a vector $v$ from each row of each matrix (and then do some other things with the result). As I'm using NumPy and want to 'vectorise' as much as possible, I thought I could reduce my running time by storing all the matrices as one large ('concatenated') matrix and subtracting $v$ from that. The issue is that my code runs slower after this supposed optimisation. In fact, in some scenarios breaking up the matrices and subtracting them separately is significantly faster (see the code example below).
Can you tell me what is causing this? Naively, I would assume that both approaches require the same number of elementary subtraction operations, and that the large matrix approach should be faster because it avoids looping over the separate matrices in pure Python.
Initially, I thought the slow-down might be due to allocating a larger matrix to store the result of the subtraction. To test this, I initialised a suitably sized output matrix outside my test function and passed it to np.subtract via its out argument. Then I thought that broadcasting might be causing the slow performance, so I manually broadcast the vector to the same shape as the large matrix and subtracted the resulting broadcast matrix. Neither attempt has made the large matrix approach competitive.
I've made the following MWE to showcase the issue.
Import NumPy and a timer:
import numpy as np
from timeit import default_timer as timer
Then I have some parameters that control the size and number of matrices.
n = 100 # width of matrix
m = 500 # height of matrix
k = 100 # number of matrices
M = 100 # upper bound on entries
reps = 100 # repetitions for timings
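To put these parameters in perspective, here is a quick back-of-the-envelope calculation (assuming the default 8-byte integer dtype on my machine): each matrix has m·n entries and the large matrix has k·m·n, so both approaches perform the same total number of element-wise subtractions.
entries_per_matrix = m * n    # 50,000 entries, about 0.4 MB at 8 bytes each
total_entries = k * m * n     # 5,000,000 entries, about 40 MB in total
print(entries_per_matrix, total_entries)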
We can generate a list of test matrices as follows. The large matrix is just the concatenation of all matrices in the list. The vector we subtract from the matrices is randomly generated.
list_of_matrices = [np.random.randint(0, M+1, size=(m,n)) for _ in range(k)]
large_matrix = np.row_stack(list_of_matrices)
vector = np.random.randint(0, M+1, size=n)
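As a quick sanity check (not part of any timing), the two approaches can be confirmed to produce exactly the same numbers:
# Stacking the per-matrix results should match the result on the large matrix.
separate = np.row_stack([A - vector for A in list_of_matrices])
combined = large_matrix - vector
assert np.array_equal(separate, combined)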
Here are the three functions I use to evaluate the speed of the subtraction. The first subtracts the vector from each matrix in the list, the second subtracts the vector from the (concatenated) large matrix, and the third is an attempt to speed up the latter approach by pre-initialising an output matrix and broadcasting the vector.
def list_compute(list_of_matrices, vector):
    # subtract the vector from each matrix separately, in a pure Python loop
    for j in range(k):
        np.subtract(list_of_matrices[j], vector)

def array_compute(large_matrix, vector):
    # subtract the vector from the concatenated matrix in a single call
    np.subtract(large_matrix, vector)

# pre-allocate the output and broadcast the vector once, outside the timed function
pre_allocated = np.empty(shape=large_matrix.shape)
vector_matrix = np.broadcast_to(vector, shape=large_matrix.shape)

def faster_array_compute(large_matrix, vector_matrix, out_matrix):
    # write the result into the pre-allocated output matrix
    np.subtract(large_matrix, vector_matrix, out=out_matrix)
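If I understand np.broadcast_to correctly, it returns a read-only view of the vector rather than a full copy, so the broadcast itself shouldn't cost any extra memory; this quick check seems to confirm it:
print(np.shares_memory(vector_matrix, vector))  # True: the view reuses the vector's memory
print(vector_matrix.strides)                    # first stride is 0, e.g. (0, 8) for 8-byte entries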
I benchmark the three functions by running
start = timer()
for _ in range(reps):
    list_compute(list_of_matrices, vector)
print(timer() - start)

start = timer()
for _ in range(reps):
    array_compute(large_matrix, vector)
print(timer() - start)

start = timer()
for _ in range(reps):
    faster_array_compute(large_matrix, vector_matrix, pre_allocated)
print(timer() - start)
For the above parameters, I get timings of
0.539432048798
1.12959504128
1.10976290703
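For completeness, the same three calls can also be timed with timeit.repeat (just an alternative way of measuring that takes the best of several runs; reps is the same as above):
import timeit

calls = ["list_compute(list_of_matrices, vector)",
         "array_compute(large_matrix, vector)",
         "faster_array_compute(large_matrix, vector_matrix, pre_allocated)"]
for call in calls:
    # best of 5 runs of `reps` calls each
    print(call, min(timeit.repeat(call, globals=globals(), number=reps, repeat=5)))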
Naively, I would expect the large matrix approach to be faster than, or at least competitive with, the several-matrices approach. I hope someone can give me some insight into why this is not the case and how I can speed up my code!