I learned a few things along the way while trying to find vectorized and faster ways to solve this.

1) First off, there is a dependency between the iterators at `for j in range(i)`. From my previous experience, especially with trying to solve such problems in MATLAB, it appeared that such a dependency could be taken care of with a lower triangular matrix, so `np.tril` should work there. Thus, a fully vectorized, though not so memory-efficient, solution (it creates an intermediate `(N,N)`-shaped array before finally reducing to an `(N,)`-shaped array) would be:
```python
import numpy as np

def fully_vectorized(a, b):
    # Entry (k, l) of the (N, N) intermediate is the full contraction of
    # a[:, :, k] with b[:, :, l]; tril(..., -1) keeps only the l < k terms.
    return np.tril(np.einsum('ijk,jil->kl', a, b), -1).sum(1)
```
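To see why the `np.tril(..., -1)` masking encodes the `j < i` dependency, here is a tiny self-contained check (the matrix `M` below is just for illustration): zeroing out the diagonal and everything above it, then summing along rows, reproduces the explicit `j < i` sums.

```python
import numpy as np

# M[i, j] stands for the pairwise term between indices i and j.
M = np.random.rand(5, 5)

d_loopy = np.array([M[i, :i].sum() for i in range(5)])  # explicit j < i sums
d_tril  = np.tril(M, -1).sum(1)                         # k=-1 excludes the diagonal

print(np.allclose(d_loopy, d_tril))  # True
```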
2) The next trick/idea was to keep one loop for the iterator `i` in `for i in range(N)`, but fold that dependency in with indexing and use `np.einsum` to perform all those multiplications and summations. The advantage would be memory efficiency. The implementation would look like this:
```python
def einsum_oneloop(a, b):
    N = a.shape[2]  # derive N from the input instead of relying on a global
    d = np.zeros(N)
    for i in range(N):
        d[i] = np.einsum('ij,jik->', a[:, :, i], b[:, :, np.arange(i)])
    return d
```
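As a side note (my own tweak, not part of the original set of solutions): since `np.arange(i)` selects a contiguous range, the same selection can be written as the basic slice `b[:,:,:i]`, which gives a view instead of the copy that fancy indexing makes:

```python
def einsum_oneloop_sliced(a, b):
    # Variant of einsum_oneloop: b[:, :, :i] is a view (basic slicing),
    # whereas b[:, :, np.arange(i)] copies the data via fancy indexing.
    N = a.shape[2]
    d = np.zeros(N)
    for i in range(N):
        d[i] = np.einsum('ij,jik->', a[:, :, i], b[:, :, :i])
    return d
```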
There are two more obvious ways to solve it. Starting from the original `all_loopy` solution, one could keep the outer two loops and use `np.einsum` or `np.tensordot` to perform those operations, thus removing the inner two loops, like so:
```python
def tensordot_twoloop(a, b):
    N = a.shape[2]
    d = np.zeros(N)
    for i in range(N):
        for j in range(i):
            # Contract a's (row, col) axes against b's (col, row) axes.
            d[i] += np.tensordot(a[:, :, i], b[:, :, j], axes=([1, 0], [0, 1]))
    return d

def einsum_twoloop(a, b):
    N = a.shape[2]
    d = np.zeros(N)
    for i in range(N):
        for j in range(i):
            d[i] += np.einsum('ij,ji->', a[:, :, i], b[:, :, j])
    return d
```
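For reference, `all_loopy` is the baseline from the question and is not reproduced in this answer; a minimal sketch consistent with the loop structure above (variable names and details are my assumption) would be:

```python
def all_loopy(a, b):
    # Assumed reconstruction of the question's baseline: four nested loops,
    # where the inner two accumulate sum over k, l of a[k, l, i] * b[l, k, j].
    m, n, N = a.shape
    d = np.zeros(N)
    for i in range(N):
        for j in range(i):
            for k in range(m):
                for l in range(n):
                    d[i] += a[k, l, i] * b[l, k, j]
    return d
```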
**Runtime test**

Let's compare all five approaches posted thus far to solve the problem, including the one posted in the question.
Case #1:

```
In [26]: # Input arrays with random elements
    ...: m,n,N = 20,20,20
    ...: a = np.random.rand(m,n,N)
    ...: b = np.random.rand(n,m,N)
    ...:

In [27]: %timeit all_loopy(a,b)
    ...: %timeit tensordot_twoloop(a,b)
    ...: %timeit einsum_twoloop(a,b)
    ...: %timeit einsum_oneloop(a,b)
    ...: %timeit fully_vectorized(a,b)
    ...:
10 loops, best of 3: 79.6 ms per loop
100 loops, best of 3: 4.97 ms per loop
1000 loops, best of 3: 1.66 ms per loop
1000 loops, best of 3: 585 µs per loop
1000 loops, best of 3: 684 µs per loop
```
Case #2:

```
In [28]: # Input arrays with random elements
    ...: m,n,N = 50,50,50
    ...: a = np.random.rand(m,n,N)
    ...: b = np.random.rand(n,m,N)
    ...:

In [29]: %timeit all_loopy(a,b)
    ...: %timeit tensordot_twoloop(a,b)
    ...: %timeit einsum_twoloop(a,b)
    ...: %timeit einsum_oneloop(a,b)
    ...: %timeit fully_vectorized(a,b)
    ...:
1 loops, best of 3: 3.1 s per loop
10 loops, best of 3: 54.1 ms per loop
10 loops, best of 3: 26.2 ms per loop
10 loops, best of 3: 27 ms per loop
10 loops, best of 3: 23.3 ms per loop
```
Case #3 (leaving out `all_loopy` for being very slow):

```
In [30]: # Input arrays with random elements
    ...: m,n,N = 100,100,100
    ...: a = np.random.rand(m,n,N)
    ...: b = np.random.rand(n,m,N)
    ...:

In [31]: %timeit tensordot_twoloop(a,b)
    ...: %timeit einsum_twoloop(a,b)
    ...: %timeit einsum_oneloop(a,b)
    ...: %timeit fully_vectorized(a,b)
    ...:
1 loops, best of 3: 1.08 s per loop
1 loops, best of 3: 744 ms per loop
1 loops, best of 3: 568 ms per loop
1 loops, best of 3: 866 ms per loop
```
Going by the numbers, `einsum_oneloop` looks pretty good to me, whereas `fully_vectorized` could be used when dealing with small to decent-sized arrays!
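As a quick sanity check (not part of the original timings), all five implementations can be verified to agree on small random inputs:

```python
# Hypothetical verification harness, assuming the functions defined above.
m, n, N = 10, 10, 10
a = np.random.rand(m, n, N)
b = np.random.rand(n, m, N)

ref = all_loopy(a, b)
for f in (tensordot_twoloop, einsum_twoloop, einsum_oneloop, fully_vectorized):
    assert np.allclose(ref, f(a, b)), f.__name__
```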