Advice on vectorizing block-wise operations in Numpy
I am trying to implement a series of statistical operations, and I need help vectorizing my code.

The idea is to extract NxN patches from two images, then compute a distance metric between each pair of patches.

To do this, first I construct the patches with the following loops:

import numpy as np
from time import time

t0 = time()
params = []
for i in range(patch1.shape[0]):
    for j in range(patch1.shape[1]):
        window1 = np.copy(imga[i:i+N, j:j+N]).flatten()
        window2 = np.copy(imgb[i:i+N, j:j+N]).flatten()
        params.append((window1, window2))
print(f"We took {time() - t0:2.2f} seconds to prepare {len(params)/1e6} million patches.")

This takes about 10 seconds to complete, and I'm not overly concerned with the pre-processing time. The steps that follow are the ones I want to optimize.

After this, in an attempt to speed up processing, I used a multiprocessing Pool to compute the actual results. The function that contains the actual computation is as follows:

from numba import njit

@njit
def cauchy_schwartz(imga, imgb):
    # Normalised 10-bin histograms of the two patches
    p, _ = np.histogram(imga, bins=10)
    p = p / np.sum(p)
    q, _ = np.histogram(imgb, bins=10)
    q = q / np.sum(q)

    # Numerator and denominator of the divergence, then -log10 of their ratio
    n_d = np.sum(p * q)
    d_d = np.sum(np.power(p, 2) * np.power(q, 2))
    return -1.0 * np.log10(n_d / d_d)

I use this structure to process all the patches:

from multiprocessing import Pool
import tqdm

def f(param):
    return cauchy_schwartz(*param)

with Pool(4) as p:
    r = list(tqdm.tqdm(p.imap(f, params), total=len(params)))

I am sure there must be a more elegant way to do this: if I send the whole 10Kpx by 10Kpx images to the cauchy_schwartz function, it processes everything in under a second, but with my patch-wise approach, even on 4 cores, it takes a long time.

My mental model is how blockproc in MATLAB works, and I ended up writing this code in that pattern. I would appreciate any advice on improving its performance.

Prebendary asked 7/10, 2020 at 3:04 Comment(0)
By using apply_along_axis, you can get rid of the cauchy_schwartz function entirely. Since you are not overly concerned with the pre-processing time, assume you have obtained an array params that contains the flattened patches:

params = np.random.rand(3,2,100)

As you can see, the shape of params is (3, 2, 100). The three numbers 3, 2, and 100 are chosen arbitrarily to create an auxiliary array that demonstrates the logic of using apply_along_axis: 3 corresponds to the number of patches you have (determined by the patch shape and the image size), 2 corresponds to the two images, and 100 corresponds to the length of a flattened patch. The axes of params are therefore (index of patch, index of image, index of entry within a flattened patch), which exactly matches the list params created by your code:

params = []
for i in range(0,patch1.shape[0],1):
    for j in range(0,patch1.shape[1],1):
        window1 = np.copy(imga[i:i+N,j:j+N]).flatten()
        window2 = np.copy(imgb[i:i+N,j:j+N]).flatten()
        params.append((window1, window2))
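
In practice, and assuming every window really is N*N (i.e. you skip the clipped windows at the image border), that list of tuples can be stacked into the required 3-D array in a single call:

# Shape (num_patches, 2, N*N): patch index, image index, flattened pixels
params = np.asarray(params)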

With the auxiliary array params, here is my solution:

# Per-patch, per-image histograms along the last axis (10 bins, as in the
# original cauchy_schwartz), normalised so that each histogram sums to 1
hist = np.apply_along_axis(lambda x: np.histogram(x, bins=10)[0], 2, params)
hist = hist / np.sum(hist, axis=2)[..., None]

# Numerator and denominator for every patch, then -log10 of their ratio
n_d = np.sum(np.prod(hist, axis=1), axis=1)
d_d = np.sum(np.prod(np.power(hist, 2), axis=1), axis=1)
res = -1.0 * np.log10(n_d / d_d)
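
As a side note, and only as a sketch (it assumes NumPy >= 1.20 and keeps only the full N-by-N windows, unlike the clipped border windows your loop also produces), the same params array can be built without any Python loops:

from numpy.lib.stride_tricks import sliding_window_view

# All full N x N windows of each image, shape (H-N+1, W-N+1, N, N)
wins_a = sliding_window_view(imga, (N, N))
wins_b = sliding_window_view(imgb, (N, N))

# Flatten each window and stack the two images along axis 1 to get the
# (num_patches, 2, N*N) layout used above
wins_a = wins_a.reshape(-1, N * N)
wins_b = wins_b.reshape(-1, N * N)
params = np.stack([wins_a, wins_b], axis=1)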
Alkanet answered 9/10, 2020 at 8:22 Comment(3)
Thank you - this is exactly what I was looking for, and it also helps me learn the logic. I was just a little curious why you chose 3 as the first dimension for params in random.rand(3,2,100). Let me experiment and try to understand this a little, but if you could help explain, it would save me guessing. In your solution, is N (the window size) 3, the depth 2 (for the 2 patches), and 100 the samples/blocks? – Prebendary
@Prebendary You are welcome :) Glad my answer helps you; I have updated it to explain the axes of the array params. – Alkanet
Thank you for the update. I came here to say that with some trial and error I was able to understand it, but the update helps! I also learned a lot about writing good numpy code. Thank you! – Prebendary
First of all, profile your code to identify the bottleneck; you can use https://mg.pov.lt/profilehooks/. I think the bottleneck is in the creation of the patches, since you are creating a copy of each patch for the worker processes. You could use less memory by passing only the indices of the patches:

params = []
for i in range(patch1.shape[0]):
    for j in range(patch1.shape[1]):
        # (row range, column range) of each patch
        start, end = (i, i + N), (j, j + N)
        params.append((start, end))

Then, assuming imga and imgb are global, you can create the patches inside the cauchy_schwartz function, as shown below:

import numpy as np
from numba import njit

@njit
def cauchy_schwartz(start, end):
    # start = (row_start, row_stop), end = (col_start, col_stop)
    a, b = start
    c, d = end
    window1 = np.copy(imga[a:b, c:d]).flatten()
    window2 = np.copy(imgb[a:b, c:d]).flatten()

    # process patches window1 and window2
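
A minimal sketch of how this plugs into the multiprocessing setup from your question (it assumes imga, imgb and N are defined at module level and a fork-based start method, so the worker processes can see them):

from multiprocessing import Pool
import tqdm

def f(param):
    # param is a ((row_start, row_stop), (col_start, col_stop)) tuple
    return cauchy_schwartz(*param)

with Pool(4) as p:
    r = list(tqdm.tqdm(p.imap(f, params), total=len(params)))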
Ashton answered 7/10, 2020 at 8:14 Comment(6)
Thank you for the answer, but that is not the slow step. I'm able to Numba-compile the pre-processing code; it is insignificant compared to the slower histogram computation and patch-wise processing in the next steps. The np.copy lets me delete the original arrays from memory. I could change them to references, but it doesn't change the compute time much. – Prebendary
Maybe your code causes false sharing: false sharing occurs when threads on different processors modify variables that reside on the same cache line (source). – Ashton
Interesting! Do you have any suggestions on how to measure that, or steps I can take to mitigate it? – Prebendary
Maybe you could use different patch sizes (patch1.shape in your code). I hypothesize that a bigger patch size could lead to better performance (smaller execution time). – Ashton
Thanks for the suggestion. I think you are correct in that assessment. If I pass the whole image to the function, it computes in a second, but if I pass numerous small blocks, it takes a lot longer (~120 s). I am assuming this has to do with Python's internal data model. I am trying to think of ways of aggregating or structuring this problem so that I can use numba's vectorize or do the whole operation in numpy. Will keep this post updated with any findings! Thank you for taking the time to help! [1/2] – Prebendary
Following your suggestion I did try a couple of things: 1. If I don't np.copy, the hypothesis holds: bigger block sizes lead to better performance, even for overlapping blocks. 2. When I do np.copy, the runtime remains constant. [2/2] – Prebendary
