Using strides for an efficient moving average filter

Asked 8/2, 2011 at 18:5 Answered 9/2, 2011 at 16:12

Solved python image-processing filter numpy

I recently learned about strides in the answer to this post, and was wondering how I could use them to compute a moving average filter more efficiently than what I proposed in this post (using convolution filters).

This is what I have so far. It takes a view of the original array then rolls it by the necessary amount and sums the kernel values to compute the average. I am aware that the edges are not handled correctly, but I can take care of that afterward... Is there a better and faster way? The objective is to filter large floating point arrays up to 5000x5000 x 16 layers in size, a task that scipy.ndimage.filters.convolve is fairly slow at.

Note that I am looking for 8-neighbour connectivity, that is a 3x3 filter takes the average of 9 pixels (8 around the focal pixel) and assigns that value to the pixel in the new image.

import numpy, scipy

filtsize = 3
a = numpy.arange(100).reshape((10,10))
b = numpy.lib.stride_tricks.as_strided(a, shape=(a.size,filtsize), strides=(a.itemsize, a.itemsize))
for i in range(0, filtsize-1):
    if i > 0:
        b += numpy.roll(b, -(pow(filtsize,2)+1)*i, 0)
filtered = (numpy.sum(b, 1) / pow(filtsize,2)).reshape((a.shape[0],a.shape[1]))
scipy.misc.imsave("average.jpg", filtered)

EDIT Clarification on how I see this working:

Current code:

use stride_tricks to generate an array like [[0,1,2],[1,2,3],[2,3,4]...] which corresponds to the top row of the filter kernel.
Roll along the vertical axis to get the middle row of the kernel [[10,11,12],[11,12,13],[13,14,15]...] and add it to the array I got in 1)
Repeat to get the bottom row of the kernel [[20,21,22],[21,22,23],[22,23,24]...]. At this point, I take the sum of each row and divide it by the number of elements in the filter, giving me the average for each pixel, (shifted by 1 row and 1 col, and with some oddities around edges, but I can take care of that later).

What I was hoping for is a better use of stride_tricks to get the 9 values or the sum of the kernel elements directly, for the entire array, or that someone can convince me of another more efficient method...

Lupine answered 8/2, 2011 at 18:5 Comment(10)

I tried running your code, but got a memory corruption error. I'm running Python 2.6.6 and Numpy 1.3.0 on Ubuntu 10.10, 64-bit. The error looks like *** glibc detected *** python: double free or corruption (!prev): 0x0000000002526d30 ***. – Shuttlecock 8/2, 2011 at 18:55

Can I ask why you are using floats (I presume 64bit) to represent an image that can be (probably) more efficiently stored and calculated using ints? – Montemontefiascone 8/2, 2011 at 18:58

Your example is a 2D array, yet you describe your data as 3D. Are you doing this operation for each of 16 layers? – Montemontefiascone 8/2, 2011 at 19:0

@mtrw: I am on Python 2.6.6. and Numpy 1.4.1 on Windows XP SP3. I have no idea what that error means! – Lupine 8/2, 2011 at 19:0

@Paul: data is 3D (an image with 16 channels) but can be filtered as individual layers. I am using floats because the the values are radar backscatter amplitude values and truncating or rescaling is not an option. I will eventually need to use Float32. – Lupine 8/2, 2011 at 19:3

I'm having a hard time following the intent of your code. Could you throw in some comments? One thing that stands out is that you are modifying a. – Montemontefiascone 8/2, 2011 at 19:18

For what it's worth, the "corruption" error is arising because the shape and strides that you're passing in is incorrect. The shape and strides that you're giving assumes that you're passing in a 1D array. – Mook 8/2, 2011 at 19:27

To clarify, you can use an n-D array with with as_strided, as you are doing, it just doesn't care. It operates on a.flat anyway. – Montemontefiascone 8/2, 2011 at 19:36

See my update. I'm wondering where the speedups are compared to a standard approach. Are you expecting to get significantly better than ~2 sec with a 5k by 5k array? – Montemontefiascone 8/2, 2011 at 23:38

It appears that scipy.ndimage.uniform_filter is the function to beat, especially with large arrays (5000x5000) and larger kernel sizes (try 11x11). Thanks to everyone for their explanations of strides and suggestions of better options. I got more out of this question than I expected! Now I have to figure out who gets the accepted answer :( – Lupine 9/2, 2011 at 16:29

For what it's worth, here's how you'd do it using "fancy" striding tricks. I was going to post this yesterday, but got distracted by actual work! :)

@Paul & @eat both have nice implementations using various other ways of doing this. Just to continue things from the earlier question, I figured I'd post the N-dimensional equivalent.

You're not going to be able to significantly beat scipy.ndimage functions for >1D arrays, however. (scipy.ndimage.uniform_filter should beat scipy.ndimage.convolve, though)

Moreover, if you're trying to get a multidimensional moving window, you risk having memory usage blow up whenever you inadvertently make a copy of your array. While the initial "rolling" array is just a view into the memory of your original array, any intermediate steps that copy the array will make a copy that is orders of magnitude larger than your original array (i.e. Let's say that you're working with a 100x100 original array... The view into it (for a filter size of (3,3)) will be 98x98x3x3 but use the same memory as the original. However, any copies will use the amount of memory that a full 98x98x3x3 array would!!)

Basically, using crazy striding tricks is great for when you want to vectorize moving window operations on a single axis of an ndarray. It makes it really easy to calculate things like a moving standard deviation, etc with very little overhead. When you want to start doing this along multiple axes, it's possible, but you're usually better off with more specialized functions. (Such as scipy.ndimage, etc)

At any rate, here's how you do it:

import numpy as np

def rolling_window_lastaxis(a, window):
    """Directly taken from Erik Rigtorp's post to numpy-discussion.
    <http://www.mail-archive.com/[email protected]/msg29450.html>"""
    if window < 1:
       raise ValueError, "`window` must be at least 1."
    if window > a.shape[-1]:
       raise ValueError, "`window` is too long."
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def rolling_window(a, window):
    if not hasattr(window, '__iter__'):
        return rolling_window_lastaxis(a, window)
    for i, win in enumerate(window):
        if win > 1:
            a = a.swapaxes(i, -1)
            a = rolling_window_lastaxis(a, win)
            a = a.swapaxes(-2, i)
    return a

filtsize = (3, 3)
a = np.zeros((10,10), dtype=np.float)
a[5:7,5] = 1

b = rolling_window(a, filtsize)
blurred = b.mean(axis=-1).mean(axis=-1)

So what we get when we do b = rolling_window(a, filtsize) is an 8x8x3x3 array, that's actually a view into the same memory as the original 10x10 array. We could have just as easily used different filter size along different axes or operated only along selected axes of an N-dimensional array (i.e. filtsize = (0,3,0,3) on a 4-dimensional array would give us a 6 dimensional view).

We can then apply an arbitrary function to the last axis repeatedly to effectively calculate things in a moving window.

However, because we're storing temporary arrays that are much bigger than our original array on each step of mean (or std or whatever), this is not at all memory efficient! It's also not going to be terribly fast, either.

The equivalent for ndimage is just:

blurred = scipy.ndimage.uniform_filter(a, filtsize, output=a)

This will handle a variety of boundary conditions, do the "blurring" in-place without requiring a temporary copy of the array, and be very fast. Striding tricks are a good way to apply a function to a moving window along one axis, but they're not a good way to do it along multiple axes, usually....

Just my $0.02, at any rate...

Mook answered 9/2, 2011 at 16:12 Comment(4)

Very well put:

Striding tricks are a good way to apply a function to a moving window along one axis, but they're not a good way to do it along multiple axes, usually....

. And of course your explanation of the memory 'blow up' is important one. Kind of summary from your answer (at least for me) is: 'don't go too far fishing, the quarenteed catch is allready in scipy'. Thanks – Averil 9/2, 2011 at 16:37

Thanks, Joe, for this answer. In rolling_window should the if not hasattr(...): be returning rolling_window_lastaxis(...) rather than rolling_window? – Nonfiction 12/2, 2011 at 16:47

@Nonfiction - Quite right! That was a typo on my part... (I renamed the functions and forgot to change that part of it.) Thanks! – Mook 12/2, 2011 at 16:59

Is it possible to specify the step size? – Thriller 30/3, 2013 at 18:39

I'm not familiar enough with Python to write out code for that, but the two best ways to speed up convolutions is to either separate the filter or to use the Fourier transform.

Separated filter : Convolution is O(M*N), where M and N are number of pixels in the image and the filter, respectively. Since average filtering with a 3-by-3 kernel is equivalent to filtering first with a 3-by-1 kernel and then a 1-by-3 kernel, you can get (3+3)/(3*3) = ~30% speed improvement by consecutive convolution with two 1-d kernels (this obviously gets better as the kernel gets larger). You may still be able to use stride tricks here, of course.

Fourier Transform : conv(A,B) is equivalent to ifft(fft(A)*fft(B)), i.e. a convolution in direct space becomes a multiplication in Fourier space, where A is your image and B is your filter. Since the (element-wise) multiplication of the Fourier transforms requires that A and B are the same size, B is an array of size(A) with your kernel at the very center of the image and zeros everywhere else. To place a 3-by-3 kernel at the center of an array, you may have to pad A to odd size. Depending on your implementation of the Fourier transform, this can be a lot faster than the convolution (and if you apply the same filter multiple times, you can pre-compute fft(B), saving another 30% of computation time).

Dispend answered 9/2, 2011 at 15:27 Comment(8)

For what it's worth, in python, these are implemented in scipy.ndimage.uniform_filter and scipy.signal.fftconvolve, respectively. – Mook 9/2, 2011 at 15:44

@Jonas: Cool! The seperated filter approach works nicely, as you say it saves more time as the kernel size increases. For a 5000x5000 array, at an 11x11 kernel size, I am getting 7.7s for 2d convolution using ndimage.convolve, and 2.0s for two 1d convolutions using ndimage.convolve1d. For your second solution what is B? – Lupine 9/2, 2011 at 16:2

@Benjamin: I have expanded my explanation of the second solution – Dispend 9/2, 2011 at 16:9

@Joe Kington: Thanks! If I understand the help correctly, fftconvolve doesn't allow you to precompute fft(B), right? – Dispend 9/2, 2011 at 16:10

@Dispend and Joe: What it's worth, but Paul's function is just few % slower than scipy.ndimage.uniform_filter (which I belive is the limit to reasonable expect to reach). Now, I think this discussion is more on how to tweak stride tricks to reach that limit (or at least how I interpeted it, but it just may be aswell more on the generic filtering side). Thanks – Averil 9/2, 2011 at 16:13

@Lupine - uniform_filter already does repeated 1D convolutions, for what it's worth. @Dispend - No, it doesn't... It's just a convince function for doing it once. – Mook 9/2, 2011 at 16:15

@eat: I was mainly responding to the "...or that someone can convince me of another, more efficient method" bit of the question. – Dispend 9/2, 2011 at 16:17

@Averil - Yeah, @Paul's function is quite slick! However, the various ndimage functions let you handle the boundary conditions much more flexibly, and can do it without making a temporary copy, so they're both more flexible and more memory efficient. Nonetheless, @Paul's answer is pretty neat! – Mook 9/2, 2011 at 16:17

Lets see:

It's not so clear form your question, but I'm assuming now that you'll like to improve significantly this kind of averaging.

import numpy as np
from numpy.lib import stride_tricks as st

def mf(A, k_shape= (3, 3)):
    m= A.shape[0]- 2
    n= A.shape[1]- 2
    strides= A.strides+ A.strides
    new_shape= (m, n, k_shape[0], k_shape[1])
    A= st.as_strided(A, shape= new_shape, strides= strides)
    return np.sum(np.sum(A, -1), -1)/ np.prod(k_shape)

if __name__ == '__main__':
    A= np.arange(100).reshape((10, 10))
    print mf(A)

Now, what kind of performance improvements you would actually expect?

Update:
First of all, a warning: the code in it's current state does not adapt properly to the 'kernel' shape. However that's not my primary concern right now (anyway the idea is there allready how to adapt properly).

I have just chosen the new shape of a 4D A intuitively, for me it really make sense to think about a 2D 'kernel' center to be centered to each grid position of original 2D A.

But that 4D shaping may not actually be the 'best' one. I think the real problem here is the performance of summing. One should to be able to find 'best order' (of the 4D A) inorder to fully utilize your machines cache architecture. However that order may not be the same for 'small' arrays which kind of 'co-operates' with your machines cache and those larger ones, which don't (at least not so straightforward manner).

Update 2:
Here is a slightly modified version of mf. Clearly it's better to reshape to a 3D array first and then instead of summing just do dot product (this has the advantage all so, that kernel can be arbitrary). However it's still some 3x slower (on my machine) than Pauls updated function.

def mf(A):
    k_shape= (3, 3)
    k= np.prod(k_shape)
    m= A.shape[0]- 2
    n= A.shape[1]- 2
    strides= A.strides* 2
    new_shape= (m, n)+ k_shape
    A= st.as_strided(A, shape= new_shape, strides= strides)
    w= np.ones(k)/ k
    return np.dot(A.reshape((m, n, -1)), w)

Averil answered 8/2, 2011 at 19:33 Comment(7)

@eat: this is interesting. I see you've taken care of my edge issues, although your filter size is hardcoded ;). In your as_strided line, should that be n, m rather than n, n? – Lupine 8/2, 2011 at 19:48

@eat: I modified your code a little and it works nicely. I can't wrap my head around what is happening though. Could you describe what that as_strided line is doing and why you are picking those values of shape and strides? – Lupine 8/2, 2011 at 19:52

@eat: shouldn't one of those n's be an m? – Montemontefiascone 8/2, 2011 at 20:2

@Bejamin, @Paul: yes some typos, just changed those in my answer. Anyway I'll think the true hot potatoe here is: can we expect a significant improvement from this implementation? I'll try to clarify my answer more later. Thanks – Averil 8/2, 2011 at 20:17

@eat: Seems to be having issues with larger arrays (like 5000x5000), even at a 3x3 kernel size... – Lupine 8/2, 2011 at 21:23

@Benjamin: I actually changed my implementation just 1 min ago. Care to retime? Anyway it now have the advantage that any kind of kernels can be used. (It still operates on 3x3 kernels as an example, but that's trivial to fix). However, there may not exist any stride tricks to outperform scipy.ndimage... Thanks – Averil 9/2, 2011 at 15:24

Anything for steps bigger than one? – Paisa 16/4, 2019 at 13:6

One thing I am confident needs to be fixed is your view array b.

It has a few items from unallocated memory, so you'll get crashes.

Given your new description of your algorithm, the first thing that needs fixing is the fact that you are striding outside the allocation of a:

bshape = (a.size-filtsize+1, filtsize)
bstrides = (a.itemsize, a.itemsize)
b = numpy.lib.stride_tricks.as_strided(a, shape=bshape, strides=bstrides)

Update

Because I'm still not quite grasping the method and there seems to be simpler ways to solve the problem, I'm just going to put this here:

A = numpy.arange(100).reshape((10,10))

shifts = [(-1,-1),(-1,0),(-1,1),(0,-1),(0,1),(1,-1),(1,0),(1,1)]
B = A[1:-1, 1:-1].copy()
for dx,dy in shifts:
    xstop = -1+dx or None
    ystop = -1+dy or None
    B += A[1+dx:xstop, 1+dy:ystop]
B /= 9

...which just seems like the straightforward approach. The only extraneous operation is that it has allocate and populate B only once. All the addition, division and indexing has to be done regardless. If you are doing 16 bands, you still only need to allocate B once if your intent is to save an image. Even if this is no help, it might clarify why I don't understand the problem, or at least serve as a benchmark to time the speedups of other methods. This runs in 2.6 sec on my laptop on a 5k x 5k array of float64's, 0.5 of which is the creation of B

Montemontefiascone answered 8/2, 2011 at 19:31 Comment(3)

I just timed and your method is some 10x faster than mine. The ratio seem to be quite constant (i.e. it's not depending on input size). A hasty conclusion would be that stride_tricks are usefull tricks sometime, but they won't necessary give any performance boost? Alltough there may exists some other tricks than mine to perform much better. Thanks – Averil 9/2, 2011 at 10:18

@eat: For a 5000x5000 array and a 3x3 filter, I time @eat's result at 3.9s, @Paul's result at 1.9s and scipy.ndimage.filters.convolve at 1.4s. At that array size, the strides solution does not work at larger kernel sizes. I'm going to upgrade @Paul's solution to accept variable kernel sizes and compare. But it still seems that scipy.ndimage.filters.convolve is the fastest solution... – Lupine 9/2, 2011 at 14:52

How do you handle NaN values in the 2D array? – Hydrocephalus 2/8, 2019 at 18:2

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags