Fast 7x7 2D median filter in C and C++

I'm trying to convert the following code from MATLAB to C++:

function data = process(data)
    data = medfilt2(data, [7 7], 'symmetric');
    mask = fspecial('gaussian', [35 35], 12);
    data = imfilter(data, mask, 'replicate', 'same');
    maximum = max(data(:));
    data = 1 ./ (data/maximum);
    data(data > 10) = 16;
end

My problem is with medfilt2, which is a 2D median filter. I need it to support images with 10 bits per pixel and more.

  1. I have looked into OpenCV: it has a 5x5 median filter that supports 16 bits, but the 7x7 filter only supports 8-bit data.

    medianBlur

  2. I have also looked into Intel IPP, but I can see only a 1D median filter. https://software.intel.com/en-us/node/502283

Is there a fast implementation for a 2D filter?

I am looking for something like:

  1. Fast Median Search: An ANSI C Implementation using parallel programming and vectorized (AVX/SSE) operations...
  2. Two Dimensional Digital Signal Processing II. Transforms and median filters. Edited by T.S.Huang. Springer-Verlag. 1981.

There are more code examples in Fast median filtering with implementations in C/C++/C#/VB.NET/Delphi.

I also found Median Filtering in Constant Time.

Heteronym answered 23/2, 2016 at 11:25 Comment(16)
@SimonKraemer I don't care if it is OpenCV or something else, and there is no solution for that question...Heteronym
@Gilad: Could you explain why the linked solution does not apply here?Partly
That implementation is quite slow, but can be easily enhanced. Previously I recommended using an implementation based on histograms, but for 16-bit data it can be very slow.Shyamal
@Heteronym for each position, you can build an array of 49 elements according to your 7x7 window, and use IPP for the 1D case on that.Shyamal
@Partly I'm looking for a fast(!) implementation, not a trivial one; I can write a C++ version of a median filter myself. I'm trying to run it over a database of 2000 images and every image is 4K resolution.Heteronym
@Shai this is not a duplicate question, please remove the mark. I'm looking for a vectorized and parallel C++ solution, not an image-processing newbie solution. I can write that one on my own.Heteronym
Consider adding SSE and AVX tags to your question.Lumbye
Does it have to be a strictly correct median operation, or would something closely approximating it be acceptable ?Mistrial
@PaulR please feel free to suggest something close to it, but I prefer a 2D median, not an average.Heteronym
@Gilad: I was thinking about maybe a "median of medians" approach - do a 7 point median on each row and then a 7 point median on the resulting column. Intuitively I think this should give similar results to a proper 7x7 median, particularly for image data where adjacent pixels tend to be somewhat correlated. You could use a sorting network for the 7 point median - this is simple to implement and vectorizable (e.g. SIMD).Mistrial
Have a look at this: Weiss "Fast median and bilateral filtering", SIGGRAPH '06 dl.acm.org/citation.cfm?id=1141918Trigg
Possible duplicate of Fast Median Filter in C / C++ for `UINT16` 2D ArrayMonstrous
@mainactual, Are you aware of any implementation of Fast Median and Bilateral Filtering in C / C++?Kirk
@Kirk have not seen one, unfortunately. The ctmf.c (originally by Perreault), found in Gilad's link below, is pretty close when it comes to median blur. OpenCV's OpenCL bilateral blur instead gives a good O(R^2) performance on a decent GPU.Trigg
I'm after Median Filter as in Ben Weiss' method. It seems to be faster for small radii. How about OpenCV's Median? What's the trick behind it?Kirk
The Intel link is (effectively) broken. It redirects to a generic page.Astri

Motivated by the fact that OpenCV does not implement 16-bit median filter for large kernel sizes (larger than 5), I tried three different strategies.

All of them are based on Huang's [2] sliding window algorithm. That is, the histogram is updated by removing and inserting pixel entries as the window slides from left to right. This is quite straightforward for an 8-bit image and is already implemented in OpenCV. However, a large 65536-bin histogram makes the computation a bit difficult.

...The algorithm still remains O(log r), but storage considerations render it impractical for 16-bit images and impossible for floating-point images. [3]

I used the C++ standard library algorithms where applicable, and did not implement Weiss' additional optimization strategies.

1) A naive sorting implementation. I think this is the best starting point for an arbitrary pixel type (floats in particular).

// copy pixels in the sliding window to a temporary vector v
// (pre-sized to window.size()) and compute the median value
// (the window size is always odd)
memcpy( &v[0], &window[0], window.size() * sizeof(_Type) );
typename std::vector< _Type >::iterator it = v.begin() + v.size()/2;
std::nth_element( v.begin(), it, v.end() );
return *it;
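
For orientation, a minimal single-threaded sketch of how this per-pixel kernel can be wrapped into a full 7x7 filter (assuming a row-major uint16_t buffer, clamped borders similar to 'replicate', and the made-up name median7x7_naive; this is not the benchmarked code):

#include <algorithm>
#include <cstdint>
#include <vector>

void median7x7_naive( const uint16_t* src, uint16_t* dst, int width, int height )
{
    const int r = 3;                         // 7x7 kernel radius
    std::vector<uint16_t> v( 49 );           // scratch buffer, reused for every pixel
    for ( int y = 0; y < height; ++y )
    {
        for ( int x = 0; x < width; ++x )
        {
            int n = 0;
            for ( int dy = -r; dy <= r; ++dy )
            {
                int yy = std::min( std::max( y + dy, 0 ), height - 1 );   // clamp row
                for ( int dx = -r; dx <= r; ++dx )
                {
                    int xx = std::min( std::max( x + dx, 0 ), width - 1 ); // clamp column
                    v[n++] = src[yy * width + xx];
                }
            }
            std::nth_element( v.begin(), v.begin() + 24, v.end() );  // 25th of 49
            dst[y * width + x] = v[24];
        }
    }
}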

2) A sparse histogram. We wouldn't want to step over 65536 bins to find the median of each pixel, so how about storing a sparse histogram instead? Again, this is suitable for all pixel types, but it doesn't make sense if all pixels in the window are different (e.g. floats).

typedef std::map< _Type, int > Map;
//...
// inside the sliding window, update the histogram as follows
for ( /* pixels to remove */ )
{
    // _Type px (guaranteed to be present in the map, since it was
    // inserted when the window previously covered this pixel)
    Map::iterator it = map.find( px );
    if ( it->second > 1 )
        it->second -= 1;
    else
        map.erase( it );
}
// ...
for ( /* pixels to add */ )
{
    // _Type px
    Map::iterator lower = map.lower_bound( px );
    if ( lower != map.end() && lower->first == px )
        lower->second += 1;
    else
        map.insert( lower, std::pair<_Type,int>( px, 1 ) );
}
//... and compute the median by accumulating counts from one end
// until the appropriate sum is reached ..
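
To make that last comment concrete, a minimal sketch (here `windowSize` is just a stand-in for the number of pixels in the window, 49 for 7x7 and always odd; it is not defined in the snippet above):

// walk the map in key order, accumulating counts until the median
// position is reached
_Type median = _Type();
int sum = 0;
const int target = windowSize / 2 + 1;   // e.g. 25 out of 49
for ( typename Map::const_iterator it = map.begin(); it != map.end(); ++it )
{
    sum += it->second;
    if ( sum >= target )
    {
        median = it->first;   // first key whose cumulative count reaches the middle
        break;
    }
}
return median;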

3) A dense histogram. Instead of a single flat 65536-bin array, we make searching a little easier by adding coarser levels of sub-bins, e.g.:

[0...65535] <- px
[0...4095] <- px / 16
[0...255] <- px / 256
[0...15] <- px / 4096

This makes insertion a bit slower (by a constant amount), but searching a lot faster. I found 16 to be a good number.
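
A sketch of the idea (illustrative only, not the benchmarked code): each add/remove touches one counter per level, and the median search walks from the coarsest level down, scanning at most 16 bins per level.

#include <cstdint>

struct HierHist
{
    int fine[65536];   // bin = px
    int mid1[4096];    // bin = px / 16
    int mid2[256];     // bin = px / 256
    int coarse[16];    // bin = px / 4096
    int total;         // number of pixels currently in the window

    HierHist() : fine{}, mid1{}, mid2{}, coarse{}, total( 0 ) {}

    void add( uint16_t px, int delta = 1 )
    {
        fine[px] += delta;
        mid1[px >> 4] += delta;
        mid2[px >> 8] += delta;
        coarse[px >> 12] += delta;
        total += delta;
    }
    void remove( uint16_t px ) { add( px, -1 ); }

    // k-th smallest value in the histogram, 1-based
    uint16_t select( int k ) const
    {
        int c = 0;
        while ( k > coarse[c] ) k -= coarse[c++];   // locate the 4096-wide bin
        int m2 = c * 16;
        while ( k > mid2[m2] ) k -= mid2[m2++];     // then the 256-wide bin
        int m1 = m2 * 16;
        while ( k > mid1[m1] ) k -= mid1[m1++];     // then the 16-wide bin
        int f = m1 * 16;
        while ( k > fine[f] ) k -= fine[f++];       // finally the exact value
        return static_cast<uint16_t>( f );
    }
};

The median of the current window is then select( total / 2 + 1 ); the four arrays take roughly 270 KB and are reused for the whole image as the window slides.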

Figure: methods (1) red, (2) blue and (3) black tested against each other and against 8-bpp OpenCV (green). For all but OpenCV, the input image is 16-bpp grayscale. The dotted lines are for input truncated to the dynamic range [0, 255] and the smooth lines for input truncated to [0, 8020] (via multiplication by 16 and smoothing, to add more variance to the pixel values).

What is interesting is the divergence of the sparse histogram as the variance of the pixel values increases. nth_element is always a safe bet, OpenCV is the fastest (if 8 bpp is OK), and the dense histogram trails behind.

I used Windows 7, an 8 x 3.4 GHz machine and Visual Studio 10. My implementations were multithreaded; the OpenCV implementation is single-threaded. The input image size was 2136x3201 (https://i.sstatic.net/CsGpg.jpg, from Vogue).

[2]: Huang, T.: "Two-Dimensional Signal Processing II: Transforms and Median Filters", 1981

[3]: Weiss, B.: "Fast Median and Bilateral Filtering", 2006

Trigg answered 25/2, 2016 at 13:6 Comment(1)
By the way, in Huang's article he is talking about a binary tree for large histograms.Heteronym

I just implemented, in DIPlib, an efficient algorithm for computing the median filter (and the more generic percentile filter). This algorithm works for integer images of any bit depth as well as floating-point images, works for images of any number of dimensions, and works for kernels of any shape.

The algorithm is similar to the binary search tree implementation suggested by @mainactual in their answer to this question (as method #2), but uses a more appropriate order statistic tree. @mainactual's implementation needs O(n) to find the median in the search tree, for a tree with n nodes, because it iterates through half the nodes in the tree. This is only efficient if there are many fewer nodes than pixels in the kernel, which is typically only true for integer images with a small bit depth. In contrast, the order statistic tree can find the median value in O(log n), by storing an additional value in each node: the size of the subtree rooted at that node. The filter has a cost of O(k log k) for a compact 2D kernel with a height of k pixels (independent of the width).
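
To illustrate the select-by-rank idea (this is only a sketch, not the DIPlib code): each node stores its value, a multiplicity count, and the size of its subtree, and the k-th smallest value is found by descending from the root.

// Illustration only: a binary search tree node augmented with subtree
// sizes, and rank selection in O(tree height). With a balanced tree and
// one node per distinct value, this is the O(log n) median lookup
// described above. The value type (float here) is arbitrary.
struct Node
{
    float value;   // the pixel value stored in this node
    int   count;   // multiplicity of `value` in the current kernel
    int   size;    // number of pixels in this subtree (including count)
    Node* left;
    Node* right;
};

static inline int subtreeSize( const Node* n ) { return n ? n->size : 0; }

// Return the k-th smallest pixel value in the tree (1-based rank);
// k = kernelSize / 2 + 1 yields the median.
float selectKth( const Node* root, int k )
{
    const Node* n = root;
    while ( n )
    {
        int leftSize = subtreeSize( n->left );
        if ( k <= leftSize )
            n = n->left;                   // k-th value lies in the left subtree
        else if ( k <= leftSize + n->count )
            return n->value;               // k-th value equals this node's value
        else
        {
            k -= leftSize + n->count;      // skip the left subtree and this node
            n = n->right;
        }
    }
    return 0.0f;                           // unreachable for 1 <= k <= root->size
}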

I wrote down a more detailed description of the algorithm in my blog.

The C++ code is available on GitHub.

Here is a timing comparison for square kernels, comparing:

  • the new implementation in DIPlib (blue),
  • the naive implementation in scikit-image (which computes the median for each pixel's neighborhood independently, method #1 in @mainactual's answer, with a quadratic cost) (green), and
  • the O(1) implementation in OpenCV that only works for 8-bit images and square kernels (red).

"SFLOAT" stands for single-precision floating-point, "UINT8" stands for 8-bit unsigned integer, and "0-10" is also 8-bit unsigned integer, but containing only pixel values between 0 and 10 (this one tests what happens when there are many repeated values in each neighborhood).

Timing comparison for various implementations of the median filter available in Python.

The new implementation in DIPlib kicks in at k = 13; the lower part of the graph is the naive, quadratic-cost algorithm.

Monstrous answered 28/6, 2022 at 16:11 Comment(0)

I found this online. It is the same algorithm that OpenCV uses; however, it is extended to 16 bits and optimized with SSE.

medianFilter.c

Heteronym answered 12/3, 2016 at 21:10 Comment(0)

I happened to find (my) solution online as open source (image-quality-and-characterization-utilities, in include/Teisko/Image/Algorithm.hpp).

The algorithm finds the Kth element of any set of size M<=64 in N steps, where N is the number of bits in the elements.

This is a radix-2 algorithm: the original pixel data (e.g. int16_t data[7][7];) is transposed into N planes of uint64_t bits[N] (N = 10 for 10-bit images), with the MSB plane first.

    // Runs N iterations for pixel data as bit planes in `bits`
    // to recover the K(th) largest item (as set by initial threshold)
    // The parameter `mask` must be initialized to contain a set bit for all those bits
    // of interest in the corresponding pixel data bits[0..N-1].
    // `popcount` is a 64-bit population count helper (e.g. std::popcount or _mm_popcnt_u64).
    template <int N> inline
    uint64_t median8x8_iteration(uint64_t(&bits)[N], uint64_t mask, uint64_t threshold)
    {
        uint64_t result = 0;
        int i = 0;
        do
        {
            uint64_t ones = mask & bits[i];
            uint64_t ones_size = popcount(ones);
            uint64_t mask_size = popcount(mask);
            auto zero_size = mask_size - ones_size;
            int new_bit = 0;
            if (zero_size < threshold)
            {
                new_bit = 1;
                threshold -= zero_size;
                mask = 0;
            }
            result = result * 2 + new_bit;
            mask ^= ones;
        } while (++i < N);
        return result;
    }

Use threshold = 25 to get the median of 49 values, and mask = 0xfefefefefefefe00ull when the planes bits[] contain an 8x8 block of adjacent pixels (the mask selects the 49 bits of the 7x7 window, dropping one row and one column of the block).
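
A sketch of how the planes can be filled and the routine called for one 8x8 block; the row/column-to-bit mapping below is only an assumption chosen to match that mask, and median8x8_iteration plus the popcount helper are as above.

#include <cstdint>

uint64_t median7x7_of_block( const int16_t block[8][8] )   // 10-bit data in [0, 1023]
{
    const int N = 10;
    uint64_t bits[N] = { 0 };
    for ( int row = 0; row < 8; ++row )
        for ( int col = 0; col < 8; ++col )
        {
            uint64_t bitpos = uint64_t( row ) * 8 + col;   // assumed layout: bit = row*8 + col
            for ( int plane = 0; plane < N; ++plane )      // plane 0 holds the MSB
                if ( block[row][col] & ( 1 << ( N - 1 - plane ) ) )
                    bits[plane] |= uint64_t( 1 ) << bitpos;
        }
    // 0xfefefefefefefe00 keeps rows 1..7 and columns 1..7 of the block (49 pixels)
    // under this layout; threshold 25 selects the middle of the 49 values, i.e. the median.
    return median8x8_iteration( bits, 0xfefefefefefefe00ull, 25 );
}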

By toggling the MSB plane one can use the same inner loop for signed integers, and by conditionally toggling the MSB and the other planes one can use the algorithm for floating-point values as well.

Well after 2016, Ice Lake with AVX-512 brought _mm256_mask_popcnt_epi64 even to consumer machines, allowing the inner loop to be almost trivially vectorized for all four submatrices sharing the common 8x8 support; the masks would be 0xfefefefefefefe00ull >> {0, 1, 8, 9}.

The idea here is that the mask marks the set of pixels under inspection. Counting the number of ones (or zeros) in that set and comparing it to the threshold, we can determine at each step whether the Kth element belongs to the subset of ones or zeros, which also produces one correct output bit.

EDIT

Another method I tried was an SSE2 version where one keeps a window of Width * (7 + 1) entries in which every column is kept sorted:

original               sorted
1  2  1  3 ....        1  1  0
2  1  3  4 ....        2  2  1
5  2  0  1 ....  ->    5  2  3
.  .  .  . ....        .  .  .

Sorting 7 rows is efficiently done by a sorting network using 16 primitive sorting operations (32 instructions with 3-parameter VEX encoding + 14 instructions for memory access).
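
As an illustration of the vertical SIMD sort (using a simpler, 21-operation odd-even transposition network rather than the minimal 16-operation one): each __m128i holds 8 adjacent columns of one row, so one call sorts 8 columns at once using only SSE2 instructions.

#include <emmintrin.h>

// per-column compare-exchange of two rows of signed 16-bit pixels
static inline void cmpswap( __m128i& a, __m128i& b )
{
    __m128i lo = _mm_min_epi16( a, b );   // per-column minimum
    __m128i hi = _mm_max_epi16( a, b );   // per-column maximum
    a = lo;
    b = hi;
}

// Sorts r[0..6] so that every 16-bit lane is in ascending order down the rows
// (odd-even transposition: 7 rounds, 21 compare-exchanges, correct by construction).
static inline void sort7rows( __m128i r[7] )
{
    for ( int round = 0; round < 7; ++round )
    {
        if ( round & 1 )
        {
            cmpswap( r[1], r[2] ); cmpswap( r[3], r[4] ); cmpswap( r[5], r[6] );
        }
        else
        {
            cmpswap( r[0], r[1] ); cmpswap( r[2], r[3] ); cmpswap( r[4], r[5] );
        }
    }
}

The 16-operation network mentioned above does the same job with fewer compare-exchanges.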

One can also incrementally remove the element input[row-1][column] from a presorted SSE2 register and add the element input[row+7][column] to the register (which takes about 12 instructions per sorted column).

Having 7 sorted columns in 7 SSE2 registers, one can now implement a bitonic merge sort of three different widths, which at the first column will sort in groups of

 (r0),(r1,r2), ((r3,r4), (r5,r6))
      <merge>   <merge>  <merge>     // level #0
 <--merge--->  <---- merge* ---->    // level #1
 <----- merge + take middle ---->    // partial level #2

At column 1, one needs to sort columns

      (r1,r2), ((r3,r4),(r5,r6)), (r7)
      *******                     ****  <-- merge (r1,r2,r7)
      <----- merge + take middle ---->  <-- partial level #2

At column 2, with

          r2,  ((r3,r4),(r5,r6)), (r7,r8)
                                  <merge>   // level #0
          **                      *******   // level #1
          <----- merge + take middle ---->  // level #2

This takes advantage of memorizing partially sorted substrings (and does it better than e.g. a heap-based priority queue). The final merging of the 3x7 and 4x7 element substrings does not need to compute every element correctly, since we are only interested in element #24 (the median of the 49).

Still, while this was better than my implementations of heap-based priority queues and several hierarchical / flat histogram-based methods, the overall winner was the popcount method (with a 64-bit popcount instruction).

Kierkegaardian answered 19/6, 2022 at 7:53 Comment(0)
