Fast transposition of an image and Sobel Filter optimization in C (SIMD)

I want to implement a really (really) fast Sobel operator for a ray-tracer a friend of mine and I wrote (sources can be found here). What follows is what I have figured out so far...

First, let's assume the image is a grey-scale picture stored row by row in an array of 8-bit unsigned integers.

To write a real Sobel filter, I need to compute Gx and Gy for each pixel. Each of these numbers is computed from the 6 pixels around the origin. But SIMD instructions allow me to deal with 16 or even 32 (AVX) pixels at once. Fortunately, the kernel of the operator has a nice property (it is separable), so I can compute Gy by:

  • subtracting row i from row i+2 and storing the result in row i+1 of a scratch picture (array);
  • adding column i, twice column i+1 and column i+2 of that scratch picture, which gives column i+1 of the final picture.

I would do the same (but transposed) to compute Gx, then add the two pictures. A scalar sketch of this two-pass scheme is given below.
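Here is a minimal scalar sketch of that two-pass scheme for Gy (sign convention and border handling chosen arbitrarily); each inner loop is the part that maps onto 16-wide SIMD later:

#include <stdint.h>

/* Two-pass (separable) Gy, assuming a row-major w x h greyscale image `src`
 * and an int16_t scratch buffer `tmp` of the same size. */
static void sobel_gy_two_pass(const uint8_t *src, int16_t *tmp,
                              int16_t *gy, int w, int h)
{
    /* Pass 1: vertical difference of rows y-1 and y+1, stored in row y. */
    for (int y = 1; y < h - 1; ++y)
        for (int x = 0; x < w; ++x)
            tmp[y * w + x] = (int16_t)src[(y + 1) * w + x]
                           - (int16_t)src[(y - 1) * w + x];

    /* Pass 2: horizontal [1 2 1] smoothing of the scratch rows gives Gy. */
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x)
            gy[y * w + x] = tmp[y * w + (x - 1)]
                          + 2 * tmp[y * w + x]
                          + tmp[y * w + (x + 1)];
}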

Some notes:

  • I don't care about memory allocation since everything will be allocated at the beginning.
  • I can deal with the overflow and sign problem by first dividing the values by four (a per-byte right shift; note that SSE has no _mm_srli_epi8, so this needs a small workaround, sketched after this list):
    (uint8_t >> 2) - (uint8_t >> 2) = int7_t   // really stored as int8_t
    int7_t + ((uint8_t << 1) >> 2) + int7_t = uint8_t
    // some precision is lost but I don't care
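A minimal sketch of that per-byte >>2 in SSE (hypothetical helper; SSE only has 16-bit and wider shifts, so shift 16-bit lanes and mask off the bits that leak across byte boundaries):

#include <emmintrin.h>

// Divide 16 packed unsigned bytes by 4.
static inline __m128i bytes_srl2(__m128i v)
{
    return _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi8(0x3F));
}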

The real problem I'm facing is going from rows to columns, since I can't load a column of the picture into a SIMD register any other way. I must flip the image at least three times, mustn't I?

Once for the original picture; then I can compute the first step for Gx and Gy, and then flip the resulting pictures to compute the second step.

So, here are my questions:

  • Is this kind of implementation a good idea ?
  • Is there a way to transpose an array faster than the dumb algorithm ? (I don't think so)
  • Where will the bottlenecks be? (any guess? :P)
Branham answered 13/8, 2013 at 19:13 Comment(11)
This thread What is the fastest way to transpose a matrix in C++? has some good material in it and you may find it useful, most of it is applicable to C.Grumble
Thank you. Of course I cannot afford to "change my point of view" since I must load these data into the SIMD registers. But OpenMP... I will read this further.Branham
This. Is. Great. SSE by block. I didn't know the _MM_TRANSPOSE4_PS which is just a bunch of shuffles. Thanks again !Branham
I looked into something similar with a Gaussian Filter (also for a ray tracer). I did the transpose three times like you say. I use _MM_TRANSPOSE4_PS along with loop blocking as described in https://mcmap.net/q/15727/-what-is-the-fastest-way-to-transpose-a-matrix-in-c. In the end the transpose was still the bottleneck unless the kernel size was large.Capri
This may be helpful software.intel.com/en-us/articles/…Capri
To answer one of your questions: it is possible to do a faster transpose than the dumb algorithm. Try transpose_block_SSE4x4. I did not get any improvement however using AVX. You can see an example using AVX here: Fast memory transpose with SSE, AVX, and OpenMP.Capri
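(For reference, a minimal sketch in the spirit of such a 4x4-blocked transpose, using the _MM_TRANSPOSE4_PS macro and assuming a row-major float matrix whose width and height are multiples of 4; this is not the exact transpose_block_SSE4x4 from the linked answer.)

#include <xmmintrin.h>

void transpose_block_4x4(const float *src, float *dst, int w, int h)
{
    for (int i = 0; i < h; i += 4) {
        for (int j = 0; j < w; j += 4) {
            // Load a 4x4 tile, transpose it in registers, store it transposed.
            __m128 r0 = _mm_loadu_ps(&src[(i + 0) * w + j]);
            __m128 r1 = _mm_loadu_ps(&src[(i + 1) * w + j]);
            __m128 r2 = _mm_loadu_ps(&src[(i + 2) * w + j]);
            __m128 r3 = _mm_loadu_ps(&src[(i + 3) * w + j]);
            _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
            _mm_storeu_ps(&dst[(j + 0) * h + i], r0);
            _mm_storeu_ps(&dst[(j + 1) * h + i], r1);
            _mm_storeu_ps(&dst[(j + 2) * h + i], r2);
            _mm_storeu_ps(&dst[(j + 3) * h + i], r3);
        }
    }
}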
Glad to read it's a classic way to compute convolutions. Since the kernel is small (2 then 3), I think the transpositions will clearly be the bottlenecks. Maybe I could use SIMD only for the first step... I will benchmark various solutions. I am curious: what percentage of peak hardware performance did you get?Branham
Small correction, I think I only needed two transposes. In terms of performance you can estimate the flops for the filter and calculate the time to get the flops/s. Compare this to the peak flops of your processor. I think I was getting 10-20% of the peak. For SGEMM I got over 70% eventually. I don't know what a reasonable efficiency for convolution filters is, but I know that for SGEMM processors can get over 80% efficiency. The Gaussian convolution is separable and I don't think Sobel is, so that's one important difference.Capri
Just noticed your kernel size. Yes, a size of 2 or 3 is very small. The transpose is going to be the bottleneck for that. For 1000x1000 images I think I needed kernel sizes on the order of 50 or so for the convolution time to dominate.Capri
You needed 2 transpositions because you had only one kernel for the Gaussian blur, didn't you? I think I have a nicer idea. Rather than transposing the picture I will load 16 pixels horizontally and then do 2 shuffles to perform the horizontal sum... and try to deal with the annoying side effects. I hate this f. Android keyboard trying to make French from English. I will work on it tomorrow...Branham
There are several ways you can do it. You probably only care about good enough. But keep in mind you can calculate the flops. It should be something like 6*width*height*kernel_width*kernel_width (not sure that's correct; 6 = 3 (RGB) * 2 (mul+add)). Then you can compare to the peak flops/s of your processor. That will tell you how good your algorithm is. The nice thing about the transpose is that it's independent of the kernel_width, so it has a fixed cost, and the larger the kernel_width gets, the less it affects the total calculation time.Capri

I think transpose/2-pass is not good for optimizing Sobel operator code. The Sobel operator is not computationally heavy, so wasting memory accesses on the transpose/2-pass approach is not good in this case. I wrote some Sobel operator test code to see how fast SSE can get. This code does not handle the first and last edge pixels, and uses the FPU to calculate the sqrt() value.

The Sobel operator needs 14 multiplies, 1 square root, 11 additions, 2 min/max, 12 read accesses and 1 write access per component. This means you can process a component in 20~30 cycles if you optimize the code well.

The FPUSobel() function below took 2113044 CPU cycles to process a 256*256 image, i.e. 32.76 cycles/component. I'll convert this sample code to SSE.

void FPUSobel()
{
    BYTE* image_0 = g_image + g_image_width * 0;
    BYTE* image_1 = g_image + g_image_width * 1;
    BYTE* image_2 = g_image + g_image_width * 2;
    DWORD* screen = g_screen + g_screen_width*1;

    for(int y=1; y<g_image_height-1; ++y)
    {
        for(int x=1; x<g_image_width-1; ++x)
        {
            float gx =  image_0[x-1] * (+1.0f) + 
                        image_0[x+1] * (-1.0f) +
                        image_1[x-1] * (+2.0f) + 
                        image_1[x+1] * (-2.0f) +
                        image_2[x-1] * (+1.0f) + 
                        image_2[x+1] * (-1.0f);

            float gy =  image_0[x-1] * (+1.0f) + 
                        image_0[x+0] * (+2.0f) + 
                        image_0[x+1] * (+1.0f) +
                        image_2[x-1] * (-1.0f) + 
                        image_2[x+0] * (-2.0f) + 
                        image_2[x+1] * (-1.0f);


            int result = (int)min(255.0f, max(0.0f, sqrtf(gx * gx + gy * gy)));

            screen[x] = 0x01010101 * result;
        }
        image_0 += g_image_width;
        image_1 += g_image_width;
        image_2 += g_image_width;
        screen += g_screen_width;
    }
}

The SseSobel() function took 613220 CPU cycles to process the same 256*256 image. That is 9.51 cycles/component, 3.4 times faster than FPUSobel(). There is still some room to optimize, but it will never be more than 4 times faster because it uses 4-way SIMD.

This function uses the SoA approach to process 4 pixels at once. SoA is better than AoS for most array or image data because with AoS you have to transpose/shuffle the data first. And SoA makes it far easier to turn plain C code into SSE code.

void SseSobel()
{
    BYTE* image_0 = g_image + g_image_width * 0;
    BYTE* image_1 = g_image + g_image_width * 1;
    BYTE* image_2 = g_image + g_image_width * 2;
    DWORD* screen = g_screen + g_screen_width*1;

    __m128 const_p_one = _mm_set1_ps(+1.0f);
    __m128 const_p_two = _mm_set1_ps(+2.0f);
    __m128 const_n_one = _mm_set1_ps(-1.0f);
    __m128 const_n_two = _mm_set1_ps(-2.0f);

    for(int y=1; y<g_image_height-1; ++y)
    {
        for(int x=1; x<g_image_width-1; x+=4)
        {
            // load 16 components. (0~6 will be used)
            __m128i current_0 = _mm_unpacklo_epi8(_mm_loadu_si128((__m128i*)(image_0+x-1)), _mm_setzero_si128());
            __m128i current_1 = _mm_unpacklo_epi8(_mm_loadu_si128((__m128i*)(image_1+x-1)), _mm_setzero_si128());
            __m128i current_2 = _mm_unpacklo_epi8(_mm_loadu_si128((__m128i*)(image_2+x-1)), _mm_setzero_si128());

            // image_00 = { image_0[x-1], image_0[x+0], image_0[x+1], image_0[x+2] }
            __m128 image_00 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(current_0, _mm_setzero_si128()));
            // image_01 = { image_0[x+0], image_0[x+1], image_0[x+2], image_0[x+3] }
            __m128 image_01 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_0, 2), _mm_setzero_si128()));
            // image_02 = { image_0[x+1], image_0[x+2], image_0[x+3], image_0[x+4] }
            __m128 image_02 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_0, 4), _mm_setzero_si128()));
            __m128 image_10 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(current_1, _mm_setzero_si128()));
            __m128 image_12 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_1, 4), _mm_setzero_si128()));
            __m128 image_20 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(current_2, _mm_setzero_si128()));
            __m128 image_21 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_2, 2), _mm_setzero_si128()));
            __m128 image_22 = _mm_cvtepi32_ps(_mm_unpacklo_epi16(_mm_srli_si128(current_2, 4), _mm_setzero_si128()));

            __m128 gx = _mm_add_ps( _mm_mul_ps(image_00,const_p_one),
                        _mm_add_ps( _mm_mul_ps(image_02,const_n_one),
                        _mm_add_ps( _mm_mul_ps(image_10,const_p_two),
                        _mm_add_ps( _mm_mul_ps(image_12,const_n_two),
                        _mm_add_ps( _mm_mul_ps(image_20,const_p_one),
                                    _mm_mul_ps(image_22,const_n_one))))));

            __m128 gy = _mm_add_ps( _mm_mul_ps(image_00,const_p_one), 
                        _mm_add_ps( _mm_mul_ps(image_01,const_p_two), 
                        _mm_add_ps( _mm_mul_ps(image_02,const_p_one),
                        _mm_add_ps( _mm_mul_ps(image_20,const_n_one), 
                        _mm_add_ps( _mm_mul_ps(image_21,const_n_two), 
                                    _mm_mul_ps(image_22,const_n_one))))));

            __m128 result = _mm_min_ps( _mm_set1_ps(255.0f), 
                            _mm_max_ps( _mm_set1_ps(0.0f), 
                                        _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(gx, gx), _mm_mul_ps(gy,gy))) ));

            __m128i pack_32 = _mm_cvtps_epi32(result); //R32,G32,B32,A32
            __m128i pack_16 = _mm_packs_epi32(pack_32, pack_32); //R16,G16,B16,A16,R16,G16,B16,A16
            __m128i pack_8 = _mm_packus_epi16(pack_16, pack_16); //RGBA,RGBA,RGBA,RGBA
            __m128i unpack_2 = _mm_unpacklo_epi8(pack_8, pack_8); //RRGG,BBAA,RRGG,BBAA
            __m128i unpack_4 = _mm_unpacklo_epi8(unpack_2, unpack_2); //RRRR,GGGG,BBBB,AAAA

            _mm_storeu_si128((__m128i*)(screen+x),unpack_4);
        }
        image_0 += g_image_width;
        image_1 += g_image_width;
        image_2 += g_image_width;
        screen += g_screen_width;
    }
}
Crispate answered 15/8, 2013 at 9:47 Comment(2)
Thank you, but either way I will write my own (and then benchmark it against yours). In fact, I am not looking for a perfect Sobel operator: I think I will compute the L1 norm instead of the Euclidean norm and process 16 pixels (8-bit pixels) at the same time using two pictures. I plan to reduce the RGBA picture using two _mm_avg_epu8 and then apply the 8-bit Sobel filter.Branham
Surely you can omit the max() there, as the value should never be less than 0?Canadianism
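(A minimal sketch of the L1-norm variant mentioned in the comments, as a float/SSE alternative to the sqrt/min/max block in SseSobel() above; the absolute value is taken by clearing the sign bit, and no max(0, ...) is needed because the result is already non-negative.)

#include <emmintrin.h>

// |gx| + |gy|, clamped to 255, for 4 pixels at once.
static inline __m128 sobel_magnitude_l1(__m128 gx, __m128 gy)
{
    const __m128 abs_mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
    __m128 mag = _mm_add_ps(_mm_and_ps(gx, abs_mask), _mm_and_ps(gy, abs_mask));
    return _mm_min_ps(_mm_set1_ps(255.0f), mag);
}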

For the code in @zupet's answer:
Instead of multiplying by one (const_p_one), I would do .... nothing. Compilers might not optimize that away.
Instead of multiplying by two, I would add the value to itself; that is faster than a multiply with integer arithmetic, and with FP it mostly just avoids needing another vector constant. Haswell has worse FP add throughput than FP mul, but Skylake and Zen are balanced.

Instead of multiplying by -1.0, negate with _mm_xor_ps with -0.0 to flip the sign bit.

I would compute the positive and negative terms independently, side by side in parallel rather than one after the other (for better pipelining), with the same arithmetic, and subtract only at the end. Etc., etc. ... there are still a lot of improvements pending.

With AVX+FMA available, _mm_fmadd_ps can be much faster.
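(A minimal sketch of what the gx/gy math from SseSobel() could look like with those suggestions applied, i.e. no multiplies by +/-1 or +/-2 at all; sobel_gx_gy is a hypothetical helper taking the nine already-unpacked neighbourhood vectors.)

#include <xmmintrin.h>

static inline void sobel_gx_gy(__m128 i00, __m128 i01, __m128 i02,
                               __m128 i10,              __m128 i12,
                               __m128 i20, __m128 i21, __m128 i22,
                               __m128 *gx, __m128 *gy)
{
    // Gx = (i00 - i02) + 2*(i10 - i12) + (i20 - i22), using add-to-self for the *2.
    __m128 d0 = _mm_sub_ps(i00, i02);
    __m128 d1 = _mm_sub_ps(i10, i12);
    __m128 d2 = _mm_sub_ps(i20, i22);
    *gx = _mm_add_ps(_mm_add_ps(d0, d2), _mm_add_ps(d1, d1));

    // Gy = (i00 + 2*i01 + i02) - (i20 + 2*i21 + i22): one subtract at the very end.
    __m128 t0 = _mm_add_ps(_mm_add_ps(i00, i02), _mm_add_ps(i01, i01));
    __m128 t2 = _mm_add_ps(_mm_add_ps(i20, i22), _mm_add_ps(i21, i21));
    *gy = _mm_sub_ps(t0, t2);
}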

Ironhanded answered 7/11, 2020 at 10:8 Comment(0)
