Optimizing image resizing (nearest-neighbor method) with SIMD

I know that the 'Nearest' method of image resizing is the fastest one. Nevertheless, I am looking for a way to speed it up further. An obvious first step is to precalculate the indices:

#include <algorithm> // std::min, std::max
#include <cmath>     // std::floor

void CalcIndex(int sizeS, int sizeD, int colors, int* idx)
{
    float scale = (float)sizeS / sizeD;
    for (int i = 0; i < sizeD; ++i)
    {
        int index = (int)std::floor((i + 0.5f) * scale);
        idx[i] = std::min(std::max(index, 0), sizeS - 1) * colors; // clamp to the valid source range
    }
}

template<int colors> inline void CopyPixel(const uint8_t* src, uint8_t* dst)
{
    for (int i = 0; i < colors; ++i)
        dst[i] = src[i];
}

template<int colors> void Resize(const uint8_t* src, int srcW, int srcH, 
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW]; // pre-calculated indices (see CalcIndex; idxY is built with colors = 1, idxX with the actual channel count).
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t * srcY = src + idxY[dy] * srcW * colors;
        for (int dx = 0, offset = 0; dx < dstW; dx++, offset += colors)
            CopyPixel<colors>(srcY + idxX[dx], dst + offset);
        dst += dstW * colors;
    }
}

Are there further optimization steps, for example using SIMD or some other optimization technique?

P.S. I am especially interested in optimizing RGB (colors = 3). With the current code I see that an ARGB image (colors = 4) is processed about 50% faster than RGB, despite being about 30% bigger.
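A likely explanation (an assumption, not something measured here): with colors = 4 the compiler can merge the four byte copies in CopyPixel into a single 32-bit load and store, while with colors = 3 it has to issue three separate byte moves per pixel, so the per-pixel cost is lower even though the image is larger. A sketch of an explicit specialization of the CopyPixel template above that guarantees this behaviour:

#include <cstring> // std::memcpy
#include <cstdint>

template<> inline void CopyPixel<4>(const uint8_t* src, uint8_t* dst)
{
    uint32_t px;
    std::memcpy(&px, src, 4); // a fixed-size memcpy compiles to a single 32-bit load
    std::memcpy(dst, &px, 4); // ...and a single 32-bit store
}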

Pelagian answered 6/12, 2021 at 11:6 Comment(5)
If you're really concerned about performance, I'd look into doing this hardware-accelerated, e.g. using DirectX or Vulkan.Quoit
@Savage The 'Nearest' method is fast. I fear that copying the image memory to the GPU and back will be slower than doing all the calculations on the CPU.Pelagian
You can look at how opencv does it: github.com/opencv/opencv/blob/master/modules/imgproc/src/… github.com/opencv/opencv/blob/master/modules/imgproc/src/… For some formats, opencv also either calls an IPP function or an implementation without explicit SIMD code. At the very least the opencv/IPP implementations should give you a good baseline to benchmark against if you really want to try to write it yourself.Regelate
Where do you calculate idxY and idxX? It is important to know where these come from and how they are sorted to best help you.Ineffaceable
@CrisLuengo I added function with index calculation.Pelagian

I think that using _mm256_i32gather_epi32 (AVX2) can give some performance gain for resizing in the case of 32-bit pixels:

inline void Gather32bit(const uint8_t * src, const int* idx, uint8_t* dst)
{
    __m256i _idx = _mm256_loadu_si256((__m256i*)idx);
    __m256i val = _mm256_i32gather_epi32((int*)src, _idx, 1);
    _mm256_storeu_si256((__m256i*)dst, val);
}

template<> void Resize<4>(const uint8_t* src, int srcW, int srcH, 
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW];//pre-calculated indices.
    int dstW8 = dstW & ~(8 - 1); // dstW rounded down to a multiple of 8
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t * srcY = src + idxY[dy] * srcW * 4;
        int dx = 0, offset = 0;
        for (; dx < dstW8; dx += 8, offset += 8*4)
            Gather32bit(srcY, idxX + dx,dst + offset);
        for (; dx < dstW; dx++, offset += 4)
            CopyPixel<4>(srcY + idxX[dx], dst + offset);
        dst += dstW * 4;
    }
}

P.S. After some modification this method can be applied to RGB24:

const __m256i K8_SHUFFLE = _mm256_setr_epi8(
    0x0, 0x1, 0x2, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA, 0xC, 0xD, 0xE, -1, -1, -1, -1,
    0x0, 0x1, 0x2, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA, 0xC, 0xD, 0xE, -1, -1, -1, -1);
const __m256i K32_PERMUTE = _mm256_setr_epi32(0x0, 0x1, 0x2, 0x4, 0x5, 0x6, -1, -1);


inline void Gather24bit(const uint8_t * src, const int* idx, uint8_t* dst)
{
    __m256i _idx = _mm256_loadu_si256((__m256i*)idx);
    __m256i bgrx = _mm256_i32gather_epi32((int*)src, _idx, 1);
    __m256i bgr = _mm256_permutevar8x32_epi32(
        _mm256_shuffle_epi8(bgrx, K8_SHUFFLE), K32_PERMUTE);
    _mm256_storeu_si256((__m256i*)dst, bgr);
}

template<> void Resize<3>(const uint8_t* src, int srcW, int srcH, 
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW];//pre-calculated indices.
    int dstW8 = dstW & ~(8 - 1); // dstW rounded down to a multiple of 8
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t * srcY = src + idxY[dy] * srcW * 3;
        int dx = 0, offset = 0;
        for (; dx < dstW8; dx += 8, offset += 8*3)
            Gather24bit(srcY, idxX + dx,dst + offset);
        for (; dx < dstW; dx++, offset += 3)
            CopyPixel<3>(srcY + idxX[dx], dst + offset);
        dst += dstW * 3;
    }
}

Note that if srcW < dstW then the method of @Aki-Suihkonen is faster.

Timber answered 6/12, 2021 at 12:26 Comment(3)
I'd expect _mm256_i32gather_epi32 would only help if resizing to such a small size that you're only horizontally keeping 1 of every 4 source pixels or less, for RGBA. (So a normal non-gather vector load would still only get 1 useful pixel). Or maybe with fast gathers (like Skylake) it could save enough shuffle work to be worth it. Zen2/3 has slower gathers and IIRC better shuffle throughput.Lenient
@PeterCordes I performed measurements on Skylake. Using _mm256_i32gather_epi32 for RGB gives a performance gain of 2.0-2.5x (compared to scalar code) in the case srcW > dstW.Timber
Ok, fair enough, but probably only optimal if there are less than 2 pixels on average in contiguous 32-byte chunks of source data. I guess doing gather loads lets you do one contiguous store and only bottleneck on load throughput (2/clock) instead of store (1/clock on SKL), as well as reduce loop overhead vs. scalar, so I'm not shocked there's a speedup vs. scalar. But for example a 2:1 downscale where you take every other pixel from every other row should be even faster than that with a specialized implementation for that ratio. Maybe even with Soonts' non-specialized shuffle generator.Lenient
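To illustrate the specialized fixed-ratio case mentioned above, here is a minimal 2:1 RGBA downscale sketch (not code from this thread; it keeps the even pixel of each pair, while the exact pixel a nearest-neighbor resize picks depends on the rounding convention):

#include <immintrin.h>
#include <string.h>
#include <stdint.h>

// 2:1 nearest downscale of an RGBA image: every other pixel from every other row.
// 8 source pixels -> 4 destination pixels per iteration, one cross-lane permute, no gathers.
void Downscale2xRGBA(const uint8_t* src, int srcW, int srcH, uint8_t* dst)
{
    const int dstW = srcW / 2, dstH = srcH / 2;
    const __m256i K32_EVEN = _mm256_setr_epi32(0, 2, 4, 6, 0, 0, 0, 0); // keep 32-bit pixels 0,2,4,6
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t* srcY = src + (size_t)(2 * dy) * srcW * 4; // every other row
        int dx = 0;
        for (; dx + 4 <= dstW; dx += 4)
        {
            __m256i pixels = _mm256_loadu_si256((const __m256i*)(srcY + (size_t)(2 * dx) * 4));
            __m256i even = _mm256_permutevar8x32_epi32(pixels, K32_EVEN);
            _mm_storeu_si128((__m128i*)(dst + (size_t)dx * 4), _mm256_castsi256_si128(even));
        }
        for (; dx < dstW; dx++) // scalar tail
            memcpy(dst + (size_t)dx * 4, srcY + (size_t)(2 * dx) * 4, 4);
        dst += (size_t)dstW * 4;
    }
}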

The speed problem in (SIMD-based) resize algorithms comes from the mismatch of indexing input and output elements. When e.g. the resize factor is 6/5, one needs to consume 6 pixels and write 5. OTOH SIMD register width of 16 bytes maps to either 16 grayscale elements, 4 RGBA-elements or 5.33 RGB-elements.

My experience is that sufficiently good performance (maybe not optimal, but often beating OpenCV and other freely available implementations) comes from trying to write 2-4 SIMD registers' worth of data at a time, reading the required number of linear bytes from the input (plus some), and using pshufb (x86 SSSE3) or vtbl (NEON) to gather-load from registers -- never from memory. One of course needs a fast mechanism to either calculate the LUT indices inline or to precalculate the indices, which are shared between different output rows.

One should prepare to have several inner kernels depending on the input/output ratio of the (horizontal) resolution.

RGBrgbRGBrgbRGBr|gbRGBrgb ....  <- input
                         ^ where to load next 32 bytes of input
RGBRGBrgbRGBrgbr|gbRGBrgbRGBRGBrg| <- 32 output bytes, from 

0000000000000000|0000001111111111| <- high bit of index
0120123456789ab9|abcdef0123423456| <- low 4 bits of index

Note that the LUT method can handle any channel count:

// inner kernel for downsampling between 1x and almost 2x*
// - we need to read max 32 elements and write 16
void process_row_ds(uint8_t const *input, uint8_t const *indices,
                 int const *advances, uint8_t *output, int out_width) {
    do {
       auto a = load16_bytes(input);
       auto b = load16_bytes(input + 16);
       auto c = load16_bytes(indices);
       a = lut32(a,b,c);      // get 16 bytes out of 32
       store16_bytes(output, a);
       output += 16;
       input += *advances++;
    } while (out_width--);  // multiples of 16...
}

// inner kernel for upsampling between 1x and inf
void process_row_us(uint8_t const *input, uint8_t const *indices,
                 int const *advances, uint8_t *output, int out_width) {
    do {
       auto a = load16_bytes(input);
       auto c = load16_bytes(indices);
       a = lut16(a, c);      // get 16 bytes out of 16
       store16_bytes(output, a);
       output += 16;
       input += *advances++;
    } while (out_width--);
}    
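For concreteness, the helper names in the kernels above could map to x86 intrinsics roughly as follows (a sketch; lut32 as written uses _mm_blendv_epi8, which needs SSE4.1 rather than plain SSSE3 -- an and/andnot/or sequence works there instead):

#include <stdint.h>
#include <smmintrin.h> // SSSE3 pshufb + SSE4.1 blendv

static inline __m128i load16_bytes(const uint8_t* p)    { return _mm_loadu_si128((const __m128i*)p); }
static inline void store16_bytes(uint8_t* p, __m128i v) { _mm_storeu_si128((__m128i*)p, v); }

// 16-entry byte LUT: pick 16 bytes out of one 16-byte register.
static inline __m128i lut16(__m128i a, __m128i idx)
{
    return _mm_shuffle_epi8(a, idx);
}

// 32-entry byte LUT: pick 16 bytes out of the pair (a, b) using indices 0..31.
// pshufb only looks at the low 4 bits of each index, so both shuffles can use
// the same index vector; the correct half is selected afterwards.
static inline __m128i lut32(__m128i a, __m128i b, __m128i idx)
{
    __m128i fromA = _mm_shuffle_epi8(a, idx);
    __m128i fromB = _mm_shuffle_epi8(b, idx);
    __m128i useB  = _mm_cmpgt_epi8(idx, _mm_set1_epi8(15)); // 0xFF where the index addresses b
    return _mm_blendv_epi8(fromA, fromB, useB);
}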

I would also encourage using some elementary filtering for downsampling, such as Gaussian binomial kernels (1 1; 1 2 1; 1 3 3 1; 1 4 6 4 1; ...) along with hierarchical downsampling, in addition to (at least) bilinear interpolation. It's of course possible that the application will tolerate aliasing artifacts -- the cost AFAIK is often not that large, especially given that the algorithm is otherwise memory bound.
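For illustration, the cheapest of those kernels -- a [1 2 1]/4 horizontal smoothing of 8-bit samples before decimation -- can be approximated with two averaging instructions per 16 bytes (a sketch that assumes the caller handles the first and last colors bytes of each row; neighbors are taken at +/- colors bytes so that same-channel samples are averaged, and the rounding differs slightly from an exact [1 2 1]/4):

#include <stdint.h>
#include <emmintrin.h> // SSE2

// dst[i] ~= (src[i - colors] + 2*src[i] + src[i + colors]) / 4 for the interior of a row.
static void Smooth121Row(const uint8_t* src, uint8_t* dst, int rowBytes, int colors)
{
    int i = colors;
    for (; i + 16 <= rowBytes - colors; i += 16)
    {
        __m128i l = _mm_loadu_si128((const __m128i*)(src + i - colors));
        __m128i c = _mm_loadu_si128((const __m128i*)(src + i));
        __m128i r = _mm_loadu_si128((const __m128i*)(src + i + colors));
        // avg(avg(l, r), c) = (l + r + 2*c) / 4 up to rounding
        _mm_storeu_si128((__m128i*)(dst + i), _mm_avg_epu8(_mm_avg_epu8(l, r), c));
    }
    for (; i < rowBytes - colors; i++) // scalar tail
        dst[i] = (uint8_t)((src[i - colors] + 2 * src[i] + src[i + colors] + 2) / 4);
}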

Raman answered 6/12, 2021 at 14:35 Comment(0)

It’s possible to use SIMD, and I’m pretty sure it will help; unfortunately, it’s relatively hard. Below is a simplified example which only supports image enlargement, not shrinking.

Still, I hope it might be useful as a starting point.

Both MSVC and GCC compile the hot loop in the LineResize::apply method into 11 instructions. I think 11 instructions per 16 bytes should be faster than your version.

#include <stdint.h>
#include <emmintrin.h>
#include <tmmintrin.h>
#include <vector>
#include <array>
#include <assert.h>
#include <stdio.h>

// Implements nearest neighbor resize method for RGB24 or BGR24 bitmaps
class LineResize
{
    // Each mask produces up to 16 output bytes.
    // For enlargement exactly 16, for shrinking up to 16, possibly even 0.
    std::vector<__m128i> masks;

    // Length is the same as masks.
    // For enlargement, the values contain source pointer offsets in bytes.
    // For shrinking, the values contain destination pointer offsets in bytes.
    std::vector<uint8_t> offsets;

    // True if this class will enlarge images, false if it will shrink the width of the images.
    bool enlargement;

    void resizeFields( size_t vectors )
    {
        masks.resize( vectors, _mm_set1_epi32( -1 ) );
        offsets.resize( vectors, 0 );
    }

public:

    // Compile the shuffle table. The arguments are line widths in pixels.
    LineResize( size_t source, size_t dest );

    // Apply the algorithm to a single line of the image.
    void apply( uint8_t* rdi, const uint8_t* rsi ) const;
};

LineResize::LineResize( size_t source, size_t dest )
{
    const size_t sourceBytes = source * 3;
    const size_t destBytes = dest * 3;
    assert( sourceBytes >= 16 );
    assert( destBytes >= 16 );

    // Possible to do much faster without any integer divides.
    // Optimizing this sample for simplicity.
    if( sourceBytes < destBytes )
    {
        // Enlarging the image, each SIMD vector consumes <16 input bytes, produces exactly 16 output bytes
        enlargement = true;
        resizeFields( ( destBytes + 15 ) / 16 );

        int8_t* pMasks = (int8_t*)masks.data();
        uint8_t* const pOffsets = offsets.data();

        int sourceOffset = 0;
        const size_t countVectors = masks.size();
        for( size_t i = 0; i < countVectors; i++ )
        {
            const int destSlice = (int)i * 16;
            std::array<int, 16> lanes;
            int lane;
            for( lane = 0; lane < 16; lane++ )
            {
                const int destByte = destSlice + lane;  // output byte index
                const int destPixel = destByte / 3; // output pixel index
                const int channel = destByte % 3;   // output byte within pixel
                const int sourcePixel = destPixel * (int)source / (int)dest; // input pixel
                const int sourceByte = sourcePixel * 3 + channel;   // input byte

                if( destByte < (int)destBytes )
                    lanes[ lane ] = sourceByte;
                else
                {
                    // Destination offset out of range, i.e. the last SIMD vector
                    break;
                }
            }

            // Produce the offset
            if( i == 0 )
                assert( lanes[ 0 ] == 0 );
            else
            {
                const int off = lanes[ 0 ] - sourceOffset;
                assert( off >= 0 && off <= 16 );
                pOffsets[ i - 1 ] = (uint8_t)off;
                sourceOffset = lanes[ 0 ];
            }

            // Produce the masks
            for( int j = 0; j < lane; j++ )
                pMasks[ j ] = (int8_t)( lanes[ j ] - sourceOffset );
            // The masks are initialized with _mm_set1_epi32( -1 ) = all bits set,
            // no need to handle remainder for the last vector.
            pMasks += 16;
        }
    }
    else
    {
        // Shrinking the image, each SIMD vector consumes 16 input bytes, produces <16 output bytes
        enlargement = false;
        resizeFields( ( sourceBytes + 15 ) / 16 );

        // Not implemented, but the same idea works fine for this too.
        // The only difference, instead of using offsets bytes for source offsets, use it for destination offsets.
        assert( false );
    }
}

void LineResize::apply( uint8_t * rdi, const uint8_t * rsi ) const
{
    const __m128i* pm = masks.data();
    const __m128i* const pmEnd = pm + masks.size();
    const uint8_t* po = offsets.data();
    __m128i mask, source;

    if( enlargement )
    {
        // One iteration of the loop produces 16 output bytes
        // In MSVC results in 11 instructions for 16 output bytes.
        while( pm < pmEnd )
        {
            mask = _mm_load_si128( pm );
            pm++;

            source = _mm_loadu_si128( ( const __m128i * )( rsi ) );
            rsi += *po;
            po++;

            _mm_storeu_si128( ( __m128i * )rdi, _mm_shuffle_epi8( source, mask ) );
            rdi += 16;
        }
    }
    else
    {
        // One iteration of the loop consumes 16 input bytes
        while( pm < pmEnd )
        {
            mask = _mm_load_si128( pm );
            pm++;

            source = _mm_loadu_si128( ( const __m128i * )( rsi ) );
            rsi += 16;

            _mm_storeu_si128( ( __m128i * )rdi, _mm_shuffle_epi8( source, mask ) );
            rdi += *po;
            po++;
        }
    }
}

// Utility method to print RGB pixel values from the vector
static void printPixels( const std::vector<uint8_t>&vec )
{
    assert( !vec.empty() );
    assert( 0 == ( vec.size() % 3 ) );

    const uint8_t* rsi = vec.data();
    const uint8_t* const rsiEnd = rsi + vec.size();
    while( rsi < rsiEnd )
    {
        const uint32_t r = rsi[ 0 ];
        const uint32_t g = rsi[ 1 ];
        const uint32_t b = rsi[ 2 ];
        rsi += 3;
        const uint32_t res = ( r << 16 ) | ( g << 8 ) | b;
        printf( "%06X ", res );
    }
    printf( "\n" );
}

// A trivial test to resize 24 pixels -> 32 pixels
int main()
{
    constexpr int sourceLength = 24;
    constexpr int destLength = 32;

    // Initialize sample input with 24 RGB pixels
    std::vector<uint8_t> input( sourceLength * 3 );
    for( size_t i = 0; i < input.size(); i++ )
        input[ i ] = (uint8_t)i;

    printf( "Input: " );
    printPixels( input );

    // The special handling of the last pixels of the last line is missing from this example.
    static_assert( 0 == destLength % 16 );
    LineResize resizer( sourceLength, destLength );

    std::vector<uint8_t> result( destLength * 3 );
    resizer.apply( result.data(), input.data() );

    printf( "Output: " );
    printPixels( result );
    return 0;
}

The code ignores alignment issues. For production, you’d want another method for the last line of the image which doesn’t run all the way to the end and instead handles the last few pixels with scalar code.

The code contains more memory references in the hot loop. However, the two vectors in that class are not too long; for 4K images their combined size is about 12 KB, which should fit in the L1D cache and stay there.
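For example, assuming 4K means a 3840-pixel-wide output row: destBytes = 3840 * 3 = 11520, so masks holds 11520 / 16 = 720 vectors (11520 bytes) and offsets another 720 bytes, roughly 12 KB in total.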

If you have AVX2, it will probably improve things further. For enlarging images, use _mm256_inserti128_si256; the vinserti128 instruction can load 16 bytes from memory into the high half of the vector. Similarly, for downsizing images, use _mm256_extracti128_si256; the instruction has an option to store its 128-bit result directly to memory.
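As a sketch of that AVX2 idea for the enlargement branch of apply (an assumption-level example, needing <immintrin.h>; it supposes masks and offsets are built in pairs exactly as above and masks.size() is even, with an odd trailing vector falling back to the SSE loop): two consecutive masks are loaded as one 256-bit vector, the two source windows are combined with vinserti128 (whose second operand can come straight from memory), and _mm256_shuffle_epi8 shuffles within each 128-bit lane, so each iteration writes 32 output bytes.

while( pm + 2 <= pmEnd )
{
    const __m256i mask = _mm256_loadu_si256( ( const __m256i * )pm ); // two 16-byte masks at once
    pm += 2;

    const __m128i lo = _mm_loadu_si128( ( const __m128i * )rsi );
    rsi += po[ 0 ];
    const __m256i source = _mm256_inserti128_si256( _mm256_castsi128_si256( lo ),
        _mm_loadu_si128( ( const __m128i * )rsi ), 1 ); // vinserti128 with a memory source
    rsi += po[ 1 ];
    po += 2;

    _mm256_storeu_si256( ( __m256i * )rdi, _mm256_shuffle_epi8( source, mask ) );
    rdi += 32;
}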

Sedulous answered 6/12, 2021 at 16:44 Comment(0)
