How to convert byte array of image pixels data to grayscale using vector SSE operation
Asked Answered
J

2

9

I have a problem with converting the image data stored in byte[] array to grayscale. I want to use vector SIMD operations because in future a need to write ASM and C++ DLL files to measure operations time.

When I read about SIMD I found that SSE command is operation on 128-bit registers so there is a problem because I need to convert my byte[] array into few Vector<T> stored into List<T>.

Image is four channels RGBA JPEG so I need also to know how to create vectors with R, G, B data based on single 128-bit Vector<T>. After that, I can use the Grayscale algorithm

fY(R, G, B) = R x 0.29891 + G x 0.58661 + B x 0.11448


All in all the questions are:

  1. How to load chunks of byte[] array into 128-bit registers Vector<T>.
  2. How to separate for one Vector<T> the R, G, B value to multiply it and copy to source Vector.
Jonahjonas answered 15/11, 2019 at 16:45 Comment(7)
I think you're going to need to unpack to at least 16-bit elements for a fixed-point approximation of those constants, or maybe even float. IDK what the minimum "rounding error" you could achieve for the final 8-bit pixel value is, with well-chosen 16-bit fixed-point stuff. Perhaps less than 1 unit, but probably not as low as 0.5 units. i.e. there'd probably be some inputs that would round up instead of down or vice versa, compared to the best possible result.Cenogenesis
Can you use SSE4.1? Or are you limited to SSE2 or SSSE3?Cenogenesis
Is System.Runtime.Intrinsics.X86 allowed, rather than System.Numerics.Vectors? It would enable MultiplyAddAdjacent. With just the Numerics kind of SIMD I don't even know how to do this, there are no shuffles and the only kind of horizontal addition is a full-on dot product, nothing pair-wiseManon
@PeterCordes Yes, I can use any of SSE instruction. Maybe it's a stupid question but what is the difference between them (if you can post link do description of this instructions).?Jonahjonas
stackoverflow.com/tags/sse/info covers that.Cenogenesis
How is your pixel data represented in that byte array?Eryn
@Eryn I have array with all pixels ie. {array[0] - Red value (0-255), array[1] - Green value (0-255), array[3] - Blue value (0-255), array[4] - Alpha value (0-255)}.Jonahjonas
C
11

It requires System.Runtime.Intrinsics.Experimental.dll and unsafe, but it’s relatively straightforward, and probably fast enough for many practical applications.

/// <summary>Load 4 pixels of RGB</summary>
static unsafe Vector128<int> load4( byte* src )
{
    return Sse2.LoadVector128( (int*)src );
}

/// <summary>Pack red channel of 8 pixels into ushort values in [ 0xFF00 .. 0 ] interval</summary>
static Vector128<ushort> packRed( Vector128<int> a, Vector128<int> b )
{
    Vector128<int> mask = Vector128.Create( 0xFF );
    a = Sse2.And( a, mask );
    b = Sse2.And( b, mask );
    return Sse2.ShiftLeftLogical128BitLane( Sse41.PackUnsignedSaturate( a, b ), 1 );
}

/// <summary>Pack green channel of 8 pixels into ushort values in [ 0xFF00 .. 0 ] interval</summary>
static Vector128<ushort> packGreen( Vector128<int> a, Vector128<int> b )
{
    Vector128<int> mask = Vector128.Create( 0xFF00 );
    a = Sse2.And( a, mask );
    b = Sse2.And( b, mask );
    return Sse41.PackUnsignedSaturate( a, b );
}

/// <summary>Pack blue channel of 8 pixels into ushort values in [ 0xFF00 .. 0 ] interval</summary>
static Vector128<ushort> packBlue( Vector128<int> a, Vector128<int> b )
{
    a = Sse2.ShiftRightLogical128BitLane( a, 1 );
    b = Sse2.ShiftRightLogical128BitLane( b, 1 );
    Vector128<int> mask = Vector128.Create( 0xFF00 );
    a = Sse2.And( a, mask );
    b = Sse2.And( b, mask );
    return Sse41.PackUnsignedSaturate( a, b );
}

/// <summary>Load 8 pixels, split into RGB channels.</summary>
static unsafe void loadRgb( byte* src, out Vector128<ushort> red, out Vector128<ushort> green, out Vector128<ushort> blue )
{
    var a = load4( src );
    var b = load4( src + 16 );
    red = packRed( a, b );
    green = packGreen( a, b );
    blue = packBlue( a, b );
}

const ushort mulRed = (ushort)( 0.29891 * 0x10000 );
const ushort mulGreen = (ushort)( 0.58661 * 0x10000 );
const ushort mulBlue = (ushort)( 0.11448 * 0x10000 );

/// <summary>Compute brightness of 8 pixels</summary>
static Vector128<short> brightness( Vector128<ushort> r, Vector128<ushort> g, Vector128<ushort> b )
{
    r = Sse2.MultiplyHigh( r, Vector128.Create( mulRed ) );
    g = Sse2.MultiplyHigh( g, Vector128.Create( mulGreen ) );
    b = Sse2.MultiplyHigh( b, Vector128.Create( mulBlue ) );
    var result = Sse2.AddSaturate( Sse2.AddSaturate( r, g ), b );
    return Vector128.AsInt16( Sse2.ShiftRightLogical( result, 8 ) );
}

/// <summary>Convert buffer from RGBA to grayscale.</summary>
/// <remarks>
/// <para>If your image has line paddings, you'll want to call this once per line, not for the complete image.</para>
/// <para>If width of the image is not multiple of 16 pixels, you'll need to do more work to handle the last few pixels of every line.</para>
/// </remarks>
static unsafe void convertToGrayscale( byte* src, byte* dst, int count )
{
    byte* srcEnd = src + count * 4;
    while( src < srcEnd )
    {
        loadRgb( src, out var r, out var g, out var b );
        var low = brightness( r, g, b );
        loadRgb( src + 32, out r, out g, out b );
        var hi = brightness( r, g, b );

        var bytes = Sse2.PackUnsignedSaturate( low, hi );
        Sse2.Store( dst, bytes );

        src += 64;
        dst += 16;
    }
}

However, equivalent C++ implementation would be faster. C# did a decent job inlining these functions i.e. convertToGrayscale contains no function calls. But the code of that function is far from optimal. The .NET failed to propagate constants, for the magic numbers it emitted code like this inside the loop:

mov         r8d,962Ch
vmovd       xmm1,r8d
vpbroadcastw xmm1,xmm1

The generated code only uses 6 out of 16 registers. There’re enough available registers for all the magic numbers involved.

Also .NET emits many redundant instructions which just shuffle data around:

vmovaps xmm2, xmm0
vmovaps xmm3, xmm1
Colonist answered 21/11, 2019 at 13:26 Comment(0)
M
4

With a little less precision than the other answer, 7 bit fixed point scales could be used instead of 16 bit fixed point scales, enabling the use of PMADDUBSW. That also does not require any shuffling before the multiply. Then PMADDWD can be abused as a pairwise horizontal addition, so there is still no shuffling after the multiplication. That has a relatively bad latency, but that would be hidden by instruction level parallelism, the CPU isn't just go to sit there and do nothing.

Since this code is supposed to write to a different buffer than it reads from, it is safe to use the "step back and do a single unaligned iteration"-trick to handle the last block of pixels if there are fewer than 16 of them left.

I changed the blue weight to 128 * 0.118 because then it comes out as 15, which is closer to 14.65344 (the unrounded scaled weight). Also, letting it round down to 14 makes the total weight 127, which would mean that later dividing by 128 loses brightness.

All combined,

static unsafe void convertToGrayscale(byte* src, byte* dst, int count)
{
    int countMain = count & -16;
    byte* srcEnd = src + countMain * 4;
    byte* srcRealEnd = src + count * 4;
    byte* dstRealEnd = dst + count;
    sbyte scaleR = (sbyte)(128 * 0.29891);
    sbyte scaleG = (sbyte)(128 * 0.58661);
    sbyte scaleB = (sbyte)(128 * 0.118);
    Vector128<sbyte> scales = Vector128.Create(scaleR, scaleG, scaleB, 0, scaleR, scaleG, scaleB, 0, scaleR, scaleG, scaleB, 0, scaleR, scaleG, scaleB, 0);
    Vector128<short> ones = Vector128.Create((short)1);
    do
    {
        while (src < srcEnd)
        {
            var block0 = Sse2.LoadVector128(src);
            var block1 = Sse2.LoadVector128(src + 16);
            var block2 = Sse2.LoadVector128(src + 32);
            var block3 = Sse2.LoadVector128(src + 48);
            var scaled0 = Ssse3.MultiplyAddAdjacent(block0, scales);
            var scaled1 = Ssse3.MultiplyAddAdjacent(block1, scales);
            var scaled2 = Ssse3.MultiplyAddAdjacent(block2, scales);
            var scaled3 = Ssse3.MultiplyAddAdjacent(block3, scales);
            var t0 = Sse2.MultiplyAddAdjacent(scaled0, ones);
            var t1 = Sse2.MultiplyAddAdjacent(scaled1, ones);
            var t2 = Sse2.MultiplyAddAdjacent(scaled2, ones);
            var t3 = Sse2.MultiplyAddAdjacent(scaled3, ones);
            var c01 = Sse2.PackSignedSaturate(t0, t1);
            c01 = Sse2.ShiftRightLogical(c01, 7);
            var c23 = Sse2.PackSignedSaturate(t2, t3);
            c23 = Sse2.ShiftRightLogical(c23, 7);
            var c0123 = Sse2.PackUnsignedSaturate(c01, c23);
            Sse2.Store(dst, c0123);
            src += 64;
            dst += 16;
        }
        // hack to re-use the main loop for the "tail"
        if (src == srcRealEnd)
            break;
        srcEnd = srcRealEnd;
        src = srcRealEnd - 64;
        dst = dstRealEnd - 16;
    } while (true);
}

On my PC, this is about twice as fast as the solution based on PMULHUW.

Manon answered 10/7, 2020 at 6:20 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.