Faster way to swap endianness in C# with 32 bit words
Asked Answered
C

2

10

In this question, the following code:

public static void Swap(byte[] data)
{
        for (int i = 0; i < data.Length; i += 2)
        {
                byte b = data[i];
                data[i] = data[i + 1];
                data[i + 1] = b;
        }
}

was rewritten in unsafe code to improve its performance:

public static unsafe void SwapX2(Byte[] Source)  
{  
    fixed (Byte* pSource = &Source[0])  
    {  
        Byte* bp = pSource;  
        Byte* bp_stop = bp + Source.Length;  

        while (bp < bp_stop)  
        {
            *(UInt16*)bp = (UInt16)(*bp << 8 | *(bp + 1));  
            bp += 2;  
        }  
    }  
}

Assuming that one wanted to do the same thing with 32 bit words:

public static void SwapX4(byte[] data)
{
    byte temp;
    for (int i = 0; i < data.Length; i += 4)
    {
        temp = data[i];
        data[i] = data[i + 3];
        data[i + 3] = temp;
        temp = data[i + 1];
        data[i + 1] = data[i + 2];
        data[i + 2] = temp;
    }
}

how would this be rewritten in a similar fashion?

Clincher answered 25/7, 2012 at 23:19 Comment(1)
The rewrite, for the record, would only work correctly on a little-endian machine. Also, from the performance standpoint, the less reads from memory, the better. The right hand side of the expression has two and one is enough. The compiler might or might not optimize that away.Tojo
E
11
public static unsafe void SwapX4(Byte[] Source)  
{  
    fixed (Byte* pSource = &Source[0])  
    {  
        Byte* bp = pSource;  
        Byte* bp_stop = bp + Source.Length;  

        while (bp < bp_stop)  
        {
            *(UInt32*)bp = (UInt32)(
                (*bp       << 24) |
                (*(bp + 1) << 16) |
                (*(bp + 2) <<  8) |
                (*(bp + 3)      ));
            bp += 4;  
        }  
    }  
}

Note that both of these functions (my SwapX4 and your SwapX2) will only swap anything on a little-endian host; when run on a big-endian host, they are an expensive no-op.

Emaciation answered 25/7, 2012 at 23:23 Comment(2)
Awesome. It runs twice as fast as the original, tested on 100,000 8192-byte arrays.Clincher
Careful if the length of Source isn't a multiple of four. This code can corrupt up to three bytes of memory, or cause a page fault. To prevent this, throw an exception if the length isn't a multiple of four or don't swap the last bytes in case of no multiple of four. eg. Change "bp + Source.Length" with "bp + (Source.Length & 0xFFFFFFFC)"Surber
S
3

This version will not exceed the bounds of the buffer. Works on both Little and Big Endian architectures. And is faster on larger data. (Update: Add build configurations for x86 and x64, predefine X86 for 32 bit(x86) and X64 for 64 bit(x64) and it'll be slightly faster.)

public static unsafe void Swap4(byte[] source)
{
    fixed (byte* psource = source)
    {
#if X86
            var length = *((uint*)(psource - 4)) & 0xFFFFFFFEU;
#elif X64
            var length = *((uint*)(psource - 8)) & 0xFFFFFFFEU;
#else
            var length = (source.Length & 0xFFFFFFFE);
#endif
        while (length > 7)
        {
            length -= 8;
            ulong* pulong = (ulong*)(psource + length);
            *pulong = ( ((*pulong >> 24) & 0x000000FF000000FFUL)
                      | ((*pulong >> 8)  & 0x0000FF000000FF00UL)
                      | ((*pulong << 8)  & 0x00FF000000FF0000UL)
                      | ((*pulong << 24) & 0xFF000000FF000000UL));
        }
        if(length != 0)
        {
            uint* puint = (uint*)psource;
            *puint = ( ((*puint >> 24))
                     | ((*puint >> 8) & 0x0000FF00U)
                     | ((*puint << 8) & 0x00FF0000U)
                     | ((*puint << 24)));
        }
    }
}
Surber answered 13/3, 2018 at 5:27 Comment(6)
Interesting. I'll have a look.Clincher
This might only be faster on 64 bit (and as i mentioned, with larger chunks of data). The other code is good, but see my comment there. I tried to modify the code but that was rejected by stackoverflow. Which kind of worries me that many people will copy paste code that, in case they don't use care, will generate bugs. This works for swap2 also and that version is up to three times faster with large chunks of data.Surber
Absolute rigor wasn't a design requirement; the data being fed to this function is already checked for the error conditions you described. Best possible performance was the most important consideration.Clincher
Nowhere is there a caution warning the data should be checked for certain conditions. So the unaware programmer might use that code in erroneous conditions.Surber
It's not in a library for public consumption. It's used in a single application, and unlikely to be used anywhere else. As I said before, the requirement was maximum performance, not safety. If safety were a requirement, we would have never considered unsafe.Clincher
I was not refering to your case, but the reader his case. This is a website which many use to find code, which they often use without understanding the details. (imo) / Made an update which makes it slightly faster.Surber

© 2022 - 2024 — McMap. All rights reserved.