In my tests, the following method is almost 3 times faster than the accepted answer. (It is always faster for more than 3 characters, i.e. six bytes, and a bit slower for three characters, i.e. six bytes, or fewer.) (Note that the accepted answer can read/write outside the bounds of the array.)
(Update: once you have a pointer, there's no need to call the Length property to get the length. Reading it through the pointer is a bit faster, but requires either a runtime check or, as in the next example, a separate build configuration for each platform. Define X86 and X64 under the respective configurations.)
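static unsafe void SwapV2(byte[] source)
{
    fixed (byte* psource = source)
    {
        // fixed yields a null pointer for a null or empty array: nothing to swap,
        // and the header reads below must not dereference it.
        if (psource == null) return;
#if X86
        // Read the length from the array header just before the data.
        // This depends on the CLR's array layout, which is an implementation detail.
        var length = *((uint*)(psource - 4)) & 0xFFFFFFFEU;
#elif X64
        var length = *((uint*)(psource - 8)) & 0xFFFFFFFEU;
#else
        // Portable fallback: round the length down to an even number of bytes.
        var length = (uint)source.Length & 0xFFFFFFFEU;
#endif
        // Swap four byte pairs at a time, working backwards from the end.
        while (length > 7)
        {
            length -= 8;
            ulong* pulong = (ulong*)(psource + length);
            *pulong = ( ((*pulong >> 8) & 0x00FF00FF00FF00FFUL)
                      | ((*pulong << 8) & 0xFF00FF00FF00FF00UL));
        }
        // At most six bytes remain: swap two pairs at once if possible.
        if (length > 3)
        {
            length -= 4;
            uint* puint = (uint*)(psource + length);
            *puint = ( ((*puint >> 8) & 0x00FF00FFU)
                     | ((*puint << 8) & 0xFF00FF00U));
        }
        // One last pair, which is always at the start of the array.
        if (length > 1)
        {
            ushort* pushort = (ushort*)psource;
            *pushort = (ushort) ( (*pushort >> 8)
                                | (*pushort << 8));
        }
    }
}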
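The shift-and-mask step used in the loop above may be easier to follow on a single value. This is a minimal sketch in safe code; the class and method names (`SwapMaskDemo`, `SwapPairs`) are made up for illustration and are not part of the answer's code.

```csharp
using System;

public static class SwapMaskDemo
{
    // Swaps the two bytes inside each of the four 16-bit pairs of a 64-bit word.
    public static ulong SwapPairs(ulong value)
    {
        // value >> 8 moves every high byte of a pair into the low position;
        // the mask drops the bytes that crossed over from the neighbouring pair.
        ulong highToLow = (value >> 8) & 0x00FF00FF00FF00FFUL;
        // value << 8 moves every low byte of a pair into the high position.
        ulong lowToHigh = (value << 8) & 0xFF00FF00FF00FF00UL;
        return highToLow | lowToHigh;
    }
}
```

For example, `SwapMaskDemo.SwapPairs(0x0102030405060708UL)` returns `0x0201040306050807`: every pair is swapped in place, and the order of the pairs does not change.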
Five runs of 300,000 iterations over 8192 bytes (elapsed time, lower is better)
- SwapV2: 1055, 1051, 1043, 1041, 1044
- SwapX2: 2802, 2803, 2803, 2805, 2805
Five runs of 50,000,000 iterations over 6 bytes (elapsed time, lower is better)
- SwapV2: 1092, 1085, 1086, 1087, 1086
- SwapX2: 1018, 1019, 1015, 1017, 1018
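To reproduce this kind of comparison, a timing loop along the following lines could be used. It is only a sketch and not necessarily how the figures above were produced; `SwapBench` and `Measure` are hypothetical names, and the result is in milliseconds because the sketch uses `Stopwatch.ElapsedMilliseconds`.

```csharp
using System;
using System.Diagnostics;

public static class SwapBench
{
    // Calls `swap` `loops` times on one buffer of `size` bytes and repeats
    // that `runs` times. Returns the elapsed milliseconds of each run.
    public static long[] Measure(Action<byte[]> swap, int size, int loops, int runs = 5)
    {
        var buffer = new byte[size];
        new Random(42).NextBytes(buffer);   // fixed seed: same data every time

        swap(buffer);                       // warm-up, so JIT compilation is not timed

        var results = new long[runs];
        for (int r = 0; r < runs; r++)
        {
            var watch = Stopwatch.StartNew();
            for (int i = 0; i < loops; i++)
            {
                swap(buffer);
            }
            watch.Stop();
            results[r] = watch.ElapsedMilliseconds;
        }
        return results;
    }
}
```

A call such as `SwapBench.Measure(SwapV2, 8192, 300000)` would then give five timings for the 8192-byte case. Build in Release mode and run without a debugger attached, otherwise the numbers say little.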
But if the data is large and performance really matters, you could use SSE or AVX. (About 13 times faster than SwapX2 in the test below.) https://pastebin.com/WaFk275U
Five runs of 100,000 loops over 8192 bytes (4096 chars)
- SwapX2 : 226, 223, 225, 226, 227 Min: 223
- SwapV2 : 113, 111, 112, 114, 112 Min: 111
- SwapA2 : 17, 17, 17, 17, 16 Min: 16
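For a rough idea of what the SSE route can look like, here is a sketch using the hardware intrinsics in `System.Runtime.Intrinsics` (available from .NET Core 3.0). It is not the code behind the link and not the SwapA2 measured above; `ByteSwapSketch` and `SwapSsse3` are names invented for this example, and it assumes an SSSE3-capable CPU for the fast path, with a plain loop as fallback.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class ByteSwapSketch
{
    // Swaps every pair of bytes in place; a trailing odd byte is left untouched.
    public static void SwapSsse3(byte[] source)
    {
        Span<byte> even = source.AsSpan(0, source.Length & ~1);
        int done = 0;

        if (Ssse3.IsSupported)
        {
            // Shuffle mask: output byte i is taken from input byte mask[i],
            // so each pair (0,1), (2,3), ... is exchanged.
            Vector128<byte> mask = Vector128.Create(
                (byte)1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14);

            // View the buffer as whole 16-byte blocks; leftover bytes are not included.
            Span<Vector128<byte>> blocks = MemoryMarshal.Cast<byte, Vector128<byte>>(even);
            for (int i = 0; i < blocks.Length; i++)
            {
                blocks[i] = Ssse3.Shuffle(blocks[i], mask);
            }
            done = blocks.Length * 16;
        }

        // Scalar loop for the tail, or for everything when SSSE3 is unavailable.
        for (int i = done; i < even.Length; i += 2)
        {
            byte tmp = even[i];
            even[i] = even[i + 1];
            even[i + 1] = tmp;
        }
    }
}
```

The sketch stays in safe code by going through `MemoryMarshal.Cast`; a pointer-based version with `Sse2.LoadVector128` and `Sse2.Store`, or a 32-byte AVX2 variant, follows the same pattern and may be faster, but I have not measured this one.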