Performance improvement for mirroring bit matrix around the diagonal
Asked Answered
L

3

6

I have some code that manages data received from an array of sensors. The PIC that controls the sensors uses 8 SAR-ADCs in parallel to read 4096 data bytes. It means it reads the most significant bit for the first 8 bytes; then it reads their second bit and so on until the eighth (least significant bit).

Basically, for each 8 bytes it reads, it creates (and sends forth to the computer) 8 bytes as follows:

// rxData[0] = MSB[7] MSB[6] MSB[5] MSB[4] MSB[3] MSB[2] MSB[1] MSB[0]
// rxData[1] =  B6[7]  B6[6]  B6[5]  B6[4]  B6[3]  B6[2]  B6[1]  B6[0]
// rxData[2] =  B5[7]  B5[6]  B5[5]  B5[4]  B5[3]  B5[2]  B5[1]  B5[0]
// rxData[3] =  B4[7]  B4[6]  B4[5]  B4[4]  B4[3]  B4[2]  B4[1]  B4[0]
// rxData[4] =  B3[7]  B3[6]  B3[5]  B3[4]  B3[3]  B3[2]  B3[1]  B3[0]
// rxData[5] =  B2[7]  B2[6]  B2[5]  B2[4]  B2[3]  B2[2]  B2[1]  B2[0]
// rxData[6] =  B1[7]  B1[6]  B1[5]  B1[4]  B1[3]  B1[2]  B1[1]  B1[0]
// rxData[7] = LSB[7] LSB[6] LSB[5] LSB[4] LSB[3] LSB[2] LSB[1] LSB[0]

This pattern is repeated for all the 4096 bytes the system reads and I process.

Imagine that each 8 bytes read are taken separately, we can then see them as an 8-by-8 array of bits. I need to mirror this array around the diagonal going from its bottom-left (LSB[7]) to its top-right (MSB[0]). Once this is done, the resulting 8-by-8 array of bits contains in its rows the correct data bytes read from the sensors. I used to perform this operation on the PIC controller, using left shifts and so on, but that slowed down the system quite a lot. Thus, this operation is now performed on the computer where we process the data, using the following code:

BitArray ba = new BitArray(rxData);
BitArray ba2 = new BitArray(ba.Count);
for (int i = 0; i < ba.Count; i++)
{
    ba2[i] = ba[(((int)(i / 64)) + 1) * 64 - 1 - (i % 8) * 8 - (int)(i / 8) + ((int)(i / 64)) * 8];
}
byte[] data = new byte[rxData.Length];
ba2.CopyTo(data, 0);

Note that THIS CODE WORKS.

rxData is the received byte array.

The formula I use for the index of ba[] in the loop codes for the mirroring of the arrays I described above. The size of the array is checked elsewhere to make sure it always contains the correct number (4096) of bytes.

So far this was the background for my problem.

In each processing loop of my system I need to perform that mirroring twice, because my data processing is on the difference between two arrays acquired consecutively. Speed is important for my system (possibly the main constraint on the processing), and the mirroring accounts for between 10% and 30% of the execution time of my processing.

I would like to know if there are alternative solutions I might compare to my mirroring code and that might allow me to improve performances. Using the BitArrays is the only way I found to address the different bits in the received bytes.

Leticia answered 13/2, 2014 at 4:4 Comment(3)
Flipping often means to complement the bits, while mirroring will make the reader thinks that you want to reverse the bit or flip it around the center. You should state it clearly such as "transpose bit (array)". Also, no need to cast to int everywhere since / between 2 integers is integer divisionDefeat
And why don't you read the 8 values and send it as-is. That'll speed up both the microcontroller and the PC. Anyway, your question is interesting thoughDefeat
@lu'u Because of the way the hardware was designed, this is (sadly) the fastest read-out method.Leticia
D
2

Operating on separate bits is very slow, and creating 2 bit arrays and copy data back and forth creates further overheads

The simplest obvious solution is just extracting the bits and combining them again. You can do it with a loop but since it uses both left and right shift at the same time, you need a function to handle negative shift amount. As a result here I unrolled it for easier understanding and more speed

out[0] = (byte)(((rxData[0] & 0x80) >> 0) | ((rxData[1] & 0x80) >> 1) | ((rxData[2] & 0x80) >> 2) | ((rxData[3] & 0x80) >> 3) |
                ((rxData[4] & 0x80) >> 4) | ((rxData[5] & 0x80) >> 5) | ((rxData[6] & 0x80) >> 6) | ((rxData[7] & 0x80) >> 7));
            
out[1] = (byte)(((rxData[0] & 0x40) << 1) | ((rxData[1] & 0x40) >> 0) | ((rxData[2] & 0x40) >> 1) | ((rxData[3] & 0x40) >> 2) |
                ((rxData[4] & 0x40) >> 3) | ((rxData[5] & 0x40) >> 4) | ((rxData[6] & 0x40) >> 5) | ((rxData[7] & 0x40) >> 6));
            
out[2] = (byte)(((rxData[0] & 0x20) << 2) | ((rxData[1] & 0x20) << 1) | ((rxData[2] & 0x20) >> 0) | ((rxData[3] & 0x20) >> 1) |
                ((rxData[4] & 0x20) >> 2) | ((rxData[5] & 0x20) >> 3) | ((rxData[6] & 0x20) >> 4) | ((rxData[7] & 0x20) >> 5));
            
out[3] = (byte)(((rxData[0] & 0x10) << 3) | ((rxData[1] & 0x10) << 2) | ((rxData[2] & 0x10) << 1) | ((rxData[3] & 0x10) >> 0) |
                ((rxData[4] & 0x10) >> 1) | ((rxData[5] & 0x10) >> 2) | ((rxData[6] & 0x10) >> 3) | ((rxData[7] & 0x10) >> 4));
            
out[4] = (byte)(((rxData[0] & 0x08) << 4) | ((rxData[1] & 0x08) << 3) | ((rxData[2] & 0x08) << 2) | ((rxData[3] & 0x08) << 1) |
                ((rxData[4] & 0x08) >> 0) | ((rxData[5] & 0x08) >> 1) | ((rxData[6] & 0x08) >> 2) | ((rxData[7] & 0x08) >> 3));
            
out[5] = (byte)(((rxData[0] & 0x04) << 5) | ((rxData[1] & 0x04) << 4) | ((rxData[2] & 0x04) << 3) | ((rxData[3] & 0x04) << 2) |
                ((rxData[4] & 0x04) << 1) | ((rxData[5] & 0x04) >> 0) | ((rxData[6] & 0x04) >> 1) | ((rxData[7] & 0x04) >> 2));
            
out[6] = (byte)(((rxData[0] & 0x02) << 6) | ((rxData[1] & 0x02) << 5) | ((rxData[2] & 0x02) << 4) | ((rxData[3] & 0x02) << 3) |
                ((rxData[4] & 0x02) << 2) | ((rxData[5] & 0x02) << 1) | ((rxData[6] & 0x02) >> 0) | ((rxData[7] & 0x02) >> 1));

out[7] = (byte)(((rxData[0] & 0x01) << 7) | ((rxData[1] & 0x01) << 6) | ((rxData[2] & 0x01) << 5) | ((rxData[3] & 0x01) << 4) |
                ((rxData[4] & 0x01) << 3) | ((rxData[5] & 0x01) << 2) | ((rxData[6] & 0x01) << 1) | ((rxData[7] & 0x01) >> 0));

Obviously this is still very slow because it operates byte-by-byte. An optimal solution would operate on multiple bytes at once, for example with SIMD and/or multithreading. Especially since you're doing that for lots of data, .NET SIMD intrinsics would be extremely useful

Defeat answered 13/2, 2014 at 15:8 Comment(2)
Thanks. This is definitely way faster than the option with the bitarrays. I wrapped it in my loops, and it takes an average of 60-70 ticks vs the average 1800 of the bitarray operations. However, the way you wrote it throws a casting error. Apparently C# does the when it does the shifts casts the number to int, so I need to recast it to byte.Leticia
I don't have much experience with C# so I don't know about its casting rules. This code can easily be optimized using SIMD, so if possible, use unsafe code or a seperate native module with SSE/AVX intrinsicsDefeat
B
2

You will probably find that BitVector performs a good deal better than BitArray.

BitVector32 is more efficient than BitArray for Boolean values and small integers that are used internally. A BitArray can grow indefinitely as needed, but it has the memory and performance overhead that a class instance requires. In contrast, a BitVector32 uses only 32 bits.

https://learn.microsoft.com/en-us/dotnet/api/system.collections.specialized.bitvector32

If you initialize an array of BitVector32 and operate on those, it should be faster than operating on BitArray as you do now.

You may also get a performance boost if you use one thread to perform the mirroring and a second thread to perform the analysis of consecutive reads. The Task Parallel Library Dataflow provides a nice framework for that type of solution. You could have one Source Block to acquire the data buffer, one Transform Block to perform the mirroring, and one Target Block to perform the data processing.

Blooper answered 13/2, 2014 at 4:15 Comment(1)
I will definetely test the BitVector tomorrow. Regarding the parallelization, I am not sure I understand what you mean. It is my plan, once I got the more optimal mirroring, to interleave the mirroring with the rest of the processing, rather than have it sequential as it is now, but not necessarily on different threads (I generally have to combine data from different sensors to obtain the results I need, so I cannot guarantee there would not be conflicts)Leticia
D
2

Operating on separate bits is very slow, and creating 2 bit arrays and copy data back and forth creates further overheads

The simplest obvious solution is just extracting the bits and combining them again. You can do it with a loop but since it uses both left and right shift at the same time, you need a function to handle negative shift amount. As a result here I unrolled it for easier understanding and more speed

out[0] = (byte)(((rxData[0] & 0x80) >> 0) | ((rxData[1] & 0x80) >> 1) | ((rxData[2] & 0x80) >> 2) | ((rxData[3] & 0x80) >> 3) |
                ((rxData[4] & 0x80) >> 4) | ((rxData[5] & 0x80) >> 5) | ((rxData[6] & 0x80) >> 6) | ((rxData[7] & 0x80) >> 7));
            
out[1] = (byte)(((rxData[0] & 0x40) << 1) | ((rxData[1] & 0x40) >> 0) | ((rxData[2] & 0x40) >> 1) | ((rxData[3] & 0x40) >> 2) |
                ((rxData[4] & 0x40) >> 3) | ((rxData[5] & 0x40) >> 4) | ((rxData[6] & 0x40) >> 5) | ((rxData[7] & 0x40) >> 6));
            
out[2] = (byte)(((rxData[0] & 0x20) << 2) | ((rxData[1] & 0x20) << 1) | ((rxData[2] & 0x20) >> 0) | ((rxData[3] & 0x20) >> 1) |
                ((rxData[4] & 0x20) >> 2) | ((rxData[5] & 0x20) >> 3) | ((rxData[6] & 0x20) >> 4) | ((rxData[7] & 0x20) >> 5));
            
out[3] = (byte)(((rxData[0] & 0x10) << 3) | ((rxData[1] & 0x10) << 2) | ((rxData[2] & 0x10) << 1) | ((rxData[3] & 0x10) >> 0) |
                ((rxData[4] & 0x10) >> 1) | ((rxData[5] & 0x10) >> 2) | ((rxData[6] & 0x10) >> 3) | ((rxData[7] & 0x10) >> 4));
            
out[4] = (byte)(((rxData[0] & 0x08) << 4) | ((rxData[1] & 0x08) << 3) | ((rxData[2] & 0x08) << 2) | ((rxData[3] & 0x08) << 1) |
                ((rxData[4] & 0x08) >> 0) | ((rxData[5] & 0x08) >> 1) | ((rxData[6] & 0x08) >> 2) | ((rxData[7] & 0x08) >> 3));
            
out[5] = (byte)(((rxData[0] & 0x04) << 5) | ((rxData[1] & 0x04) << 4) | ((rxData[2] & 0x04) << 3) | ((rxData[3] & 0x04) << 2) |
                ((rxData[4] & 0x04) << 1) | ((rxData[5] & 0x04) >> 0) | ((rxData[6] & 0x04) >> 1) | ((rxData[7] & 0x04) >> 2));
            
out[6] = (byte)(((rxData[0] & 0x02) << 6) | ((rxData[1] & 0x02) << 5) | ((rxData[2] & 0x02) << 4) | ((rxData[3] & 0x02) << 3) |
                ((rxData[4] & 0x02) << 2) | ((rxData[5] & 0x02) << 1) | ((rxData[6] & 0x02) >> 0) | ((rxData[7] & 0x02) >> 1));

out[7] = (byte)(((rxData[0] & 0x01) << 7) | ((rxData[1] & 0x01) << 6) | ((rxData[2] & 0x01) << 5) | ((rxData[3] & 0x01) << 4) |
                ((rxData[4] & 0x01) << 3) | ((rxData[5] & 0x01) << 2) | ((rxData[6] & 0x01) << 1) | ((rxData[7] & 0x01) >> 0));

Obviously this is still very slow because it operates byte-by-byte. An optimal solution would operate on multiple bytes at once, for example with SIMD and/or multithreading. Especially since you're doing that for lots of data, .NET SIMD intrinsics would be extremely useful

Defeat answered 13/2, 2014 at 15:8 Comment(2)
Thanks. This is definitely way faster than the option with the bitarrays. I wrapped it in my loops, and it takes an average of 60-70 ticks vs the average 1800 of the bitarray operations. However, the way you wrote it throws a casting error. Apparently C# does the when it does the shifts casts the number to int, so I need to recast it to byte.Leticia
I don't have much experience with C# so I don't know about its casting rules. This code can easily be optimized using SIMD, so if possible, use unsafe code or a seperate native module with SSE/AVX intrinsicsDefeat
D
0

This is actually the same as the get column in a bitboard problem so it can be solved even more efficiently by considering the byte array as a 64-bit integer

byte get_byte(ulong matrix, uint col)
{
    const ulong column_mask = 0x8080808080808080ull;
    const ulong magic       = 0x2040810204081ull;
    
    ulong column = ((matrix << col) & column_mask) * magic;
    return (byte)(column >> 56);
}

// Actually the below step is not needed. You can read rxData directly into the `ulong`
// variable instead of a bit array. Remember to CHANGE THE ENDIANNESS if necessary
ulong matrix = (rxData[7] << 56) | (rxData[6] << 48) | (rxData[5] << 40) | (rxData[4] << 32)
             | (rxData[3] << 24) | (rxData[2] << 16) | (rxData[1] <<  8) | rxData[0];

for (int i = 0; i < 8; i++)
    data[i] = get_byte(matrix, i);

In newer x86 CPUs you can use the PDEP instruction in the BMI2 instruction set. I'm not sure if there's any corresponding intrinsic in C#. If the intrinsic doesn't exist then you have to use native code like this

data[i] = _pext_u64(matrix, column_mask >> col);

Update:

The instruction has been added to .NET Core 3.0 as ParallelBitDeposit() intrinsic so it's now much easier and faster to do that from C#

ulong matrix = BitConverter.ToUInt64(rxData, 0);
for (int i = 0; i < 8; i++)
    data[i] = Bmi2.X64.ParallelBitDeposit(matrix, 0x8080808080808080UL >> i);

There's also ParallelBitExtract() for the inverse PEXT instruction

Defeat answered 1/8, 2018 at 8:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.