How would you transpose a binary matrix?

I have binary matrices in C++ that I represent with a vector of 8-bit values.

For example, the following matrix:

1 0 1 0 1 0 1
0 1 1 0 0 1 1
0 0 0 1 1 1 1

is represented as:

const uint8_t matrix[] = {
    0b01010101,
    0b00110011,
    0b00001111,
};

The reason I'm doing it this way is that computing the product of such a matrix and an 8-bit vector then becomes really simple and efficient (just one bitwise AND and a parity computation per row), which is much better than calculating each bit individually.
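As a sketch of that product (the helper names and the row-0-at-MSB output ordering are my assumptions, not code from the question), each output bit is the parity of row AND vector:

```c
#include <stdint.h>

/* Parity of an 8-bit value: XOR-fold the bits down to bit 0. */
static uint8_t parity8(uint8_t x)
{
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/* Multiply an nrows x 8 binary matrix by an 8-bit vector over GF(2):
 * one AND plus one parity per row, as described above. Row 0's result
 * ends up in the most significant output bit (an assumption). */
static uint8_t matvec(const uint8_t m[], int nrows, uint8_t v)
{
    uint8_t out = 0;
    for (int i = 0; i < nrows; ++i)
        out = (uint8_t)((out << 1) | parity8(m[i] & v));
    return out;
}
```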

I'm now looking for an efficient way to transpose such a matrix, but I haven't been able to figure out how to do it without having to manually calculate each bit.

Just to clarify, for the above example, I'd like to get the following result from the transposition:

const uint8_t transposed[] = {
    0b00000000,
    0b00000100,
    0b00000010,
    0b00000110,
    0b00000001,
    0b00000101,
    0b00000011,
    0b00000111,
};

NOTE: I would prefer an algorithm that can calculate this with arbitrary-sized matrices but am also interested in algorithms that can only handle certain sizes.

Guck answered 31/7, 2015 at 9:17 Comment(14)
I don't understand the transposed output: why is the first line 0b00000000 instead of 0b00000001? why is the second line 0b00000100 instead of 0b00000010? ...Bors
I can't see how you can really avoid having to manually calculate each bit. Each row of your result has a bit from every row of your source. That does rather prevent any useful parallelism...Disinfectant
@Bors because the first column is full of zeros (the matrix is 3x7 but represented in 3x8)Hippocras
Since it is a 3×8 matrix, the output of the transposition is a 8×3 matrix. Transposition means that columns become rows.Guck
@Paul yes I haven't found any easy way to do it either, that's why I'm asking here. I bet there could be some sort of bit twiddling or SIMD or whatever, which could get the job done.Guck
Do you need to physically transpose the data? That's very inefficient. Can you wrap the data in an interface and just flag that the data is transposed? This could help optimise some matrix operationsTelford
@Telford - Yes. I need this for two purposes. 1. Sometimes I'd like to work with the transposed matrix. And more importantly: 2. I'd like to use this as a means to interleave stuff.Guck
I think your approach is highly optimised for the use-case you mention, multiplying by vectors, but not for other use-cases, for example matrix multiplication or transposition. It's hard to see how you'd get around that. Also how do handle a matrix bigger than 8x8?Telford
From given example I see that you always transpose 8x8 matrix (for smaller matrices you just fill the remaining rows with zeros). In this case and if your code will work on 64-bit CPU a pretty efficient algorithm exists. It is described in Knuth's "The art of computer programming" vol. 4a chapter 7.1.3. You could find an implementation in function "flipDiagA1H8" on this page: chessprogramming.wikispaces.com/Flipping+Mirroring+and+RotatingHowlet
@HighPerformanceMark I've chosen it in such a way that the most common operation I perform is the cheapest (which is multiplying with bit vectors).Guck
@Telford if I had a way to do a cheap transposition, matrix multiplication would be cheap too. As for matrices bigger than 8x8, I guess I can divide the operation to 8x8 pieces.Guck
@EvgenyKluev Thanks! I think transposition is flipDiagA8H1, right?Guck
In chess programming row number zero is at the bottom.Howlet
@EvgenyKluev Sorry, I clearly misread your comment.Osculate

I've spent more time looking for a solution, and I've found some good ones.

The SSE2 way

On a modern x86 CPU, transposing a binary matrix can be done very efficiently with SSE2 instructions. Using such instructions it is possible to process a 16×8 matrix.

This solution is inspired by this blog post by mischasan and is vastly superior to every other suggestion I've received for this question so far.

The idea is simple:

  • #include <emmintrin.h>
  • Pack 16 uint8_t variables into an __m128i
  • Use _mm_movemask_epi8 to collect the MSB of each byte, producing a uint16_t
  • Use _mm_slli_epi64 to shift the 128-bit register by one
  • Repeat until you've got all 8 uint16_ts
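A minimal sketch of those four steps (the function name and conventions are mine: each row is one byte, with the MSB treated as the highest-numbered column, so the first movemask yields column 7):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_movemask_epi8, _mm_slli_epi64 */

/* Transpose a 16x8 bit matrix (16 one-byte rows) into 8 uint16_t rows.
 * out[c] holds column c of the input; bit r of out[c] is row r's bit. */
static void transpose16x8(const uint8_t in[16], uint16_t out[8])
{
    __m128i x = _mm_loadu_si128((const __m128i *)in);
    for (int c = 7; c >= 0; --c) {
        /* movemask gathers the MSB of each of the 16 bytes. */
        out[c] = (uint16_t)_mm_movemask_epi8(x);
        /* Shift each 64-bit lane left by one so the next lower bit of
         * every byte becomes that byte's MSB. Bits that cross a byte
         * boundary only land in positions that were already consumed. */
        x = _mm_slli_epi64(x, 1);
    }
}
```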

A generic 32-bit solution

Unfortunately, I also need to make this work on ARM. After implementing the SSE2 version, it would be easy to just find the NEON equivalents, but Cortex-M CPUs (contrary to the Cortex-A) have no SIMD capabilities, so NEON isn't too useful for me at the moment.

NOTE: Because the Cortex-M doesn't have native 64-bit arithmetic, I could not use the ideas in any of the answers that suggest treating an 8x8 block as a uint64_t. Most microcontrollers with a Cortex-M CPU also don't have much memory, so I prefer to do all this without a lookup table.

After some thinking, the same algorithm can be implemented using plain 32-bit arithmetic and some clever coding. This way, I can work with 4×8 blocks at a time. The idea was suggested by a colleague, and the magic lies in the way 32-bit multiplication works: you can find a 32-bit number to multiply with, and then the MSB of each byte ends up next to the others in the upper 32 bits of the result.

  • Pack 4 uint8_ts in a 32-bit variable
  • Mask the 1st bit of each byte (using 0x80808080)
  • Multiply it with 0x02040810
  • Take the 4 LSBs of the upper 32 bits of the multiplication
  • Generally, you can mask the Nth bit in each byte (shift the mask right by N bits) and multiply with the magic number, shifted left by N bits. The advantage here is that if your compiler is smart enough to unroll the loop, both the mask and the 'magic number' become compile-time constants, so shifting them does not incur any performance penalty whatsoever. There's some trouble with the last series of 4 bits, because then one LSB is lost, so in that case I needed to shift the input left by 8 bits and use the same method as for the first series of 4 bits.

If you do this with two 4×8 blocks, you can get an 8×8 block done and arrange the resulting bits so that everything ends up in the right place.
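A sketch of the whole 4×8 step under my own conventions (rows packed MSB-first into a 32-bit word; the 32×32→64 multiply is a single UMULL on Cortex-M3/M4). It is a slight variant of the text above: the magic constant stays fixed and the extraction shift varies instead, which sidesteps the 32-bit overflow trouble with the last bits. The helper name and the output packing (row 0 in bit 3 of each nibble) are assumptions:

```c
#include <stdint.h>

/* Transpose a 4x8 block: 4 one-byte rows in, 8 four-bit columns out.
 * For column n, mask bit (7 - n) of every byte; the magic multiplier
 * 0x02040810 gathers those four bits so that they land, adjacent, at
 * bits (32 - n)..(35 - n) of the 64-bit product. */
static void transpose4x8(const uint8_t rows[4], uint8_t cols[8])
{
    uint32_t x = ((uint32_t)rows[0] << 24) | ((uint32_t)rows[1] << 16)
               | ((uint32_t)rows[2] << 8)  |  (uint32_t)rows[3];

    for (int n = 0; n < 8; ++n) {
        uint32_t masked = x & (0x80808080u >> n);
        uint64_t product = (uint64_t)masked * 0x02040810u;
        /* Row 0's bit ends up in bit 3 of the nibble, row 3's in bit 0. */
        cols[n] = (uint8_t)((product >> (32 - n)) & 0xFu);
    }
}
```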

Guck answered 6/8, 2015 at 14:20 Comment(2)
@étale-cohomology Yes, see github.com/Venemo/fecmagic/blob/master/src/binarymatrix.hGuck
@étale-cohomology In the long run it turned out that the SSE version wasn't worth the effort, because the compiler can optimize the 'generic' version to go faster than what I wrote with SSE. :)Guck

Here is the text of Jay Foad's email to me regarding fast Boolean matrix transpose:

The heart of the Boolean transpose algorithm is a function I'll call transpose8x8 which transposes an 8x8 Boolean matrix packed in a 64-bit word (in row major order from MSB to LSB). To transpose any rectangular matrix whose width and height are multiples of 8, break it down into 8x8 blocks, transpose each one individually and store them at the appropriate place in the output. To load an 8x8 block you have to load 8 individual bytes and shift and OR them into a 64-bit word. Same kinda thing for storing.
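The load/store glue might look like this (a sketch; `stride` is the byte width of the full matrix, and the MSB-first packing matches the email's row-major order):

```c
#include <stdint.h>
#include <stddef.h>

/* Gather an 8x8 block into a 64-bit word, row 0 ending up in the most
 * significant byte (row-major, MSB to LSB as described above). */
static uint64_t load8x8(const uint8_t *block, size_t stride)
{
    uint64_t word = 0;
    for (int r = 0; r < 8; ++r)
        word = (word << 8) | block[r * stride];
    return word;
}

/* Scatter a 64-bit word back out as 8 rows. */
static void store8x8(uint64_t word, uint8_t *block, size_t stride)
{
    for (int r = 7; r >= 0; --r) {
        block[r * stride] = (uint8_t)(word & 0xFF);
        word >>= 8;
    }
}
```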

A plain C implementation of transpose8x8 relies on the fact that all the bits on any diagonal line parallel to the leading diagonal move the same distance up/down and left/right. For example, all the bits just above the leading diagonal have to move one place left and one place down, i.e. 7 bits to the right in the packed 64-bit word. This leads to an algorithm like this:

transpose8x8(word) {
  return (word & 0x0100000000000000) >> 49 // top right corner
       | (word & 0x0201000000000000) >> 42
       | (word & 0x0402010000000000) >> 35
       | (word & 0x0804020100000000) >> 28
       | (word & 0x1008040201000000) >> 21
       | (word & 0x2010080402010000) >> 14
       | (word & 0x4020100804020100) >> 7  // just above diagonal
       | (word & 0x8040201008040201)       // leading diagonal
       | (word & 0x0080402010080402) << 7  // just below diagonal
       | (word & 0x0000804020100804) << 14
       | (word & 0x0000008040201008) << 21
       | (word & 0x0000000080402010) << 28
       | (word & 0x0000000000804020) << 35
       | (word & 0x0000000000008040) << 42
       | (word & 0x0000000000000080) << 49; // bottom left corner
}

This runs about 10x faster than the previous implementation, which copied each bit individually from the source byte in memory and merged it into the destination byte in memory.

Alternatively, if you have PDEP and PEXT instructions you can implement a perfect shuffle, and use that to do the transpose as mentioned in Hacker's Delight. This is significantly faster (but I don't have timings handy):

shuffle(word) {
    return pdep(word >> 32, 0xaaaaaaaaaaaaaaaa) | pdep(word, 0x5555555555555555);
} // outer perfect shuffle

transpose8x8(word) { return shuffle(shuffle(shuffle(word))); }

POWER's vgbbd instruction effectively implements the whole of transpose8x8 in a single instruction (and since it's a 128-bit vector instruction it does it twice, independently, on the low 64 bits and the high 64 bits). This gave about 15% speed-up over the plain C implementation. (Only 15% because, although the bit twiddling is much faster, the overall run time is now dominated by the time it takes to load 8 bytes and assemble them into the argument to transpose8x8, and to take the result and store it as 8 separate bytes.)

Affect answered 8/12, 2016 at 18:58 Comment(3)
What are PDEP and PEXT instructions? What is POWER? Can you please post your full implementation that I can benchmark against my implementation?Guck
PDEP and PEXT areAffect
@Guck POWER is the ISA used on modern IBM mainframes (in addition to Z/Architecture). It's based on PowerPC. This instruction is not available on x86 or ARM computers.Whitmire

My suggestion is that you don't do the transposition; rather, you add one bit of information to your matrix data, indicating whether the matrix is transposed or not.

Now, if you want to multiply a transposed matrix by a vector, this is the same as multiplying the matrix from the left by the (row) vector, and then transposing the result. This is easy: just some XOR operations on your 8-bit numbers.

This however makes some other operations complicated (e.g. adding two matrices). But in the comment you say that multiplication is exactly what you want to optimize.
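A sketch of that multiplication (my own helper; row 0 assumed at the vector's MSB): since (M^T v)^T = v^T M over GF(2), multiplying the transpose by v is just XORing together the rows of M whose corresponding bit of v is set.

```c
#include <stdint.h>

/* Multiply M^T by an 8-bit vector v without materializing the
 * transpose: XOR together the rows of M selected by the bits of v.
 * Helper name and row-0-at-MSB packing are my assumptions. */
static uint8_t mul_transposed(const uint8_t m[], int nrows, uint8_t v)
{
    uint8_t acc = 0;
    for (int i = 0; i < nrows; ++i)
        if (v & (0x80u >> i))   /* vector bit for row i */
            acc ^= m[i];        /* GF(2) accumulation */
    return acc;
}
```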

Erfurt answered 31/7, 2015 at 13:1 Comment(2)
Sorry, but this is not an acceptable solution. I need to actually transpose the matrix because I need it at the output of my interleaver. :) Regardless, I would welcome an example on how to perform the vector multiplication without actually transposing the matrix.Guck
This is clever (hence +1), however my use case requires computing A*T(A) ;/Taconite

My suggestion would be to use a lookup table to speed up the processing.

Another thing to note is that with the current definition of your matrix, the maximum size will be 8x8 bits. This fits into a uint64_t, so we can use that to our advantage, especially when using a 64-bit platform.

I have worked out a simple example using a lookup table, which you can find below and run with this online compiler: http://www.tutorialspoint.com/compile_cpp11_online.php

Example code

#include <iostream>
#include <bitset>
#include <stdint.h>
#include <assert.h>

using std::cout;
using std::endl;
using std::bitset;

/* Static lookup table */
static uint64_t lut[256];

/* Helper function to print array */
template<int N>
void print_arr(const uint8_t (&arr)[N]){
    for(int i=0; i < N; ++i){
        cout << bitset<8>(arr[i]) << endl;
    }
}

/* Transpose function */

template<int N>
void transpose_bitmatrix(const uint8_t (&matrix)[N], uint8_t (&transposed)[8]){
    assert(N <= 8);

    uint64_t value = 0;
    for(int i=0; i < N; ++i){
        value = (value << 1) + lut[matrix[i]];
    }

    /* Ensure safe copy to prevent misalignment issues */
    /* Can be removed if input array can be treated as uint64_t directly */
    for(int i=0; i < 8; ++i){
        transposed[i] = (value >> (i * 8)) & 0xFF;
    }
}

/* Calculate lookup table */
void calculate_lut(void){
    /* For all byte values */
    for(uint64_t i = 0; i < 256; ++i){
        auto b = std::bitset<8>(i);
        auto v = std::bitset<64>(0);

        /* For all bits in current byte */
        for(int bit=0; bit < 8; ++bit){
            if(b.test(bit)){
                v.set((7 - bit) * 8);
            }
        }

        lut[i] = v.to_ullong();
    }
}

int main()
{
    calculate_lut();

    const uint8_t matrix[] = {
        0b01010101,
        0b00110011,
        0b00001111,
    };

    uint8_t transposed[8];

    transpose_bitmatrix(matrix, transposed);
    print_arr(transposed);

    return 0;
}

How it works

Your 3x8 matrix will be transposed to an 8x3 matrix, represented in an 8x8 array. The issue is that you want to convert bits from your "horizontal" representation to a vertical one, divided over several bytes.

As I mentioned above, we can take advantage of the fact that the output (8x8) will always fit into a uint64_t. We will use this to our advantage because we can use a uint64_t to write out the 8-byte array, but also to add, XOR, etc., because we can perform basic arithmetic operations on a 64-bit integer.

Each entry in your 3x8 matrix (input) is 8 bits wide. To optimize processing, we first generate a 256-entry lookup table (one entry for each possible byte value). Each entry is a uint64_t and contains a rotated version of the bits.

example:

byte = 0b01001111 = 0x4F
lut[0x4F] = 0x0001000001010101 = (uint8_t[]){ 0, 1, 0, 0, 1, 1, 1, 1 }

Now for the calculation:

For the calculations we use a uint64_t, but keep in mind that under the hood it represents a uint8_t[8] array. We simply shift the current value (starting with 0), look up the next byte, and add it to the current value.

The 'magic' here is that each byte of the uint64_t in the lookup table is either 1 or 0, so it only sets the least significant bit (of each byte). Shifting the uint64_t shifts each byte; as long as we make sure we do not do this more than 8 times, we can operate on each byte individually.

Issues

As someone noted in the comments, Transpose(Transpose(M)) != M here, so if you need that you will need some additional work.

Performance can be improved by mapping uint64_t's directly instead of uint8_t[8] arrays, since that omits the "safe copy" that prevents alignment issues.

Spriggs answered 31/7, 2015 at 13:58 Comment(3)
How big is the lookup table? I mean, in your code it's only 256 long, but not sure where that number comes from.Guck
the lookup table is 2kiB (2048 bytes). It has 256 entries of 64 bits (8 bytes). The reason for this is that you are using uint8_t, which is 8 bits --> 256 possible values (2^8), a uint8_t can contain 0 to 255. The limitation of this code is that it will only support 8x8 as a maximum size.Spriggs
The limitation is also the reason why this is able to work, a uint64_t has 64 bits, which, conveniently, equals 8x8 which enables us to use a uint64_t as an 8x8 bit matrix and perform some math and or bit operations on it.Spriggs

I have added a new answer instead of editing my original one to make this more visible (no comment rights, unfortunately).

In your own answer you add an additional requirement not present in the original question: it has to work on an ARM Cortex-M.

I did come up with an alternative solution for ARM in my original answer but omitted it, as it was not part of the question and seemed off topic (mostly because of the C++ tag).

ARM-specific solution for the Cortex-M:

Some or most Cortex-M3/M4 devices have a bit-banding region which can be used for exactly what you need: it expands bits into 32-bit fields, and the region can be used to perform atomic bit operations.

If you put your array in a bit-banded region, it will have an 'exploded' mirror in the bit-band region where you can use plain move operations on the bits themselves. If you write a loop, the compiler will surely be able to unroll and optimize it into just move operations.
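For reference, the ARMv7-M bit-band alias address is alias_base + byte_offset * 32 + bit_number * 4. The sketch below uses the standard SRAM bit-band region addresses from the architecture manual; the helper name is mine:

```c
#include <stdint.h>

/* ARMv7-M SRAM bit-band mapping: each bit of a byte in
 * 0x20000000..0x200FFFFF is mirrored as a full 32-bit word in the
 * alias region starting at 0x22000000. Writing 0 or 1 to the alias
 * word atomically clears or sets that single bit. */
#define BITBAND_SRAM_BASE  0x20000000u
#define BITBAND_ALIAS_BASE 0x22000000u

static inline uint32_t bitband_alias(uint32_t byte_addr, uint32_t bit)
{
    return BITBAND_ALIAS_BASE
         + (byte_addr - BITBAND_SRAM_BASE) * 32u
         + bit * 4u;
}
```

On a real part you would then access *(volatile uint32_t *)bitband_alias(addr, n) to read or write matrix bit n directly.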

If you really want to, you can even set up a DMA controller to process an entire batch of transpose operations with a bit of effort and offload it entirely from the CPU :)

Perhaps this might still help you.

Spriggs answered 7/8, 2015 at 12:39 Comment(3)
Thanks for your post! I was unaware of bit banding. To be honest, I'm not sure if it would speed up my method. Working with four packed 8-bit numbers at once seems superior to working with single bits.Guck
Anyway, you've got my upvote for both your answers, I appreciate taking your time to think about my problems. :)Guck
Yes, for the transpose itself, operating on multiple bits at once will of course be faster. I did want to mention the option because it might be useful for other parts like matrix multiplication, bit extraction / testing / setting, etc., since it operates on bits directly: you save all the shifting and masking operations, and it opens up a lot of other possibilities, such as my example of using the DMA for large batches :)Spriggs

This is a bit late, but I just stumbled across this exchange today. If you look at Hacker's Delight, 2nd Edition, there are several algorithms for efficiently transposing Boolean arrays, starting on page 141.

They are quite efficient: a colleague of mine obtained about a 10x speedup compared to naive coding, on x86.

Affect answered 21/11, 2016 at 20:26 Comment(9)
Could you be more specific? Usually on StackOverflow either you give a full answer or you give a link to the full answer.Guck
The answer in the book occupies several pages, so I was reluctant to...Affect
...reluctant to attempt to type it all in correctly, not to mention potential violation of the author's copyright.Affect
At least try to describe the basic thought of that algorithm, how and why it is better than the currently accepted answer, etc.Guck
Jay Foad (www.dyalog.com) kindly offered to let me post an emailAffect
...on the topic, describing an 8x8 transpose, but it scales to any power of 2. The key insight is this: "A plain C implementation of transpose8x8 relies on the fact that all the bits on any diagonal line parallel to the leading diagonal move the same distance up/down and left/right. For example, all the bits just above the leading diagonal have to move one place left and one place down, i.e. 7 bits to the right in the packed 64-bit word."Affect
Algorithm: transpose8x8(word) { return (word & 0x0100000000000000) >> 49 // top right corner | (word & 0x0201000000000000) >> 42 | ... | (word & 0x4020100804020100) >> 7 // just above diagonal | (word & 0x8040201008040201) // leading diagonal | (word & 0x0080402010080402) << 7 // just below diagonal | ... | (word & 0x0000000000008040) << 42 | (word & 0x0000000000000080) << 49; // bottom left corner }Affect
Jay also said that if you have PDEP and PEXT you can implement a perfect shuffle and do the transpose that way for a bit more speed. See Hacker's Delight for why this works: shuffle(word) { return pdep(word >> 32, 0xaaaaaaaaaaaaaaaa) | pdep(word, 0x5555555555555555); } // outer perfect shuffle transpose8x8(word) { return shuffle(shuffle(shuffle(word))); }Affect
I would suggest to add this to your answer instead of in the comments section. Also, does this perform better than the currently accepted answer?Guck

Here's what I posted on github (mischasan/sse2/ssebmx.src). Changing INP() and OUT() to use induction variables saves an IMUL each. AVX256 does it twice as fast. AVX512 is not an option, because there is no _mm512_movemask_epi8().

#include <stdint.h>
#include <emmintrin.h>

#define INP(x,y) inp[(x)*ncols/8 + (y)/8]
#define OUT(x,y) out[(y)*nrows/8 + (x)/8]

void ssebmx(char const *inp, char *out, int nrows, int ncols)
{
    int rr, cc, i, h;
    union { __m128i x; uint8_t b[16]; } tmp;

    // Do the main body in [16 x 8] blocks:
    for (rr = 0; rr <= nrows - 16; rr += 16)
        for (cc = 0; cc < ncols; cc += 8) {
            for (i = 0; i < 16; ++i)
                tmp.b[i] = INP(rr + i, cc);
            for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
                *(uint16_t*)&OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
        }

    if (rr == nrows) return;

    // The remainder is a row of [8 x 16]* [8 x 8]?

    //  Do the [8 x 16] blocks:
    for (cc = 0; cc <= ncols - 16; cc += 16) {
        for (i = 8; i--;)
            tmp.b[i] = h = *(uint16_t const*)&INP(rr + i, cc),
            tmp.b[i + 8] = h >> 8;
        for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
            OUT(rr, cc + i) = h = _mm_movemask_epi8(tmp.x),
            OUT(rr, cc + i + 8) = h >> 8;
    }

    if (cc == ncols) return;

    //  Do the remaining [8 x 8] block:
    for (i = 8; i--;)
        tmp.b[i] = INP(rr + i, cc);
    for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
        OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}

HTH.

Investiture answered 29/3, 2017 at 4:26 Comment(0)

Inspired by Robert's answer, polynomial multiplication in ARM NEON can be utilised to scatter the bits:

poly8x8_t transpose(poly8x8_t a) {
   for (int i = 0; i < 3; i++) {
     auto q = vmull_p8(a,a);
     auto top = vreinterpret_u8_p16(vget_high_p16(q));
     auto low = vreinterpret_u8_p16(vget_low_p16(q));
     low = vadd_u8(low, low); // shift left by 1
     low = vadd_u8(low, top); // interleave the bits
     a = vreinterpret_p8_u8(low);
   }
   return a;
}

A generic approach for 64-bit processors

// transpose  bits in 2x2 blocks, first 4 rows
//   x = a b|c d|e f|g h      a i|c k|e m|g o   | byte 0
//       i j|k l|m n|o p      b j|d l|f n|h p   | byte 1
//       q r|s t|u v|w x      q A|s C|u E|w G   | byte 2
//       A B|C D|E F|G H      r B|t D|v F|x H   | byte 3 ...
// ----------------------

auto a = (x & 0x00aa00aa00aa00aaull);
auto b = (x & 0x5500550055005500ull);
auto c = (x & 0xaa55aa55aa55aa55ull) | (a << 7) | (b >> 7);

// transpose 2x2 blocks (first 4 rows shown)
//   aa bb cc dd      aa ii cc kk
//   ee ff gg hh   -> ee mm gg oo
//   ii jj kk ll      bb jj dd ll
//   mm nn oo pp      ff nn hh pp

auto d = (c & 0x0000cccc0000ccccull);
auto e = (c & 0x3333000033330000ull);
auto f = (c & 0xcccc3333cccc3333ull) | (d << 14) | (e >> 14);

// Final transpose of 4x4 bit blocks

auto g = (f & 0x00000000f0f0f0f0ull);
auto h = (f & 0x0f0f0f0f00000000ull);
x = (f & 0xf0f0f0f00f0f0f0full) | (g << 28) | (h >> 28);
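The three fragments above can be composed into one function (the wrapper is mine; the masks and shifts are exactly those in the fragments, with the layout row-major from the most significant byte):

```c
#include <stdint.h>

/* Transpose an 8x8 bit matrix packed row-major, row 0 in the MSB. */
static uint64_t transpose8x8_generic(uint64_t x)
{
    /* 1: transpose bits within 2x2 blocks */
    uint64_t a = x & 0x00aa00aa00aa00aaull;
    uint64_t b = x & 0x5500550055005500ull;
    uint64_t c = (x & 0xaa55aa55aa55aa55ull) | (a << 7) | (b >> 7);

    /* 2: transpose 2x2 blocks within 4x4 blocks */
    uint64_t d = c & 0x0000cccc0000ccccull;
    uint64_t e = c & 0x3333000033330000ull;
    uint64_t f = (c & 0xcccc3333cccc3333ull) | (d << 14) | (e >> 14);

    /* 3: transpose the 4x4 blocks */
    uint64_t g = f & 0x00000000f0f0f0f0ull;
    uint64_t h = f & 0x0f0f0f0f00000000ull;
    return (f & 0xf0f0f0f00f0f0f0full) | (g << 28) | (h >> 28);
}
```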

In ARM each step can now be composed with 3 instructions:

auto tmp = vrev16_u8(x);
tmp = vshl_u8(tmp, plus_minus_1); // 0xff01ff01ff01ff01ull
x = vbsl_u8(mask_1, x, tmp);   // 0xaa55aa55aa55aa55ull

tmp = vrev32_u16(x);
tmp = vshl_u16(tmp, plus_minus_2); // 0xfefe0202fefe0202ull
x = vbsl_u8(mask_2, x, tmp);   // 0xcccc3333cccc3333ull

tmp = vrev64_u32(x);
tmp = vshl_u32(tmp, plus_minus_4); // 0xfcfcfcfc04040404ull
x = vbsl_u8(mask_4, x, tmp);   // 0xf0f0f0f00f0f0f0full

Alternatively, on Arm64 without NEON (and with limited support for arbitrary immediates), one can use:

uint64_t transpose(uint64_t x) {
    auto b = x ^ (x >> 7);
    b &= 0x00aa00aa00aa00aa;
    x ^= b;
    x ^= b << 7;

    b = x ^ (x >> 14);
    b &= 0x0000cccc0000cccc;
    x ^= b;
    x ^= b << 14;

    b = x ^ (x >> 28);
    b &= 0x00000000f0f0f0f0;
    x ^= b;
    x ^= (uint64_t)b << 28;
    return x;
}
Debauchee answered 31/5, 2021 at 10:51 Comment(0)
