bitwise operations on vector<bool>

What's the best way to perform bitwise operations on vector<bool>?

As I understand it, vector<bool> is a specialisation that uses one bit per boolean. I chose vector<bool> for memory-saving reasons. I know there are some problems with vector<bool>, but for my needs it's appropriate.

Now, what's the most performant way of applying bitwise operations to whole vectors like this?

If I do it in a for loop, reading out each single bool and storing it back, then as I understand it a lot of extra operations are performed internally just to access the actual values.

Thanks!

Forgetmenot answered 29/10, 2010 at 3:24 Comment(1)
+1 for getting the right answer by asking the wrong question. – Kaufmann

If the number of bits is fixed at compile time, you would be much better off using std::bitset.

If not (i.e. the number of bits varies at runtime), then you can use boost::dynamic_bitset.

With both of these, it is extremely easy to do all the bitwise operations.
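
For example, a minimal sketch of both approaches (the sizes and values here are purely illustrative, and the second half assumes Boost is available):

#include <bitset>
#include <iostream>
#include <boost/dynamic_bitset.hpp>

int main()
{
    // Size known at compile time: std::bitset
    std::bitset<64> a(0x0F0Full), b(0x00FFull);
    std::bitset<64> c = (a & b) | (a ^ b);    // whole-set bitwise operations
    std::cout << c.count() << " bits set\n";

    // Size known only at runtime: boost::dynamic_bitset
    boost::dynamic_bitset<> x(1000), y(1000);
    x[3] = true;
    y[3] = y[42] = true;
    x &= y;                                   // processed word-at-a-time internally
    std::cout << x.count() << " bits set\n";
}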

Brack answered 29/10, 2010 at 3:44 Comment(3)
+1 because I'd use these libraries if I decided I could take on the dependencies for the project I was working on. – Cheung
The provided link to the bitset documentation changes on occasion and is no longer valid. Here is a link to the current latest version, which will also eventually become invalid: gcc.gnu.org/onlinedocs/libstdc++/libstdc++-api-4.5/a00396.html – Lasseter
Bitwise ops on either of those do auto-vectorize with gcc and clang, doing 16 or 32 bytes at a time. They're very good: godbolt.org/g/cnRFEG (But boost::dynamic_bitset has a fairly poor .count(), using a byte-LUT even when hardware popcnt and AVX2 instructions are available. std::bitset has good .count() performance, too.) – Joyous

Ignoring the title of your question, let's answer this question instead:

What's the best way to perform bitwise operations on a vector?

The best way is to define your vector as vector<unsigned char> (or vector<uint32_t>, or whichever other integer type you choose), and do your bitwise operations as you normally would for an array of unsigned integers. Things will be much faster this way, and there will be no hidden mechanism.

You can use division (or bitwise operators, if you're slick) to resolve which array index you need to operate against, and for-loops to apply bitwise operations larger than a single element.

Here's a related question: Bit twiddling a lot of bits in C

You will basically be doing these same operations, if and when you decide to wrap vector<unsigned some-int-type> with your own operators.
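
For instance, here is a rough sketch of the kind of wrapper described above (the class name and interface are hypothetical, just for illustration; a real version would add operator&=, operator^=, and so on in the same pattern):

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

class BitVector {
    std::vector<std::uint8_t> words_;   // 8 bits packed per element
    std::size_t nbits_ = 0;
public:
    explicit BitVector(std::size_t nbits)
        : words_((nbits + 7) / 8), nbits_(nbits) {}

    // i >> 3 picks the byte, i & 7 picks the bit within it.
    bool get(std::size_t i) const { return (words_[i >> 3] >> (i & 7)) & 1u; }
    void set(std::size_t i, bool v) {
        if (v) words_[i >> 3] |= std::uint8_t(1u << (i & 7));
        else   words_[i >> 3] &= std::uint8_t(~(1u << (i & 7)));
    }

    // Bitwise OR of whole vectors, one byte (8 bits) per loop iteration.
    BitVector& operator|=(const BitVector& other) {
        if (nbits_ != other.nbits_)
            throw std::invalid_argument("differently sized bitwise operands");
        for (std::size_t w = 0; w < words_.size(); ++w)
            words_[w] |= other.words_[w];
        return *this;
    }
};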

Cheung answered 29/10, 2010 at 3:30 Comment(5)
So if I still want the possibility to access my boolean values directly, I actually have to encapsulate vector<unsigned char>, do the mechanics vector<bool> does myself, and additionally provide functionality for bitwise operations? – Forgetmenot
@Mat: You don't have to do anything, but to get the best performance, I do suggest doing what you just described. Wrap the whole thing in a class, and implement overloaded operators against it. Then you can optimize to your heart's content. It really isn't very complicated code to implement, and it gives you the right balance of abstraction and performance. – Cheung
Actually, I just wanted to know whether this functionality maybe already exists, because it seems logical to me that this is something you would want to do with bit vectors. – Forgetmenot
@Mat: I think I've read that people regard the specialization on vector<bool> as having been a bad idea, so the standards committee might have decided not to throw good money after bad. I wouldn't be too surprised if there was a third-party library, but there is nothing standard. – Cheung
@Mat: If you can use a third-party library, go for ArunSaha's answer :) – Cheung

I read both these answers, but just wanted a quick solution, and implemented something horrible.

You can make the bitwise operators work on vector<bool>, but the code has to be specialized for the C++ standard library implementation, or fall back to the slow form. Here's my operator| for GNU libstdc++-v3:

#include <stdexcept>
#include <vector>

std::vector<bool> operator|(std::vector<bool> A, const std::vector<bool>& B)
{
    if (A.size() != B.size())
        throw std::invalid_argument("differently sized bitwise operands");

    std::vector<bool>::iterator itA = A.begin();
    std::vector<bool>::const_iterator itB = B.begin();

    // c++ implementation-specific: _M_p is the iterator's pointer to the
    // underlying storage word in libstdc++
    while (itA < A.end())
        *(itA._M_p ++) |= *(itB._M_p ++); // word-at-a-time bitwise operation

    return A;
}

This is of course pretty bad. Somebody updates GCC, the new version stores things differently, and your code breaks for no apparent reason.
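
For comparison, the portable "slow form" mentioned above could look something like this sketch (the function name is arbitrary); it only uses the public interface and processes one bit per iteration:

#include <cstddef>
#include <stdexcept>
#include <vector>

std::vector<bool> or_portable(std::vector<bool> A, const std::vector<bool>& B)
{
    if (A.size() != B.size())
        throw std::invalid_argument("differently sized bitwise operands");

    for (std::size_t i = 0; i < A.size(); ++i)
        A[i] = A[i] | B[i];    // one proxy read/write per bit

    return A;
}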

Cuttie answered 28/8, 2011 at 11:52 Comment(9)
This unfortunately doesn't auto-vectorize with gcc or clang for x86-64, with -O3 -march=haswell. godbolt.org/g/eJnfKp. It should be using AVX2 to vpor 32 bytes at a time, instead of only 8 at a time with scalar or. Boost's dynamic_bitset<> does auto-vectorize for a & b. But unfortunately boost's .count() function isn't very efficient. (But much better than std::count on vector<bool>, unless you use clang -stdlib=libc++, in which case that vectorizes with AVX2.) – Joyous
Anyway, I haven't found a way to get good code for a popcount of the result of a boolean op between two dynamically-sized bit-vectors, other than manually vectorizing it, of course. – Joyous
@PeterCordes - clang can be coaxed into producing some pretty good code for bit-ops (and in this example) using a temporary array on the stack. Storing to the stack and then reloading isn't actually a terrible way to get 4 values from a ymm into a place where they can be used by popcnt. The mov r9, rcx; or r9, 8 ops interspersed with the vectorized and are a bit silly and may cost a little bit of throughput (why didn't they just use an offset in the addressing modes, or at least do it with a single add?). – Barytone
They don't unroll the popcnt enough to avoid the latency due to the false dep (and they add an extra cycle due to the way they accumulate it), but it probably gets to better than 50% of the performance of a hand-tuned version, which is pretty good. For the hand-tuned one, it isn't clear to me whether a pshufb nibble-counting approach is better or a popcnt one, or perhaps a combination of the two. Probably the former, since the output is already in a ymm. – Barytone
@BeeOnRope: On Haswell/Skylake, AVX2 unpacking to nibbles then vpshufb -> vpaddb is faster than scalar popcnt for large arrays. (Every 255 or fewer iterations, use vpsadbw to hsum to 64-bit.) Especially when you want popcnt(A[] & B[]), AVX2 is a big win. But getting auto-vectorized C++ to compile to something that does vpand->shift/mask->2xvpshufb->vpaddb without a store/reload between the AND and popcount may not be possible. – Joyous
@BeeOnRope: This blog post has HSW and SKL numbers for AVX2 vpshufb vs. popcnt on various array sizes, for just the popcount problem, not the AND of two bitmaps on the fly. Using that guy's hand-written asm, AVX2 is 25% faster for large buffers. – Joyous
For what it's worth, clang does change __builtin_popcount to an AVX version in some cases, but not in the code I linked above. It's not totally surprising that the AVX2 popcnt is faster, since it has a max throughput of 128 bits per cycle, limited by the single pshufb (32 nibbles), versus 64 bits per cycle for popcnt, limited by it only issuing on one port. Of course, on Ryzen this is reversed and popcnt can issue 4-wide for 256 bits per cycle. So the clang code probably isn't too terrible there (and Ryzen may not have the false dep). @PeterCordes – Barytone
@BeeOnRope: Ryzen can still only do 2 loads per cycle, so the trick is keeping it fed (especially if you want to popcount the AND of two bitmaps). Maybe with a mix of scalar loads and movdqa->movq+pextrq? Hmm, pextrq is 2 uops. Maybe just a hybrid strategy of popcnt + AVX2, interleaved in the right ratio to max out ALU throughput with AVX2, and max out load throughput with popcnt rax, [mem] + add rdx, rax. – Joyous
@PeterCordes - right, good point. Well, we can say that at least popcnt can easily process 128 bits/cycle from memory at the cost of 2 ops, so it's probably always better than a pure AVX2 approach (except when the inputs are already in AVX2 regs), but you might be able to squeeze out a bit more throughput with a hybrid approach. – Barytone

This one should work too.

// assuming v1 and v2 are std::vector<bool> of equal size
// (needs <algorithm> and <functional>)
std::vector<bool> v3(v1.size());
std::transform(v1.begin(), v1.end(),
               v2.begin(), v3.begin(), std::logical_and<bool>());
Transudate answered 25/3, 2012 at 3:26 Comment(3)
This will be incredibly inefficient because each bit needs to be extracted from its word, then transformed, then inserted back into the word. The solutions that operate on a word at a time will be much more efficient. – Pontoon
@Joel: Standard libraries can and should specialize their std algorithm templates for vector<bool>. libc++ does for some things (e.g. std::count on vector<bool> auto-vectorizes to very good AVX2 code). But in this case, clang and gcc with libc++ and libstdc++ both make terrible code, looping a bit at a time and not even doing the AND efficiently. (Testing the bit separately in each input, and using conditional branches! godbolt.org/g/4MDd8v) (Even if you use std::bit_and instead of logical_and.) – Joyous
Anyway, this answer isn't technically bad, it just exposes MASSIVE missed optimizations by compilers and standard library writers. Which means in practice you shouldn't use it yet. – Joyous
