"Isolate" specific Row/Column/Diagonal from a 64-bit number

OK, let's consider a 64-bit number, with its bits forming an 8x8 table.

E.g.

0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 1 0 1 1 1 1 0 1 0 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0

written as

a b c d e f g h
----------------
0 1 1 0 1 0 1 0
0 1 1 0 1 0 1 1 
0 1 1 1 1 0 1 0 
0 1 1 0 1 0 1 0 
1 1 1 0 1 0 1 0 
0 1 1 0 1 0 1 0 
0 1 1 0 1 1 1 0 
0 1 1 0 1 0 1 0

Now, what if we want to isolate JUST e.g. column d (reading top to bottom: 00100000), or any row/diagonal for that matter?

Can this be done? And if so, how?
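
(For comparison: a single row is trivial to isolate, since its bits are contiguous. A minimal sketch, assuming the top row of the table sits in the most significant byte:)

#include <stdint.h>

// Rows are contiguous, so a shift and an 8-bit truncation are enough.
// Columns and diagonals are the interesting cases, since their bits are
// spread 8 (or 7/9) positions apart.
uint8_t get_row(uint64_t board, int row) {
    return (uint8_t)(board >> (56 - 8 * row));   // row 0 = top row
}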


HINTS :

  • (a) My main objective here - though not initially mentioned - is raw speed. I'm searching for the fastest algorithm around, since the "retrieval" function is performed some millions of times per second.

  • (b) This is what comes closest to what I mean: https://www.chessprogramming.org/Kindergarten_Bitboards

Corvus answered 26/1, 2013 at 14:25 Comment(9)
Did you really just ask "can it be done"? Do you seriously expect the answer to be "no, this is completely impossible"?Koss
@KerrekSB Nope, I'm actually asking how... lol.Corvus
Of course it can be done. You do this with a bit of shifting, AND and OR. Depending on exactly what you actually want to do [or is this a school exercise?], there may be different types of "clever tricks" that can be applied, but shifts, ANDs and ORs are probably what you want. Do you want me to post an answer with that?Imogen
This is just a truth table...Cochleate
I imagine you could have a function get(board, row, col) returning the relevant bit. Then you could call it in a loop and combine the results into a single byte. Would this be in some way unsatisfactory?Epimenides
@Epimenides Yep, it is. My objective is to use no loops at all; I'm actually looking for the fastest possible way.Corvus
@MatsPetersson This is nothing like homework. And shifts, etc are ok. However, I'm looking for the fastest possible solution. (this function is going to be performed some millions of times per second, so speed is crucial ;-))Corvus
You need to be more clear. When you say "isolate" do you want (1) the 8 selected bits in the low 8 bits of the result or (2) the 8 bits in their original positions and all the non-selected bits 0? When I used to play with the chess 0.5 program it used the second meaning.Linneman
@Dr.Kameleon: Loops and the fastest possible are not necessarily mutually exclusive. See my answer.Epimenides

Here's a solution with only 4 main steps:

const uint64_t column_mask = 0x8080808080808080ull;
const uint64_t magic = 0x2040810204081ull;

int get_col(uint64_t board, int col) {
    uint64_t column = (board << col) & column_mask;
    column *= magic;
    return (column >> 56) & 0xff;
}

It works like this:

  • the board is shifted to align the column with the left side
  • it's masked to only contain the required column (0..7)
  • it's multiplied by a magic number which results in all the original bits pushed to the left side
  • the left-most byte is shifted to the right

The magic number is chosen to copy only the needed bits and let the rest fall into unused positions or overflow off the top of the number. The process looks like this (digits are bit "IDs", rather than the number itself):

original column: ...1.......2.......3.......4.......5.......6.......7.......8....
aligned column:  1.......2.......3.......4.......5.......6.......7.......8.......
multiplied:      123456782345678.345678..45678...5678....678.....78......8.......
shifted to right:........................................................12345678
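
A small self-contained check of the routine (a sketch, not part of the original answer), using the question's example board and assuming bit 63 holds file a of the top row:

#include <stdint.h>
#include <stdio.h>

static const uint64_t column_mask = 0x8080808080808080ull;
static const uint64_t magic = 0x2040810204081ull;

static int get_col(uint64_t board, int col) {
    uint64_t column = (board << col) & column_mask; // align the column with bits 63, 55, ..., 7
    column *= magic;                                // gather those 8 bits into the top byte
    return (column >> 56) & 0xff;
}

int main(void) {
    uint64_t board = 0x6A6B7A6AEA6A6E6Aull;         // example board, top row in the high byte
    for (int col = 0; col < 8; ++col)
        printf("column %d: 0x%02x\n", col, get_col(board, col));
    return 0;                                       // column 3 (d) prints 0x20, i.e. 00100000
}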

If you add the const keywords, assembly becomes quite nice actually:

get_col:
.LFB7:
        .cfi_startproc
        movl    %esi, %ecx
        movabsq $-9187201950435737472, %rax
        salq    %cl, %rdi
        andq    %rax, %rdi
        movabsq $567382630219905, %rax
        imulq   %rax, %rdi
        shrq    $56, %rdi
        movl    %edi, %eax
        ret

No branching, no external data, around 0.4ns per calculation.

Edit: it takes around a sixth of the time of NPE's solution (the next fastest one).

Retread answered 26/1, 2013 at 16:46 Comment(5)
(+1) Beautiful. Would you mind explaining the process for arriving at the magic number? Thanks.Epimenides
It involved some Python for inspecting results and trying to "visually" arrive at the right solution. Basically I thought about binary multiplication as "shift+copy, repeat for each bit set". Then bit "1" needs magic="1", bits "12" need magic="10000001", etc.; the pattern just repeats. The magic number is 10000001000000100000010000001000000100000010000001.Retread
That's what I call a great answer. Thanks to everyone who spend time on my question! :-)Corvus
@viraptor: I liked your multiplication trick so much that I've posted a separate question about it: #14547587Epimenides
This can be done "magic bitboard" style where the first shift is omitted, and you only do a mask, multiply, and shift. You'd have to store several masks and magic multipliers, however. So it's a trade off, but it's one that appears to be faster on most machines without PEXT. (The PEXT solution is by far the fastest.)Diu
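
To make the pattern viraptor describes in the comment above concrete, here is a small sketch (not from the original thread) that builds the same constant:

#include <stdint.h>

// The magic is simply a 1-bit every 7 positions: the sum of 2^(7*k)
// for k = 0..7, which is 0x0002040810204081 (= the 0x2040810204081ull above).
uint64_t make_column_magic(void) {
    uint64_t magic = 0;
    for (int k = 0; k < 8; ++k)
        magic |= 1ull << (7 * k);
    return magic;
}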

Right, so to "settle" the debate as to which is faster/slower/etc, I've put all the code into one program [and I hope I've credited the right person for the right code-snippet].

The code can be found below, so you can check that I've interpreted each snippet correctly when turning it into a function. I ran it and checked that each function gives the same result [bearing in mind that the bit order is slightly different in some cases - so I made a variation of my code that runs the other way, just to see that it gives the "right" result]. So without further ado, here are the results:

mats1 time in clocks per iteration 10.3457
mats2 time in clocks per iteration 10.4785
mats3 time in clocks per iteration 10.5538
viraptor time in clocks per iteration 6.24603
lemees time in clocks per iteration 14.4818
npe time in clocks per iteration 13.1455
alex time in clocks per iteration 24.8272

(viraptor's results from core i5, g++ 4.7)

mats1 time in clocks per iteration 7.62338
mats2 time in clocks per iteration 7.36226
mats3 time in clocks per iteration 7.45361
viraptor time in clocks per iteration 2.09582
lemees time in clocks per iteration 9.43744
npe time in clocks per iteration 7.51016
alex time in clocks per iteration 19.3554

(viraptor's results from core i5, clang++ 3.2)

mats1 time in clocks per iteration 12.956
mats2 time in clocks per iteration 13.4395
mats3 time in clocks per iteration 13.3178
viraptor time in clocks per iteration 2.12914
lemees time in clocks per iteration 13.9267
npe time in clocks per iteration 16.2102
alex time in clocks per iteration 13.8705

That's clock cycles on a 3.4GHz AMD Athlon II - I don't have a modern Intel machine - if someone wishes to run the code on one, I'd be interested to see what it looks like. I'm fairly sure all of it runs well within the cache - perhaps aside from fetching some of the input values.

So, the winner is clearly viraptor, by about 40% - "my" code is second. Alex's code doesn't have any jumps/branches, but it still appears to run slower than the other alternatives. I'm not sure why npe's results are that much slower than mine - it does almost the same thing (and the code looks very similar in the assembler output from g++).

#include <iostream>
#include <fstream>
#include <cstdint>

using namespace std;

const int SIZE = 1000000;

uint64_t g_val[SIZE];

ofstream nulloutput;

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

#define BITA_TO_B(x, a, b) (((x) >> (a-b)) & (1 << b))

unsigned char get_col_mats1(uint64_t val, int col)
{
    return BITA_TO_B(val, 56+col, 7) |
    BITA_TO_B(val, 48+col, 6) |
    BITA_TO_B(val, 40+col, 5) |
    BITA_TO_B(val, 32+col, 4) |
    BITA_TO_B(val, 24+col, 3) |
    BITA_TO_B(val, 16+col, 2) |
    BITA_TO_B(val, 8+col, 1) |
    BITA_TO_B(val, 0+col, 0);
}

unsigned char get_col_mats2(uint64_t val, int col)
{
    return BITA_TO_B(val, 63-col, 7) |
    BITA_TO_B(val, 55-col, 6) |
    BITA_TO_B(val, 47-col, 5) |
    BITA_TO_B(val, 39-col, 4) |
    BITA_TO_B(val, 31-col, 3) |
    BITA_TO_B(val, 23-col, 2) |
    BITA_TO_B(val, 15-col, 1) |
    BITA_TO_B(val, 7-col, 0);
}


unsigned char get_col_viraptor(uint64_t board, int col) {
    const uint64_t column_mask = 0x8080808080808080ull;
    const uint64_t magic = 0x2040810204081ull ;
    uint64_t column = board & (column_mask >> col);
    column <<= col;
    column *= magic;
    return (column >> 56) & 0xff;
}


unsigned char get_col_alex(uint64_t bitboard, int col)
{
    unsigned char result = 0;
    result |= (bitboard & (1ULL << (63 - col))) ? 0x80 : 0;
    result |= (bitboard & (1ULL << (55 - col))) ? 0x40 : 0;
    result |= (bitboard & (1ULL << (47 - col))) ? 0x20 : 0;
    result |= (bitboard & (1ULL << (39 - col))) ? 0x10 : 0;
    result |= (bitboard & (1ULL << (31 - col))) ? 0x08 : 0;
    result |= (bitboard & (1ULL << (23 - col))) ? 0x04 : 0;
    result |= (bitboard & (1ULL << (15 - col))) ? 0x02 : 0;
    result |= (bitboard & (1ULL << (7 - col)))  ? 0x01 : 0;

    return result;
}

unsigned char get_col_lemees(uint64_t val, int column)
{
    int result = 0;
    int source_bitpos = 7 - column; // "point" to last entry in this column
    for (int target_bitpos = 0; target_bitpos < 8; ++target_bitpos)
    {
        bool bit = (val >> source_bitpos) & 1;  // "extract" bit
        result |= bit << target_bitpos;         // add bit if it was set
        source_bitpos += 8;                     // move one up in table
    }
    return result;
}

int get(uint64_t board, int row, int col) {
  return (board >> (row * 8 + col)) & 1;
}

uint8_t get_col_npe(uint64_t board, int col) {
  uint8_t ret = 0;
  for (int i = 0; i < 8; ++i) {
    ret = (ret << 1) + get(board, i, col);
  }
  return ret;
}



#define BITA_TO_B2(x, a, b) (((x) >> (a-b)) & (1 << b))

unsigned char get_col_mats3(uint64_t val, int col)
{
    return BITA_TO_B2(val, 63-col, 7) |
    BITA_TO_B2(val, 55-col, 6) |
    BITA_TO_B2(val, 47-col, 5) |
    BITA_TO_B2(val, 39-col, 4) |
    BITA_TO_B2(val, 31-col, 3) |
    BITA_TO_B2(val, 23-col, 2) |
    BITA_TO_B2(val, 15-col, 1) |
    BITA_TO_B2(val, 7-col, 0);
}

template<unsigned char (*f)(uint64_t val, int col)>
void runbench(const char *name)
{
    unsigned char col[8]  = {0};
    uint64_t t = rdtsc();
    for(int j = 0; j < SIZE; j++)
    {
        uint64_t val = g_val[j];
        for(int i = 0; i < 8; i++)
        {
            col[i] += f(val, i);
        }
//      __asm__ __volatile__("":::"memory");
    }
    t = rdtsc() - t;
    for(int i = 0; i < 8; i++)
    {
        nulloutput << "col " << i << " has bits " << hex << (int)col[i] << endl;
    }
    cout << name << " time in clocks per iteration " << dec << t / (8.0 * SIZE) << endl;
}

#define BM(name) void bench_##name() { runbench<get_col_##name>(#name); }

BM(mats1);
BM(mats2);
BM(mats3);
BM(viraptor);
BM(lemees);
BM(npe);
BM(alex);

struct function
{
    void (*func)(void);
    const char *name;
};


#define FUNC(f) { bench_##f, #f }

function funcs[] = 
{
    FUNC(mats1),
    FUNC(mats2),
    FUNC(mats3),
    FUNC(viraptor),
    FUNC(lemees),
    FUNC(npe),
    FUNC(alex),
}; 


int main()
{
    unsigned long long a, b;
    int i;
    int sum = 0;

    nulloutput.open("/dev/null");
    for(i = 0; i < SIZE; i++)
    {
        g_val[i] = rand() + ((uint64_t)rand() << 32);
    }
    unsigned char col[8];

    for(i = 0; i < sizeof(funcs)/sizeof(funcs[0]); i++)
    {
        funcs[i].func();
    }
}
Imogen answered 26/1, 2013 at 18:13 Comment(4)
I was going to try running it with clang as well, but for some reason clang doesn't like iostream and crashes. It wasn't doing that yesterday! :(Imogen
Thanks for collecting the solutions. Hope you don't mind the edit - added results from core i5 gcc and clang.Retread
Good. Interesting that clang isn't as good as gcc - most people seem to like clang...Imogen
Ah, I would have templatised my implementation... I will benchmark on Wednesday.Cochleate

Code it up with straightforward loops and let the optimizer do the inlining and loop unrolling for you.

Compiled with g++ 4.7.2 and -O3, on my box the following can perform about 300 million get_col() calls per second.

bitboard.cpp:

#include <cinttypes>
#include <iostream>

int get(uint64_t board, int row, int col) {
  return (board >> (row * 8 + col)) & 1;
}

uint8_t get_col(uint64_t board, int col) {
  uint8_t ret = 0;
  for (int i = 0; i < 8; ++i) {
    ret = (ret << 1) + get(board, i, col);
  }
  return ret;
}

extern uint64_t board;
extern int sum;

extern void f();

int main() {
  for (int i = 0; i < 40000000; ++i) {
    for (int j = 0; j < 8; ++j) {
      sum += get_col(board, j);
    }
    f();
  }
  std::cout << sum << std::endl;
}

bitboard_b.cpp:

#include <cinttypes>

uint64_t board = 0x1234567890ABCDEFull;
int sum = 0;

void f() {}

If you look at the assembly code for get_col(), you'll see that it contains zero loops and is probably as efficient as anything you're likely to handcraft:

__Z7get_colyi:
LFB1248:
        movl    %esi, %ecx
        movq    %rdi, %rax
        movq    %rdi, %rdx
        shrq    %cl, %rax
        leal    8(%rsi), %ecx
        andl    $1, %eax
        shrq    %cl, %rdx
        leal    16(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    24(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    32(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    40(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %edx
        movq    %rdi, %rax
        shrq    %cl, %rax
        leal    48(%rsi), %ecx
        andl    $1, %eax
        leal    (%rax,%rdx,2), %edx
        movq    %rdi, %rax
        shrq    %cl, %rax
        leal    56(%rsi), %ecx
        andl    $1, %eax
        leal    (%rax,%rdx,2), %eax
        shrq    %cl, %rdi
        andl    $1, %edi
        leal    (%rdi,%rax,2), %eax
        ret

This is not meant as a complete implementation, just a rough illustration of the idea. In particular, the ordering of bits may be the opposite of what you expect, etc.
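
If the ordering comes out opposite to what you want, the loop direction can simply be reversed; a sketch, reusing the same get() helper:

uint8_t get_col_reversed(uint64_t board, int col) {
  uint8_t ret = 0;
  for (int i = 7; i >= 0; --i) {
    // walking the rows in the other direction reverses the bit order
    ret = (ret << 1) + get(board, i, col);
  }
  return ret;
}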

Epimenides answered 26/1, 2013 at 14:46 Comment(7)
What is this NOOP f() call for? Prevent optimization?Douche
@leemes: Yes. Otherwise gcc optimizes the 40000000 loop out by multiplying the result of the inner loop by 40000000.Epimenides
Is ret = (ret << 1) + ... as fast as ret |= ... << i? Can you please compare, now that you already did some little benchmark here ;)Douche
@Epimenides Please, have a look at my last edit (in the original post), to see what I mean by speed.Corvus
@Dr.Kameleon: Just to be clear, you are saying that 300 million calls per second is insufficiently fast? In that case you need to clearly state what level of performance you're seeking.Epimenides
@AlexChamberlain: Please show us how and please include some benchmarks. The question specifically asks for the highest performance, and anything that's not been benchmarked is a pure speculation.Epimenides
@leemes: About 20% slower. Haven't analysed to see why.Epimenides

In your case (specialized for 8x8 = 64-bit tables), you need to use bit shifts to extract the specific bits and re-arrange them in a new integer variable, again using bit shifts:

uint64_t matrix = ... // input
int column = 3; // "d"-column

int result = 0;
int source_bitpos = 7 - column; // "point" to last entry in this column
for (int target_bitpos = 0; target_bitpos < 8; ++target_bitpos)
{
    bool bit = (matrix >> source_bitpos) & 1;  // "extract" bit
    result |= bit << target_bitpos;            // add bit if it was set
    source_bitpos += 8;                        // move one up in table
}

See here: http://ideone.com/UlWAAO

Douche answered 26/1, 2013 at 14:32 Comment(2)
Wouldn't bit masking and then shifting the whole lot be simpler?Thimblerig
@Thimblerig Since the matrix is given row-major and he asks about a column, you have to rearrange the bits. Or did I misinterpret it?Douche
#define BITA_TO_B(x, a, b) ((((x) >> (a)) & 1) << (b))

unsigned long long x = 0x6A6B7A6AEA6A6E6AULL;   // the example board, top row in the most significant byte

unsigned char col_d = BITA_TO_B(x, 60, 7) |
                      BITA_TO_B(x, 52, 6) |
                      BITA_TO_B(x, 44, 5) |
                      BITA_TO_B(x, 36, 4) |
                      BITA_TO_B(x, 28, 3) |
                      BITA_TO_B(x, 20, 2) |
                      BITA_TO_B(x, 12, 1) |
                      BITA_TO_B(x,  4, 0);

Perhaps a somewhat more optimized idea:

#define BITA_TO_B(x, a, b) (((x) >> (a-b)) & (1 << b))

If b is a constant, this will perform slightly better.

Another way may be:

unsigned long long xx = x & 0x1010101010101010ULL;   // keep column d's bit in every byte
col_d = (xx >> 53) | (xx >> 46) | (xx >> 39) ... (xx >> 4); 

Doing one "and" rather than many helps speed it up.
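
Spelled out in full for column d (a sketch of how I read the idea above: the single AND keeps bit 4 of every byte, and anything that lands above bit 7 is discarded by the 8-bit truncation):

#include <stdint.h>

unsigned char get_col_d(uint64_t x)
{
    uint64_t xx = x & 0x1010101010101010ull;   // bits 60, 52, ..., 4 (column d)
    return (unsigned char)((xx >> 53) | (xx >> 46) | (xx >> 39) | (xx >> 32) |
                           (xx >> 25) | (xx >> 18) | (xx >> 11) | (xx >> 4));
}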

Imogen answered 26/1, 2013 at 14:36 Comment(8)
Looks like we had similar ideas... I decided to use constants instead of shifting x as I thought it would perform slightly better? Thoughts?Cochleate
I was hoping the compiler would solve that. Of course one could optimise it to actually produce the right bit in the right place. Edit upcoming.Imogen
It's possible to do it with no shifts whatsoever.Cochleate
Is that some clever bitfiddling trick? [And not using "multiply" or "divide", which I'm pretty sure is worse than shifts!]Imogen
Uhm, yes, but it uses conditionals instead, which I'm not sure is a win.Imogen
@MatsPetersson: I urge both you and Alex instead of guessing to benchmark your solutions and post some hard performance figures.Epimenides
@Epimenides We're not guessing, we're using our experience to design the best solution. Besides... I haven't got access to a compiler atm.Cochleate
Who is downvoting this reply? Please state WHY!Imogen

You can transpose the number, and then simply select the relevant column, which is now a row, and therefore contiguous bits, as you wanted.

In my tests it wasn't much better than ORing together 8 individually selected bits, but it is much better if you intend to select multiple columns (since the transpose is the limiting factor).
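
A sketch of the idea (the transpose routine is a stand-in for any 8x8 bit-matrix transpose, such as the flipDiagA8H1 quoted in another answer; which byte ends up holding which former column depends on the flip convention, so treat the index below as an assumption to verify):

#include <stdint.h>

// 'transpose' stands in for some 8x8 bit-matrix transpose routine.
uint8_t get_col_via_transpose(uint64_t board, int col,
                              uint64_t (*transpose)(uint64_t)) {
    uint64_t t = transpose(board);
    // with a plain main-diagonal transpose and the top row in the high byte,
    // the former column 'col' is now the byte starting at bit 56 - 8*col
    return (uint8_t)(t >> (56 - 8 * col));
}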

Cereus answered 26/1, 2013 at 15:5 Comment(0)

Here's a solution that can execute once a cycle (if the value and mask are in registers), if you are willing to use the intrinsic for the PEXT instruction on Intel (and if you are doing bitboard stuff, you likely are):

int get_col(uint64_t board) {
    return _pext_u64(board, 0x8080808080808080ull);
}

That's for the 0th column - if you want another one, just shift the mask appropriately. Of course, this is cheating by using a hardware-specific instruction, but bitboards are all about cheating.
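
To handle an arbitrary column, it should be enough to shift the mask, e.g. (a sketch, same bit convention as above):

#include <stdint.h>
#include <immintrin.h>   // _pext_u64, needs a BMI2 CPU and -mbmi2 (or -march=haswell)

int get_col(uint64_t board, int col) {
    // shift the column-0 mask so it selects column 'col' instead
    return _pext_u64(board, 0x8080808080808080ull >> col);
}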

Oospore answered 26/10, 2016 at 4:33 Comment(0)

How about this...

uint64_t bitboard = ...;
uint8_t result = 0;

result |= (bitboard & (1ULL << 60)) ? 0x80 : 0;
result |= (bitboard & (1ULL << 52)) ? 0x40 : 0;
result |= (bitboard & (1ULL << 44)) ? 0x20 : 0;
result |= (bitboard & (1ULL << 36)) ? 0x10 : 0;
result |= (bitboard & (1ULL << 28)) ? 0x08 : 0;
result |= (bitboard & (1ULL << 20)) ? 0x04 : 0;
result |= (bitboard & (1ULL << 12)) ? 0x02 : 0;
result |= (bitboard & (1ULL <<  4)) ? 0x01 : 0;
Cochleate answered 26/1, 2013 at 14:38 Comment(0)

This one is from the Chess Programming Wiki. It flips the board about the a8-h1 antidiagonal (a form of transpose), after which isolating what used to be a single column is trivial. It also lets you go back the other way.

/**
 * Flip a bitboard about the antidiagonal a8-h1.
 * Square a1 is mapped to h8 and vice versa.
 * @param x any bitboard
 * @return bitboard x flipped about antidiagonal a8-h1
 */
U64 flipDiagA8H1(U64 x) {
   U64 t;
   const U64 k1 = C64(0xaa00aa00aa00aa00);
   const U64 k2 = C64(0xcccc0000cccc0000);
   const U64 k4 = C64(0xf0f0f0f00f0f0f0f);
   t  =       x ^ (x << 36) ;
   x ^= k4 & (t ^ (x >> 36));
   t  = k2 & (x ^ (x << 18));
   x ^=       t ^ (t >> 18) ;
   t  = k1 & (x ^ (x <<  9));
   x ^=       t ^ (t >>  9) ;
   return x;
}
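
U64 and C64 are the wiki's own conventions; a minimal standalone version might define them roughly like this (an assumption, not part of the original answer), after which the flipped column can be read out as a plain byte:

#include <stdint.h>

typedef uint64_t U64;     // assumed definition of the wiki's U64 type
#define C64(x) x##ULL     // assumed definition of the wiki's C64 constant macro

// Hypothetical usage: after flipDiagA8H1, what used to be a column of the
// original board occupies one contiguous byte of the result; which byte maps
// to which column depends on the a8-h1 convention, so verify against your layout.
uint8_t byte_of(U64 flipped, int byte_index) {
    return (uint8_t)(flipped >> (8 * byte_index));
}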
Steerage answered 8/4, 2014 at 11:57 Comment(0)
