[SOLVED] Rotate and reflect a 5x5 bitboard

Rotate and reflect a 5x5 bitboard

I am trying to find a fast way to rotate and reflect a 5x5 board to store it in a transposition table. The board is represented as a bitboard as they are very fast.

The bitboard is represented like this:

 20 21 22 23 24
 15 16 17 18 19
 10 11 12 13 14
 05 06 07 08 09
 00 01 02 03 04

I have have found some solutions for 8x8 bitboards https://www.chessprogramming.org/Flipping_Mirroring_and_Rotating but I can't find a solution that works for a 5x5 bitboard, I also have tried looping through all of the bits and switching them but that was a very slow solution. (C++)

Solution

Transpose could be done with 4 delta swaps, it does not decompose as nicely as an 8x8 transpose because 5 is not a power of two. 4 delta swaps would cost about 24 operations and about 20 cycles (less than 24 thanks to instruction level parallelism, but not a lot less than 24 because the code would be mostly serial due to dependencies).

That would look like this:

board = bit_permute_step(board, 0x00006300, 16);
board = bit_permute_step(board, 0x020a080a, 4);
board = bit_permute_step(board, 0x0063008c, 8);
board = bit_permute_step(board, 0x00006310, 16);

Where bit_permute_step is the familiar delta swap.

An alternative way is to use multiplication to extract columns. For example x & 0b0000100001000010000100001 selects one column, then we can choose a constant to multiply by such that the bits from the column line up in a contiguous sequence in the top 5 bits of a 32-bit word. The rest of the word would contain other stray copies of those bits, but they will be shifted out. For example (shown in C#, should be easy to port)

static uint Tranpose_5x5(uint x)
{
    uint r0 = ((x & 0b0000100001000010000100001) * 0b00001000_10001000_10001000_00000000) >> 27;
    uint r1 = ((x & 0b0001000010000100001000010) * 0b00000100_01000100_01000100_00000000) >> 27;
    uint r2 = ((x & 0b0010000100001000010000100) * 0b00000010_00100010_00100010_00000000) >> 27;
    uint r3 = ((x & 0b0100001000010000100001000) * 0b00000001_00010001_00010001_00000000) >> 27;
    uint r4 = ((x & 0b1000010000100001000010000) * 0b00000000_10001000_10001000_10000000) >> 27;
    return r0 | (r1 << 5) | (r2 << 10) | (r3 << 15) | (r4 << 20);
}

At 23 operations, this doesn't seem to have saved anything, but it could have a lower latency thanks to the increased opportunity for instruction level parallelism.

PEXT makes that nicer/simpler: (not tested)

int r0 = _pext_u32(board, 0b0000100001000010000100001);
int r1 = _pext_u32(board, 0b0001000010000100001000010);
int r2 = _pext_u32(board, 0b0010000100001000010000100);
int r3 = _pext_u32(board, 0b0100001000010000100001000);
int r4 = _pext_u32(board, 0b1000010000100001000010000);
return r0 | (r1 << 5) | (r2 << 10) | (r3 << 15) | (r4 << 20);

64-bit pext would enable extracting two columns at once, but we have to concatenate the board with itself first because pext cannot reorder: (also not tested)

uint64_t b2 = board | (uint64_t(board) << 25);
int r01 = _pext_u64(b2, 0b0001000010000100001000010'0000100001000010000100001);
int r23 = _pext_u64(b2, 0b0100001000010000100001000'0010000100001000010000100);
int r4 = _pext_u32(board, 0b1000010000100001000010000);
return r01 | (r23 << 10) | (r4 << 20);

Flipping horizontally or vertically would be easy with bit-group moving.