I am trying to find a fast way to rotate and reflect a 5x5 board to store it in a transposition table. The board is represented as a bitboard as they are very fast.
The bitboard is represented like this:
20 21 22 23 24
15 16 17 18 19
10 11 12 13 14
05 06 07 08 09
00 01 02 03 04
I have have found some solutions for 8x8 bitboards https://www.chessprogramming.org/Flipping_Mirroring_and_Rotating but I can't find a solution that works for a 5x5 bitboard, I also have tried looping through all of the bits and switching them but that was a very slow solution. (C++)
Transpose could be done with 4 delta swaps, it does not decompose as nicely as an 8x8 transpose because 5 is not a power of two. 4 delta swaps would cost about 24 operations and about 20 cycles (less than 24 thanks to instruction level parallelism, but not a lot less than 24 because the code would be mostly serial due to dependencies).
That would look like this:
board = bit_permute_step(board, 0x00006300, 16);
board = bit_permute_step(board, 0x020a080a, 4);
board = bit_permute_step(board, 0x0063008c, 8);
board = bit_permute_step(board, 0x00006310, 16);
Where bit_permute_step
is the familiar delta swap.
An alternative way is to use multiplication to extract columns. For example x & 0b0000100001000010000100001
selects one column, then we can choose a constant to multiply by such that the bits from the column line up in a contiguous sequence in the top 5 bits of a 32-bit word. The rest of the word would contain other stray copies of those bits, but they will be shifted out. For example (shown in C#, should be easy to port)
static uint Tranpose_5x5(uint x)
{
uint r0 = ((x & 0b0000100001000010000100001) * 0b00001000_10001000_10001000_00000000) >> 27;
uint r1 = ((x & 0b0001000010000100001000010) * 0b00000100_01000100_01000100_00000000) >> 27;
uint r2 = ((x & 0b0010000100001000010000100) * 0b00000010_00100010_00100010_00000000) >> 27;
uint r3 = ((x & 0b0100001000010000100001000) * 0b00000001_00010001_00010001_00000000) >> 27;
uint r4 = ((x & 0b1000010000100001000010000) * 0b00000000_10001000_10001000_10000000) >> 27;
return r0 | (r1 << 5) | (r2 << 10) | (r3 << 15) | (r4 << 20);
}
At 23 operations, this doesn't seem to have saved anything, but it could have a lower latency thanks to the increased opportunity for instruction level parallelism.
PEXT makes that nicer/simpler: (not tested)
int r0 = _pext_u32(board, 0b0000100001000010000100001);
int r1 = _pext_u32(board, 0b0001000010000100001000010);
int r2 = _pext_u32(board, 0b0010000100001000010000100);
int r3 = _pext_u32(board, 0b0100001000010000100001000);
int r4 = _pext_u32(board, 0b1000010000100001000010000);
return r0 | (r1 << 5) | (r2 << 10) | (r3 << 15) | (r4 << 20);
64-bit pext would enable extracting two columns at once, but we have to concatenate the board with itself first because pext cannot reorder: (also not tested)
uint64_t b2 = board | (uint64_t(board) << 25);
int r01 = _pext_u64(b2, 0b0001000010000100001000010'0000100001000010000100001);
int r23 = _pext_u64(b2, 0b0100001000010000100001000'0010000100001000010000100);
int r4 = _pext_u32(board, 0b1000010000100001000010000);
return r01 | (r23 << 10) | (r4 << 20);
Flipping horizontally or vertically would be easy with bit-group moving.