I have an algorithm that measures the width of each 8x8 cell in a 128x128 bitmap by counting the distance from the start of a cell to the first column within it containing only zeroes. If there is no such column, that cell is assigned a width of 8.
Here is the source code for the program:
#include <stdint.h>

uint8_t bitmap[128][128]; // all 8 bits are used, so 0..255
int widths[16][16];       // any integer type is fine, like uint8_t

void measure_cell(int i, int j)
{
    int x = i * 8;
    int x_end = x + 8;
    /* Horizontal sweep */
    while (x < x_end) {
        int y = j * 8;
        int y_end = y + 8;
        /* Vertical sweep */
        while (y < y_end && !bitmap[y][x])
            ++y;
        /* All-zero column? */
        if (y == y_end)
            break;
        ++x;
    }
    widths[j][i] = 8 - (x_end - x);
}

int main()
{
    /* Load bitmap from file */
    // ...

    /* Calculate widths for each cell */
    for (int j = 0; j < 16; ++j)
        for (int i = 0; i < 16; ++i)
            measure_cell(i, j);

    /* Print widths to stdout */
    // ...

    return 0;
}
Is there any way to speed this up using SIMD? Assume I am targeting x86-64.
Vertically OR 8 vectors of row data (so a byte of the result is 0 only if all bytes in that column were zero): _mm256_loadu_si256 and _mm256_or_si256. (Or aligned loads if you use alignas(32) on your array.)
If it's useful, you could instead use pmaxub (_mm_max_epu8) to get the largest value in each column of the cell, which is likewise 0 only if every byte in that column is 0. But if you're not going to use the max value for anything, just use bitwise OR.
Compare / movemask to get a bitmask of which columns were all-zero: _mm256_cmpeq_epi8(v, _mm256_setzero_si256()) / _mm256_movemask_epi8.
Split up that bitmask into 8-bit chunks (cell boundaries) and bit-scan (tzcnt / C++20 std::countr_zero() / C23 stdc_trailing_zeros) to find the position of the first all-zero column, i.e. the lowest set bit of the compare mask.
Since you want to clamp the result to 8, actually do stdc_trailing_zeros(mask | (1<<8)), or I guess stdc_trailing_zeros_uc, since unsigned char is 8 bits wide on x86 and the C23 functions return the type width for an all-zero input, giving you the clamp for free.
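Putting the three steps together, a minimal AVX2 sketch under the question's array layout (measure_cell_row_avx2 is a hypothetical name; _tzcnt_u32 needs BMI1, or substitute a compiler builtin since the operand is never zero here):

#include <immintrin.h>
#include <stdint.h>

extern uint8_t bitmap[128][128]; /* arrays from the question */
extern int widths[16][16];

/* Compute widths for all 16 cells of cell-row j, 4 cells per vector. */
static void measure_cell_row_avx2(int j)
{
    for (int chunk = 0; chunk < 4; ++chunk) {
        const uint8_t *row = &bitmap[j * 8][chunk * 32];

        /* Step 1: OR the 8 rows; byte n == 0 iff that column is all-zero. */
        __m256i acc = _mm256_loadu_si256((const __m256i *)row);
        for (int dy = 1; dy < 8; ++dy)
            acc = _mm256_or_si256(acc,
                _mm256_loadu_si256((const __m256i *)(row + dy * 128)));

        /* Step 2: bit n of mask is 1 iff column n of this stripe is all-zero. */
        uint32_t mask = (uint32_t)_mm256_movemask_epi8(
            _mm256_cmpeq_epi8(acc, _mm256_setzero_si256()));

        /* Step 3: per-cell tzcnt; the extra bit 8 clamps the result to 8. */
        for (int c = 0; c < 4; ++c) {
            uint32_t cell = (mask >> (8 * c)) & 0xFF;
            widths[j][chunk * 4 + c] = (int)_tzcnt_u32(cell | 0x100);
        }
    }
}

The loop in main would then call measure_cell_row_avx2(j) for j = 0..15 instead of measure_cell per cell.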
_mm256_cmpeq_epi8 is AVX2, for __m256i vectors, producing a 32-bit bitmask (4 cells wide). The same thing works with just SSE2 and __m128i vectors (16-bit mask = 2 cells).
If you have AVX-512, you could vectorize the bit-scan part, too. 63 - vplzcntq can be used as a tzcnt if you use a bithack to isolate the lowest set bit. (And special-case to get 64 instead of -1, I guess.) You don't need to booleanize into a vector mask (which would restrict you to 256-bit vectors, since AVX-512 can only compare into mask registers). Instead, just find the position of the lowest set bit in the 64-bit element and right-shift by 3 to get the index of the lowest non-zero byte; the shift conveniently discards the low 3 bits, which hold unwanted garbage (the bit index within that byte). A sketch of that building block is below.
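A minimal sketch of those two steps, assuming AVX512CD is available (the helper names are mine, and the zero special case is only noted in a comment):

#include <immintrin.h>

/* 63 - vplzcntq used as a per-qword tzcnt. For an all-zero qword,
   vplzcntq returns 64, so this yields -1; special-case that (e.g. with
   a merge-masked blend) if a zero element can occur. */
static __m512i tzcnt_epi64(__m512i v)
{
    /* Isolate the lowest set bit: v & -v. */
    __m512i lowbit = _mm512_and_si512(
        v, _mm512_sub_epi64(_mm512_setzero_si512(), v));
    __m512i lz = _mm512_lzcnt_epi64(lowbit);               /* AVX512CD */
    return _mm512_sub_epi64(_mm512_set1_epi64(63), lz);
}

/* Byte index of the lowest non-zero byte in each qword: the right shift
   discards the bit-index-within-byte garbage in the low 3 bits. */
static __m512i lowest_nonzero_byte_epi64(__m512i v)
{
    return _mm512_srli_epi64(tzcnt_epi64(v), 3);
}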
Or store masks to temporary storage somewhere (perhaps the widths array, before replacing it with counts). Then later reload with vectors for a vectorized tzcnt per byte. There isn't a vplzcntb, but a bithack can efficiently set up for vpopcntb (AVX512BITALG, in Ice Lake and later). We'd want to turn the trailing 0s into 1s and clear everything else. (~v) & (v-1) makes a mask up to and not including the lowest 1 bit, or all-ones for an input of zero. (Using _mm512_sub_epi8 element size for the subtraction, of course.) This is similar to (x-1) ^ x (scalar blsmsk), which makes a mask up to and including the lowest 1 bit.
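Under those assumptions, a minimal sketch (tzcnt_clamped_epi8 is a hypothetical name; each input byte would be one cell's 8-column mask from the movemask step, 64 cells per vector):

#include <immintrin.h>

/* Vectorized tzcnt for each byte: (~v) & (v-1) sets exactly the
   trailing-zero bits (all 8 bits when v == 0), then vpopcntb counts
   them. Requires AVX512BITALG (+ AVX512BW for the byte subtract). */
static __m512i tzcnt_clamped_epi8(__m512i v)
{
    __m512i vm1 = _mm512_sub_epi8(v, _mm512_set1_epi8(1));
    __m512i trailing = _mm512_andnot_si512(v, vm1);   /* (~v) & (v-1) */
    return _mm512_popcnt_epi8(trailing);              /* vpopcntb */
}

Conveniently, a zero mask byte (no all-zero column in the cell) gives popcnt(0xFF) = 8, which is exactly the clamp to 8 from earlier, so the result bytes are the final widths.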
See "Trying to write a vectorized implementation of Gerd Isenberg's Bit Scan Forward as an exercise" for more details.
Also related: https://catonmat.net/low-level-bit-hacks for bithack basics, and https://web.archive.org/web/20231120194321/https://graphics.stanford.edu/%7Eseander/bithacks.html#ZerosOnRightLinear for some bithacks to get ctz directly, rather than feeding vpopcntb. (But that's probably worse: I think it would still need 3 steps of shift/mask/something, and x86 doesn't have 8-bit SIMD shifts.)
Even with just AVX2, there's probably something you can do with vpshufb lookup tables (like the popcount strategy: https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx2-lookup.cpp), or perhaps DeBruijn sequences with 16-bit multiplies after isolating the lowest set bit. See the vectorized BSF answer linked in the previous section. A speculative sketch of the lookup-table idea follows.
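One way the lookup-table idea might look (tzcnt_clamped_epi8_avx2 is my name for it; the LUT entries at indices that can't occur are don't-care zeros):

#include <immintrin.h>

/* Per-byte tzcnt with AVX2 only, clamped to 8 for zero bytes.
   Isolate the lowest set bit (0 or a power of two per byte), then
   look up its log2 per nibble with vpshufb and add the halves. */
static __m256i tzcnt_clamped_epi8_avx2(__m256i v)
{
    __m256i zero = _mm256_setzero_si256();
    __m256i lsb  = _mm256_and_si256(v, _mm256_sub_epi8(zero, v)); /* v & -v */

    __m256i lo = _mm256_and_si256(lsb, _mm256_set1_epi8(0x0F));
    /* No 8-bit SIMD shift exists; 16-bit shift + mask extracts high nibbles. */
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(lsb, 4),
                                  _mm256_set1_epi8(0x0F));

    /* Valid indices are only 0, 1, 2, 4 and 8 (lsb is a power of two). */
    const __m256i lut_lo = _mm256_setr_epi8(
        0, 0, 1, 0, 2, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 2, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0);
    const __m256i lut_hi = _mm256_setr_epi8(
        0, 4, 5, 0, 6, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0,
        0, 4, 5, 0, 6, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0);
    __m256i tz = _mm256_add_epi8(_mm256_shuffle_epi8(lut_lo, lo),
                                 _mm256_shuffle_epi8(lut_hi, hi));

    /* Zero input bytes produced 0; replace them with the clamped value 8. */
    __m256i was_zero = _mm256_cmpeq_epi8(v, zero);
    return _mm256_blendv_epi8(tz, _mm256_set1_epi8(8), was_zero);
}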
This might not be worth it vs. just doing four scalar tzcnt operations, since each 32 bits of mask data ends up in a GPR anyway after vpmovmskb. That's certainly easier, and might well be fast enough, even if it's possible to go a bit faster without AVX-512.