I have an algorithm that measures the width of each cell (8x8) in a bitmap (128x128) by counting the distance from the start of a cell to the first column within it containing only zeroes. If there is no such column, that cell is assigned a width of 8.
Here is the source code for the program:
#include <stdint.h>

uint8_t bitmap[128][128]; // all 8 bits are used, so 0..255
int widths[16][16];       // any integer type is fine, like uint8_t

void measure_cell(int i, int j)
{
    int x = i * 8;
    int x_end = x + 8;
    /* Horizontal sweep */
    while (x < x_end) {
        int y = j * 8;
        int y_end = y + 8;
        /* Vertical sweep */
        while (y < y_end && !bitmap[y][x])
            ++y;
        /* All-zero column? */
        if (y == y_end)
            break;
        ++x;
    }
    widths[j][i] = 8 - (x_end - x);
}

int main()
{
    /* Load bitmap from file */
    // ...

    /* Calculate widths for each cell */
    for (int j = 0; j < 16; ++j)
        for (int i = 0; i < 16; ++i)
            measure_cell(i, j);

    /* Print widths to stdout */
    // ...
    return 0;
}
Is there any way to speed this up using SIMD? Assume I am targeting x86-64.
Vertically OR 8 vectors of row data (so a byte of the result is 0 only if all bytes in that column were zero).
_mm256_loadu_si256 and _mm256_or_si256.
(or aligned load if you use alignas(32) on your array.)
If it's useful, you could instead use pmaxub (_mm_max_epu8) to get the largest value in each column of the cell, which is likewise 0 only if every byte in the column is 0. But if you're not going to use the max value for anything, just use bitwise OR.
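For example, a minimal sketch of that step, assuming the bitmap layout from the question (row-major, so one 32-byte load covers one row of four horizontally adjacent cells; or_rows is just a name I made up):

#include <immintrin.h>
#include <stdint.h>

extern uint8_t bitmap[128][128];  /* the array from the question */

/* OR together the 8 rows of a 4-cell-wide strip: cell row j,
   starting byte column x0 (x0 = 0, 32, 64, or 96). */
static __m256i or_rows(int j, int x0)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)&bitmap[j * 8][x0]);
    for (int r = 1; r < 8; ++r)
        v = _mm256_or_si256(v,
                _mm256_loadu_si256((const __m256i *)&bitmap[j * 8 + r][x0]));
    return v;  /* byte k is 0 iff column x0+k is all-zero within these 8 rows */
}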
Compare / movemask to get a bitmask of which columns were all-zero.
_mm256_cmpeq_epi8(v, _mm256_setzero_si256()) / _mm256_movemask_epi8
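Continuing the sketch (zero_columns_mask is another made-up helper name):

#include <immintrin.h>
#include <stdint.h>

/* Bit k of the result is set iff byte k of v is zero,
   i.e. iff that column of the strip was all-zero. */
static uint32_t zero_columns_mask(__m256i v)
{
    __m256i zc = _mm256_cmpeq_epi8(v, _mm256_setzero_si256());
    return (uint32_t)_mm256_movemask_epi8(zc);
}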
Split up that bitmask into 8-bit chunks (cell boundaries) and bit-scan (tzcnt / C++20 std::countr_zero() / C23 stdc_trailing_zeros) to find the position of the lowest set bit, i.e. the first all-zero column.
Since you want to clamp the result to 8, actually do stdc_trailing_zeros(mask | (1<<8)), or I guess stdc_trailing_zeros_uc since unsigned char is 8 bits wide on x86.
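Per cell, that could look like this (a sketch; __builtin_ctz is the GNU builtin, standing in for tzcnt / stdc_trailing_zeros):

#include <stdint.h>

/* Split the 32-bit mask into four 8-bit chunks (one per cell) and
   bit-scan each one, clamped to 8 by OR-ing in bit 8. */
static void widths_from_mask(uint32_t mask, int w[4])
{
    for (int c = 0; c < 4; ++c) {
        unsigned chunk = (mask >> (c * 8)) & 0xffu;
        w[c] = __builtin_ctz(chunk | 0x100u);  /* 8 if the chunk was 0 */
    }
}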
_mm256_cmpeq_epi8 is AVX2, for __m256i vectors; with vpmovmskb that produces a 32-bit bitmask (4 cells wide).
The same thing works with just SSE2 and __m128i vectors (16-bit mask = 2 cells).
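For completeness, a driver loop over the whole 128x128 bitmap using the hypothetical helpers sketched above (AVX2 path, four cells per inner iteration):

#include <stdint.h>

extern int widths[16][16];  /* the array from the question */

static void measure_all_avx2(void)
{
    for (int j = 0; j < 16; ++j)                 /* cell rows */
        for (int x0 = 0; x0 < 128; x0 += 32) {   /* 4-cell strips */
            uint32_t mask = zero_columns_mask(or_rows(j, x0));
            widths_from_mask(mask, &widths[j][x0 / 8]);
        }
}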
If you have AVX-512, you could vectorize the bit-scan part, too: each cell's 8 columns are exactly one 64-bit element. 63 - vplzcntq can be used as a tzcnt if you use a bithack to isolate the lowest set bit. (And special-case an all-zero element to get 64 instead of -1, I guess?) You don't need to booleanize into a vector mask (which would restrict you to 256-bit vectors, since AVX-512 itself can only compare into mask registers); a zero-byte bithack like (b - 1) & ~b & 0x80 per byte flags each all-zero column directly. Then just find the position of the lowest set bit in the 64-bit element and right-shift by 3 to get the index of the lowest flagged byte, i.e. the cell's width. That shift conveniently clears the low 3 bits, which hold unwanted garbage: the bit-index inside a byte.
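A sketch of that idea, assuming AVX512BW (byte-granularity subtract) plus AVX512CD (vplzcntq); the final unsigned min is one way to handle the special case of an element with no all-zero column:

#include <immintrin.h>

/* Input: v = OR of the 8 rows of a strip (one 8-byte cell per 64-bit
   element).  Output: each 64-bit element holds that cell's width, 0..8. */
static __m512i cell_widths_avx512(__m512i v)
{
    /* Flag each all-zero byte with 0x80: (b - 1) & ~b & 0x80 per byte */
    __m512i z = _mm512_and_si512(
        _mm512_andnot_si512(v, _mm512_sub_epi8(v, _mm512_set1_epi8(1))),
        _mm512_set1_epi8((char)0x80));
    /* Isolate the lowest set bit per qword: z & (0 - z) */
    __m512i lo = _mm512_and_si512(z,
        _mm512_sub_epi64(_mm512_setzero_si512(), z));
    /* tzcnt = 63 - lzcnt(isolated bit); >>3 discards the bit-within-byte
       garbage and leaves the byte index = width */
    __m512i w = _mm512_srli_epi64(
        _mm512_sub_epi64(_mm512_set1_epi64(63), _mm512_lzcnt_epi64(lo)), 3);
    /* z == 0 (no all-zero column) wrapped to a huge value; clamp to 8 */
    return _mm512_min_epu64(w, _mm512_set1_epi64(8));
}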
Or store the masks to temporary storage somewhere (perhaps the width array, before replacing it with counts). Then later reload them with vectors for a vectorized tzcnt per byte. There isn't a vplzcntb, but a bithack can efficiently set up for vpopcntb (AVX512BITALG, in Ice Lake and later): we'd want to turn the trailing 0s into 1s and clear everything else, since the popcount of that is exactly the trailing-zero count.
(~v) & (v-1) makes a mask up to and not including the lowest 1 bit, or all-ones for an input of zero (using _mm512_sub_epi8 element size for the subtraction, of course), so a zero mask byte gives a popcount of 8, which is exactly the clamp you want. This is similar to (x-1) ^ x (scalar blsmsk), which makes a mask up to and including the lowest 1 bit.
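A sketch of that, assuming AVX512BW + AVX512BITALG, with one 8-bit cell mask per byte (bit set = all-zero column) after reloading the stored masks:

#include <immintrin.h>

/* tzcnt of each byte via popcnt((~v) & (v - 1)): the trailing zeros
   become ones, everything else is cleared, and a zero byte becomes
   all-ones, so its popcount is 8 -- the clamp comes for free. */
static __m512i tzcnt_epi8_avx512(__m512i v)
{
    __m512i vm1   = _mm512_sub_epi8(v, _mm512_set1_epi8(1));
    __m512i trail = _mm512_andnot_si512(v, vm1);   /* (~v) & (v-1) */
    return _mm512_popcnt_epi8(trail);              /* = tzcnt, 0..8 */
}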
See Trying to write a vectorized implementation of Gerd Isenberg's Bit Scan Forward as an exercise for more details.
Also related: https://catonmat.net/low-level-bit-hacks for bithack basics, and https://web.archive.org/web/20231120194321/https://graphics.stanford.edu/%7Eseander/bithacks.html#ZerosOnRightLinear for some bithacks to get ctz directly, rather than feeding vpopcntb. (But that's probably worse: I think it would still need 3 steps of shift/mask/something, and x86 doesn't have 8-bit SIMD shifts.)
Even with just AVX2, there's probably something you can do with vpshufb lookup tables (like the popcount strategy: https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx2-lookup.cpp), or perhaps De Bruijn sequences with 16-bit multiplies after isolating the lowest set bit. See the vectorized BSF answer linked in the previous section.
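For example, a nibble-LUT tzcnt-per-byte with AVX2 doesn't even need the lowest-set-bit isolation: look up the tzcnt of each nibble (8 for a zero nibble) and take the unsigned min, so a zero mask byte yields 8 and the clamp falls out naturally. (tzcnt_epi8_avx2 is a made-up name.)

#include <immintrin.h>

/* tzcnt of each byte of v via two vpshufb nibble lookups.
   lut_lo[n] = tzcnt(n) (8 for n == 0); lut_hi is the same plus 4.
   min(lo, hi) is the tzcnt of the whole byte, and 8 for a zero byte.
   No 8-bit SIMD shift exists, hence the 16-bit shift plus re-mask. */
static __m256i tzcnt_epi8_avx2(__m256i v)
{
    const __m256i lut_lo = _mm256_setr_epi8(
        8, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
        8, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0);
    const __m256i lut_hi = _mm256_setr_epi8(
        8, 4, 5, 4, 6, 4, 5, 4, 7, 4, 5, 4, 6, 4, 5, 4,
        8, 4, 5, 4, 6, 4, 5, 4, 7, 4, 5, 4, 6, 4, 5, 4);
    const __m256i nib = _mm256_set1_epi8(0x0f);
    __m256i lo = _mm256_shuffle_epi8(lut_lo, _mm256_and_si256(v, nib));
    __m256i hi = _mm256_shuffle_epi8(lut_hi,
        _mm256_and_si256(_mm256_srli_epi16(v, 4), nib));
    return _mm256_min_epu8(lo, hi);
}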
This might not be worth it vs. just doing four scalar tzcnt operations, since each 32 bits of mask data ends up in a GPR anyway after vpmovmskb. That's certainly easier, and might be fast enough, even if it's possible to go a bit faster without AVX-512.