I'm trying to use the SIMD group reduction/prefix functions in a series of reasonably complex compute kernels in a Mac app. I need to allocate some threadgroup memory for coordinating between SIMD groups in the same threadgroup. This array should therefore have a capacity depending on `[[simdgroups_per_threadgroup]]`, but that's not a compile-time value, so it can't be used as an array dimension.
Now, according to various WWDC session videos, `threadExecutionWidth` on the pipeline object should return the SIMD group size, with which I could then allocate an appropriate amount of memory using `setThreadgroupMemoryLength:atIndex:` on the compute encoder.
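Concretely, the kernel-side pattern looks something like the sketch below (kernel and buffer names are just placeholders): the threadgroup buffer bound at `[[threadgroup(0)]]` needs one slot per SIMD group, and the host sizes it with `setThreadgroupMemoryLength:atIndex:` using whatever it believes the SIMD group size to be.

```metal
#include <metal_stdlib>
using namespace metal;

// Minimal two-stage reduction sketch. The host is expected to bind, at
// threadgroup index 0, (threadsPerThreadgroup / simdGroupSize) * sizeof(float)
// bytes via setThreadgroupMemoryLength:atIndex: -- which is exactly where a
// wrong guess at the SIMD group size turns into out-of-bounds writes.
kernel void reduce_sum(device const float *input        [[buffer(0)]],
                       device float       *output       [[buffer(1)]],
                       threadgroup float  *perSimdGroup [[threadgroup(0)]],
                       uint gid        [[thread_position_in_grid]],
                       uint lane       [[thread_index_in_simdgroup]],
                       uint simdIndex  [[simdgroup_index_in_threadgroup]],
                       uint simdCount  [[simdgroups_per_threadgroup]],
                       uint localIndex [[thread_index_in_threadgroup]],
                       uint groupIndex [[threadgroup_position_in_grid]])
{
    // Stage 1: reduce within each SIMD group.
    float partial = simd_sum(input[gid]);

    // Stage 2: the first lane of each SIMD group publishes its partial result.
    // This needs one threadgroup slot per SIMD group, i.e. simdCount slots.
    if (lane == 0) {
        perSimdGroup[simdIndex] = partial;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Stage 3: the first SIMD group combines the partials (assumes
    // simdgroups_per_threadgroup <= threads_per_simdgroup).
    if (simdIndex == 0) {
        float total = (localIndex < simdCount) ? perSimdGroup[localIndex] : 0.0f;
        total = simd_sum(total);
        if (localIndex == 0) {
            output[groupIndex] = total;
        }
    }
}
```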
This works consistently on some hardware (e.g. on Apple M1, `threadExecutionWidth` always seems to report 32), but I'm hitting configurations where `threadExecutionWidth` does not match the apparent SIMD group size, causing runtime errors due to out-of-bounds access. (For example, on Intel UHD Graphics 630, `threadExecutionWidth` = 16 for some complex kernels, although the SIMD group size seems to be 32.)
So:

- Is there a way to reliably query the SIMD group size a kernel will actually use, before dispatching it?
- Failing that, is the SIMD group size at least the same for every kernel on a given device?

If the latter is at least true, I can presumably trust `threadExecutionWidth` for the most trivial of kernels? Or should I submit a trivial kernel to the GPU which returns `[[threads_per_simdgroup]]`?
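For what it's worth, such a probe kernel would be almost a one-liner; the open question is whether the value it reports is guaranteed to match what a more complex kernel gets.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical probe: dispatch a single thread and read back the SIMD group
// width the GPU actually uses for this (trivial) kernel.
kernel void probe_simd_width(device uint *width [[buffer(0)]],
                             uint simdWidth     [[threads_per_simdgroup]])
{
    *width = simdWidth;
}
```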
I suspect the problem might occur in kernels where Metal offers an "odd" (non-power-of-two) maximum threadgroup size, although in the case I'm encountering, the maximum threadgroup size is reported as 896, which is an integer multiple of 32, so it's not as if `threadExecutionWidth` is simply the greatest common divisor of the maximum threadgroup size and the SIMD group size.
I never found a particularly satisfying solution to this, but I did at least find an effective one:
1. Size the threadgroup memory (and any dependent buffers) using `threadExecutionWidth`, and pass the SIMD group size you assumed into the kernel as an argument.
2. In the kernel, check that assumption against the actual `[[threads_per_simdgroup]]` / `[[simdgroups_per_threadgroup]]`. If it matches, great, run the rest of the kernel (see the sketch below).
3. If it doesn't match, write the actual SIMD group size to a `device` argument memory buffer. Then early-out of the compute kernel.
4. Back on the CPU, check whether the kernel wrote to that `device` memory. If so, inspect the reported SIMD group size, adjust buffer allocations, then re-run the kernel with the new value.

For the truly paranoid, it may be wise to make the check in step 2 a lower or upper bound, or perhaps a range, rather than an equality check: e.g., the allocated memory is safe for SIMD group sizes up to (or from) N threads. That way, if changing the threadgroup buffer allocations should itself change `simdgroups_per_threadgroup` (😱), you don't end up bouncing backwards and forwards between values, making no progress.
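Here is roughly what the kernel-side half of this (steps 1–3) might look like; the kernel name and argument layout are invented for illustration:

```metal
#include <metal_stdlib>
using namespace metal;

// Step 1: the host sizes the threadgroup buffer from threadExecutionWidth and
// passes that assumed SIMD group size in at buffer(0), plus a small device
// buffer at buffer(1) for the kernel to report the real value.
kernel void reduce_sum_checked(constant uint     &assumedSimdWidth [[buffer(0)]],
                               device uint       *actualSimdWidth  [[buffer(1)]],
                               threadgroup float *perSimdGroup     [[threadgroup(0)]],
                               uint simdWidth [[threads_per_simdgroup]],
                               uint simdIndex [[simdgroup_index_in_threadgroup]],
                               uint lane      [[thread_index_in_simdgroup]])
{
    // Step 2: does the sizing assumption hold for this dispatch?
    if (simdWidth != assumedSimdWidth) {
        // Step 3: report the real width and bail out before touching
        // perSimdGroup, which was sized for the wrong number of SIMD groups.
        *actualSimdWidth = simdWidth;
        return;
    }

    // Assumption holds: perSimdGroup really has one slot per SIMD group, so
    // the usual reduce-within-SIMD-group / combine-across-groups code is safe.
    float partial = simd_sum(1.0f);   // stand-in for the real per-thread work
    if (lane == 0) {
        perSimdGroup[simdIndex] = partial;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
    // ... step 4 happens back on the CPU: if *actualSimdWidth was written,
    // resize the allocations and dispatch again.
}
```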
Also pay attention to what you do in SIMD groups: not all GPU models support SIMD group reduction functions, even if they support SIMD permutations, so ship alternate versions of kernels for such older GPUs if necessary.
Finally, I've found most GPUs to report SIMD group sizes of 32 threads, but the Intel Iris Graphics 6100 in ~2015 MacBook Pros reports a `[[threads_per_simdgroup]]` (and `threadExecutionWidth`) value of 8. (It also doesn't support SIMD reduction functions, but does support SIMD permutation functions, including `simd_ballot()`, which can be almost as effective as reductions for some algorithms.)
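For example, a per-SIMD-group count of lanes matching a predicate, which would otherwise be a `simd_sum()` of a 0/1 value, can be built from `simd_ballot()` and `popcount()`; a rough sketch (names invented, and it assumes the grid exactly covers the input):

```metal
#include <metal_stdlib>
using namespace metal;

// Count non-zero inputs without SIMD reductions: ballot yields a bitmask with
// one bit per lane that passed the predicate, popcount counts the bits.
kernel void count_nonzero(device const float *input [[buffer(0)]],
                          device atomic_uint *total [[buffer(1)]],
                          uint gid  [[thread_position_in_grid]],
                          uint lane [[thread_index_in_simdgroup]])
{
    bool pred = input[gid] != 0.0f;
    simd_vote vote = simd_ballot(pred);

    // Every SIMD group size I've seen on macOS is <= 32, so the low 32 bits
    // of the 64-bit vote mask are enough here.
    uint mask  = uint(static_cast<simd_vote::vote_t>(vote));
    uint count = popcount(mask);

    // One lane per SIMD group folds its count into the global total.
    if (lane == 0) {
        atomic_fetch_add_explicit(total, count, memory_order_relaxed);
    }
}
```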