I have a very simple kernel, in Metal Shading Language, that uses shared (threadgroup) memory but which is not working, and I can't understand why. It just writes a value to shared memory, and then read and outputs it:
kernel void test_shared(
uint thread_index [[thread_position_in_grid]],
uint local_index [[thread_index_in_threadgroup]],
uint group_index [[threadgroup_position_in_grid]],
device float* output [[buffer(0)]],
threadgroup float shared[32])
{
float value = 1;
shared[local_index] = value;
threadgroup_barrier(mem_flags::mem_threadgroup); /// <-- important line
value = shared[local_index];
if(local_index == 0) output[group_index] = value;
}
I launch 100 threadgroups, each with 32 threads via this C++ code:
void test()
{
const unsigned int bufsize = sizeof(float) * 100;
NS::SharedPtr<MTL::Buffer> buf = NS::TransferPtr(device.newBuffer(bufsize, MTL::ResourceStorageModeShared));
MTL::CommandBuffer* cmd = queue->commandBuffer();
MTL::ComputeCommandEncoder* encoder = cmd->computeCommandEncoder();
encoder->setBuffer(buf.get(), 0, 0);
encoder->setComputePipelineState(kernel_test_shared.get());
encoder->dispatchThreadgroups(MTL::Size(100, 1, 1), MTL::Size(32, 1, 1));
encoder->endEncoding();
cmd->commit();
cmd->waitUntilCompleted();
const float* dat = (const float*)buf->contents();
printf("buf =\n");
for(unsigned int k = 0; k < 100; k++)
printf(" [%u] %f\n", k, dat[k]);
}
About the output at the end of test()
:
threadgroup_barrier(...)
, but I suspect this would be the result of a compiler optimization.Why would the threadgroup_barrier
seemingly clear the shared memory to 0?
If you are going to use threadgroup memory, you need to use setThreadgroupMemoryLength
.
I would also suggest explicitly specifying the [[threadgroup(n)]]
binding point.