I have an application where I need to broadcast a single (non-constant, just plain old data) value in global memory to all threads. The threads only need to read the value, not write to it. I cannot explicitly tell the application to use the constant cache (with e.g. cudaMemcpyToSymbol) because I am using a memory-wrapping library that does not give me explicit low-level control.
I am wondering how this broadcast takes place under the hood, and how it may differ from the usual access pattern where each thread accesses a unique global memory location (for simplicity assume that this "usual" access pattern is coalesced). I am especially interested in any implicit serializations that may take place in the broadcast case, and how this may be affected by different architectures.
For example, for Fermi, presumably the first thread to access the value will pull it to the L2 cache, then to its SM's L1 cache, at which point every thread resident on the SM will attempt to grab it from the L1 cache. Is there any serialization penalty when all threads attempt to access the same L1 cache value?
For Kepler, presumably the first thread to access the value will pull it to the L2 cache (then may or may not pull it to the L1 cache depending on whether L1 caching is enabled). Is there any serialization penalty when all threads attempt to access the same value in L2?
Also, is partition camping a concern?
I found another couple of questions that addressed a similar topic, but not at a level of detail sufficient to satisfy my curiosity.
Thanks in advance!
I have an application where I need to broadcast a single (non-constant, just plain old data) value in global memory to all threads. The threads only need to read the value, not write to it.
As an aside, that is pretty much the definition of constant data as it pertains to CUDA kernel usage. You may not be able to take advantage of it, but such access is referred to as "uniform" access, and if a value that threads only read from (and never write to) is accessed repeatedly in this way, then __constant__ memory is a possible optimization to consider.
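For reference, here is a minimal sketch of what uniform access through constant memory looks like when you do control the allocation (which, per the question, the wrapping library may not allow). The symbol name scale and the kernel name apply_scale are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical symbol: one read-only value placed in constant memory.
__constant__ float scale;

// Hypothetical kernel: every thread reads the same constant-memory address.
__global__ void apply_scale(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Uniform access: all threads in the warp read the same location.
        out[i] = in[i] * scale;
    }
}

int main()
{
    const int n = 1024;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    // Copy the single value into the __constant__ symbol.
    float h_scale = 2.0f;
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    apply_scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[3] = %f\n", h_out[3]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```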
I am wondering how this broadcast takes place under the hood
To be clear, broadcast and/or serialization should only be possible when threads in the same warp are accessing a particular data item. These terms don't apply when threads in different warps are accessing the same location; those will be serviced by separate warp read requests.
Is there any serialization penalty when all threads attempt to access the same L1 cache value?
There is no serialization penalty. Threads in the same warp can read the same location without additional cost; all threads reading from the same location will be serviced in the same cycle ("broadcast"). Threads in separate warps reading the same location on Fermi will be serviced by separate read requests just as you would expect for any instruction executed by separate warps. There is no additional or unusual cost in this case either.
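To make the two access patterns concrete, here is a small sketch, not a benchmark. The kernel names broadcast_read and coalesced_read are hypothetical. In the first kernel every thread in a warp loads the same global address, so the warp issues one read request and the value is broadcast to its threads. In the second, each thread loads its own element, which is the usual coalesced case.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// All threads load the same global address (the broadcast case).
__global__ void broadcast_read(const float* value, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = *value;
}

// Each thread loads its own element (the usual coalesced case).
__global__ void coalesced_read(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    std::vector<float> h_in(n, 1.0f), h_out(n);
    float h_value = 42.0f;

    float *d_value = nullptr, *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_value, sizeof(float));
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_value, &h_value, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Broadcast case: one value in global memory read by every thread.
    broadcast_read<<<blocks, threads>>>(d_value, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("broadcast: out[n-1] = %f\n", h_out[n - 1]);

    // Coalesced case: a unique element per thread.
    coalesced_read<<<blocks, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("coalesced: out[n-1] = %f\n", h_out[n - 1]);

    cudaFree(d_value);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Timing the two kernels under a profiler on your own hardware is the easiest way to confirm that the broadcast case carries no extra cost on a given architecture.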
Is there any serialization penalty when all threads attempt to access the same value in L2?
The same statements for L1 above apply for L2 in this case.
Also, is partition camping a concern?
Partition camping has nothing to do with values that are being retrieved from L1 or L2 cache. Partition camping generally refers to a data access pattern that results in DRAM requests being disproportionately handled by one of the partitions on a GPU that has multiple memory partitions. For a single location that is being read by multiple threads/warps, the caches will service this. At most, one DRAM transaction should be needed to service all requests targeting a single location that are close enough to each other in time (i.e. ignoring the possibility of cache thrashing).