Using the .idx option of shfl.sync, it is possible to arbitrarily permute registers between threads in a single warp. The hope is that by using shfl.sync, you can avoid storing and then loading data from shared memory.
However, according to this HN comment, it sounds like shfl.sync will sometimes go through shared memory (or some other slow path), depending on the specific shuffle configuration:
In CUDA PTX, there is an instruction called "shfl.sync.idx", which allows you to shuffle data between threads arbitrarily. However, this shuffling is only efficient if you follow a variety of rules: avoiding bank conflicts and so forth.
I looked around online, but couldn't find any resources which lay out exactly which shuffles are fast.
When using shfl.sync with .idx, is it always fast? Does it always go through registers? If not, what are the rules I need to follow if I want shfl.sync with .idx to hit the fast path?
TL;DR: All shuffles are blazingly fast.
The shfl instructions perform a gather operation between different threads in a warp.
The way this works is that (A) every thread exposes its own 32-bit value and (B) picks one of the 32 values from the menu exposed in A.
This always works via registers; shared memory is never touched.
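For example, a broadcast is just the degenerate gather where every lane picks the same source (my_value is a placeholder):

const int from_lane0 = __shfl_sync(0xffffffffu, my_value, 0, 32); //every lane reads lane 0's value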
The drawback is that you can only specify the source in idx, not the destination. This means you cannot easily do a scatter operation, only a gather.
The idx variant is the most general, and every thread can specify a different index.
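As an aside, if you do need a scatter, you can sometimes fake one by inverting the permutation by hand. A minimal sketch (the rotation and the helper name are mine, not from any library):

__device__ int scatter_rotate(int value) {
    //goal (a scatter): lane L sends its value to lane (L + 1) % 32
    //shfl only gathers, so invert it: lane L fetches from (L + 31) % 32,
    //the unique lane whose destination is L; for arbitrary runtime
    //permutations you would first have to communicate the destinations
    const int lane = threadIdx.x & 31;
    return __shfl_sync(0xffffffffu, value, (lane + 31) & 31, 32);
}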
You can easily time the variants using the following code:
#include <cstdio>

__global__ void time_shfl(int i) { //parameter to stop the optimizer from eliminating code
    for (auto r = 0; r < 5; r++) { //disregard the first iteration
        const auto a = i * int(threadIdx.x);
        const auto source = 31 - int(threadIdx.x); //reverse the order
        const auto StartTime = clock64();
        const auto b = __shfl_sync(-1u, a, source, 32);
        const auto EndTime = clock64();
        printf("shfl_sync: T:%i, a = %i, b = %i, time = %i\n", int(threadIdx.x), a, b, int(EndTime - StartTime));
    }
}

int main() {
    time_shfl<<<1, 32>>>(10);
    cudaDeviceSynchronize();
}
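(To build and run this yourself, plain nvcc should do, e.g. nvcc -O2 time_shfl.cu -o time_shfl && ./time_shfl; the file name is whatever you saved it as.)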
Timings:
shfl_sync: T:0, a = 0, b = 310, time = 510
...
shfl_sync: T:31, a = 310, b = 0, time = 510
What? 510 cycles? Oh wait, that includes startup time (fetching the uncached i parameter from main memory). For this reason, you can only run these micro-benchmarks meaningfully in a loop; the later iterations give the real numbers:
shfl_sync: T:26, a = 260, b = 50, time = 7
shfl_sync: T:27, a = 270, b = 40, time = 7
...
shfl_sync: T:30, a = 300, b = 10, time = 7
shfl_sync: T:31, a = 310, b = 0, time = 7
This takes 7 cycles, which includes 2 cycles for the clock64 call. Ergo, 5 cycles for the shfl.sync.idx itself.
If you time shfl_xor you'll see it runs just as fast, but it does more work: with shfl_idx you need to calculate the index yourself (which we should include in our timings, but didn't for comparison's sake).
shfl_xor_sync: T:24, a = 240, b = 70, time = 7
...
shfl_xor_sync: T:30, a = 300, b = 10, time = 7
shfl_xor_sync: T:31, a = 310, b = 0, time = 7
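That index-for-free property is why shfl_xor is the workhorse of butterfly patterns. A sketch of the classic warp-sum reduction (my helper, not part of the CUDA headers):

__device__ int warp_sum(int v) {
    //butterfly reduction: each step adds the value from the lane whose index
    //differs in one bit; after 5 steps every lane holds the sum of all 32
    for (int mask = 16; mask > 0; mask >>= 1)
        v += __shfl_xor_sync(0xffffffffu, v, mask, 32);
    return v;
}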
And yes, the high-level intrinsics map directly onto the PTX assembly instructions, except that you cannot access the predicate register that the PTX shfl.* variants return.
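If you do want that predicate (it reports whether the source lane was in range and active), you can drop to inline PTX. A hedged sketch, assuming a full-warp .idx shuffle; the 0x1f operand is the width-32 clamp the intrinsic would emit as well:

__device__ int shfl_idx_with_pred(int value, int src_lane, bool &valid) {
    int result;
    unsigned pred;
    asm volatile("{\n\t"
                 ".reg .pred p;\n\t"
                 "shfl.sync.idx.b32 %0|p, %2, %3, 0x1f, 0xffffffff;\n\t"
                 "selp.u32 %1, 1, 0, p;\n\t" //materialize the predicate into a register
                 "}"
                 : "=r"(result), "=r"(pred)
                 : "r"(value), "r"(src_lane));
    valid = pred != 0;
    return result;
}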
However, this low cycle count is not the whole story: these instructions also have a latency, i.e. it takes some time for the result to become available. We can measure this by using the output of the first shuffle as the input to another shuffle, and then subtracting the 10 cycles the two shuffles themselves take.
const auto StartTime = clock64();
const auto b = __shfl_sync(-1u, a, source, 32);
const auto c = __shfl_sync(-1u, b, source, 8); //the params do not affect timing
const auto EndTime = clock64();
This gives us 30 - 10 = 20 cycles of latency -yikes-.
You should really try to interleave multiple independent pieces of work if latency is something you worry about (see the sketch after the timings below); the GPU will also schedule other warps whilst waiting between these two statements (if the multiprocessor has more than 64 threads running on it). For some applications (most of my code, sigh) the latency matters much more than the throughput, but you'll need to keep an eye on both most of the time.
shfl_sync: T:26, a = 260, b = 50, c = 20, time = 30
...
shfl_sync: T:30, a = 300, b = 10, c = 60, time = 30
shfl_sync: T:31, a = 310, b = 0, c = 70, time = 30
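To make the interleaving advice concrete, here is a hypothetical pair of independent chains; the second chain's shuffles issue while the first chain's results are still in flight, so both finish in well under 2 x 20 cycles:

__device__ int2 two_chains(int a0, int a1, int source) {
    const int b0 = __shfl_sync(0xffffffffu, a0, source, 32);
    const int b1 = __shfl_sync(0xffffffffu, a1, source, 32); //independent of b0
    const int c0 = __shfl_sync(0xffffffffu, b0, source, 32); //waits only on b0
    const int c1 = __shfl_sync(0xffffffffu, b1, source, 32); //waits only on b1
    return make_int2(c0, c1);
}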
CUDA will never use shared memory unless you declare a __shared__ variable. Cooperative Groups is a library that uses __shared__ variables, and on Ampere GPUs a 1 KB section of shared memory is reserved, but neither has anything to do with shuffles.
If you are wondering, shared memory is not much slower:
__shared__ int volatile shuffle[32]; //volatile allows unsafe speed hacks
const auto StartTime = clock64();
shuffle[threadIdx.x] = a;
//without __syncwarp() this will give incorrect results if threads are not in lockstep!
//__syncwarp(); //within the same warp, any *sync instruction will make this safe
const auto b = shuffle[source];
const auto EndTime = clock64();
One extra cycle for a load + store. Note that on the RTX 3070 (which I used for these tests), there are only 16 integer cores available per warp, meaning that the minimum time any instruction can take is 2 cycles.
shared shuffle: T:26, a = 260, b = 50, time = 8
...
shared shuffle: T:30, a = 300, b = 10, time = 8
shared shuffle: T:31, a = 310, b = 0, time = 8
However, this code will only work correctly if all threads in the warp are in sync; if not, you'll need to add a __syncwarp() before the b = shuffle[source] statement.
shared shuffle: T:29, a = 290, b = 20, time = 21
shared shuffle: T:30, a = 300, b = 10, time = 21
shared shuffle: T:31, a = 310, b = 0, time = 21
The shfl.sync instruction will always sync the threads itself, so for a possibly unsynced warp it is much faster (7 vs. 21 cycles).
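So if you do have to take the shared-memory route with a possibly diverged warp, wrap it up properly. A sketch (the helper name and the dedicated 32-int per-warp buffer are my assumptions):

__device__ int smem_shuffle(volatile int *buf, int value, int source) {
    buf[threadIdx.x & 31] = value;
    __syncwarp(); //make every lane's store visible before any lane loads
    const int result = buf[source];
    __syncwarp(); //keep a later call from overwriting buf too early
    return result;
}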
Long story short, time your code.