cuda gpu nvidia gpgpu compute-capability

Understanding Warp Scheduler Utilization in CUDA: Maximum Concurrent Warps vs Resident Warps


On devices of CUDA compute capability 8.6, each Streaming Multiprocessor (SM) has four warp schedulers. Each warp scheduler can schedule up to 16 warps concurrently, meaning that theoretically up to 64 warps could be running concurrently. In reality, however, the maximum number of resident warps per SM is only 48. This looks inconsistent: doesn't it mean that part of the warp schedulers' capacity is wasted? The schedulers could handle 64 warps, yet at most 48 warps are ever available for them to schedule. Could anyone clarify this?

UPDATE

Why do I think 'Each warp scheduler can schedule up to 16 warps concurrently, meaning that theoretically up to 64 warps could be running concurrently'? Because in the Ampere Tuning Guide, the documentation states: "The maximum number of concurrent warps per SM remains the same as in Volta (i.e., 64)." Doesn't this imply that each warp scheduler can schedule up to 16 warps concurrently?


Solution

  • As @RobertCrovella points out - your second sentence is incorrect. It is not the case that each warp scheduler "can schedule up to 16 warps".

    Looking at the Ampere microarchitecture white paper or the relevant section of the CUDA programming guide (for CC 8.x), we don't see any mention of the number of warps a scheduler handles. We do read, though, that the SM is made up of 4 partitions, each with its own scheduler, and that warps are distributed, on reception, "among the schedulers", hence among the partitions. So it stands to reason that if an SM can have 48 resident warps, each SM partition (or "processing block") can have up to 12 resident warps (48 / 4), and that's the number each scheduler can handle.

    Part of the mixup may be that the Ampere Tuning Guide is referring to the number of resident warps on A100 GPUs (CC 8.0) rather than on other Ampere GPUs (with CC 8.6). The former can have up to 64 resident warps per SM, the latter only 48.
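
If you want to check the resident-warp limit on your own GPU rather than rely on the documents, a minimal sketch along these lines should do. It reads `maxThreadsPerMultiProcessor` and `warpSize` from `cudaGetDeviceProperties`. The count of 4 schedulers per SM is an assumption taken from the white paper (the runtime API does not report it), and `warpsPerScheduler` is a hypothetical helper that simply assumes an even split across partitions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: 4 warp schedulers (one per SM partition), as described in the
// Ampere white paper. The CUDA runtime API does not expose this number.
constexpr int kSchedulersPerSM = 4;

// Hypothetical helper: resident warps per scheduler, assuming the SM's
// resident warps are split evenly among its partitions.
constexpr int warpsPerScheduler(int maxThreadsPerSM, int warpSize, int schedulers) {
    return (maxThreadsPerSM / warpSize) / schedulers;
}

int main() {
    const int device = 0;  // first CUDA device; change as needed
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Maximum resident warps per SM = max resident threads per SM / warp size.
    const int residentWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    std::printf("%s (CC %d.%d)\n", prop.name, prop.major, prop.minor);
    std::printf("max resident warps per SM:        %d\n", residentWarps);
    std::printf("per scheduler (assuming %d / SM):  %d\n", kSchedulersPerSM,
                warpsPerScheduler(prop.maxThreadsPerMultiProcessor,
                                  prop.warpSize, kSchedulersPerSM));
    return 0;
}
```

Compile with `nvcc` and run on the device in question. A CC 8.6 device allows 1536 resident threads per SM, i.e. 48 warps, or 12 per scheduler under the even-split assumption; a CC 8.0 device allows 2048 threads, i.e. 64 warps, or 16 per scheduler.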