[SOLVED] How does a warp cause another warp to be in the idle state?


As the title says, I want to know how one warp can cause another warp to go into the idle state. I have read many Q&As on SO but cannot find the answer. Can only one warp in a block run at any given time? If so, the idle state of a warp has no meaning; but if multiple warps can run at the same time, each warp can do its work independently of the other warps.

The paper said: Irregular work-items lead to whole warps to be in idle state (e.g., warp0 w.r.t. warp1 in the following fig).

Solution

The terms used by the Nsight VSE profiler for a warp's state are defined at http://docs.nvidia.com/gameworks/index.html#developertools/desktop/nsight/analysis/report/cudaexperiments/kernellevel/issueefficiency.htm. These terms are also used in numerous GTC presentations on performance analysis.

The compute work distributor (CWD) will launch a thread block on an SM when all resources for the thread block are available. Resources include:

thread block slot
warp slots (sufficient for the block)
registers for each warp
shared memory for the block
barriers for the block

When an SM has sufficient resources the thread block is launched on the SM. The thread block is rasterized into warps. Warps are assigned to warp schedulers. Resources are allocated to each warp. At this point a warp is in an active state, meaning that the warp can execute instructions.

On each cycle each warp scheduler selects from a list of eligible warps (active, not stalled) and issues 1-2 instructions for the warp. A warp can become stalled for numerous reasons. See the documentation above.

Kepler - Volta GPUs (except GP100) have 4 warp schedulers (subpartitions) per streaming multiprocessor (SM). All warps of a thread block must be on the same SM. Therefore, on each given cycle a thread block may issue instructions for up to 4 warps (one per subpartition) in the thread block.

Each warp scheduler can pick any of the eligible warps each cycle. The SM is pipelined, so all warps of a maximum-sized thread block (1024 threads == 32 warps) can have instructions in flight every cycle.

The only definitions of idle that I can determine without additional context are:

- If a warp scheduler has 2 eligible warps and 1 is selected, then the other is stalled in a state called not selected.
- If warps in a thread block execute a barrier (__syncthreads), then the warps will stall on the barrier (not eligible) until the requirements of the barrier are met.