Issued load/store instructions for replay

There are two nvprof metrics regarding load/store instructions and they are ldst_executed and ldst_issued. We know that executed<=issued. I expect that those load/stores that are issued but not executed are related to branch predications and other incorrent predictions. However, from this (slide 9) document and this topic, instructions that are issued but not executed are related to serialization and replay.

I don't know if that reason applies for load/store instructions or not. Moreover, I would like to know why such terminology is used for issued but not executed instructions? If there is a serialization for any reason, instructions are executed multiple times. So, why they are not counted as executed?

Any explanation for that?

Solution

The NVIDIA architecture optimized memory throughput by issuing an instruction for a group of threads called a warp. If each thread accesses a consecutive data element or the same element then the access can be performed very efficiently. However, if each thread accesses data in a different cache line or at a different address in the same bank then there is a conflict and the instruction has to be replayed.

inst_executed is the count of instructions retired. inst_issued is the count of instructions issued. An instruction may be issued multiple times in the case of a vector memory access, memory address conflict, memory bank conflict, etc. On each issue the thread mask is reduced until all threads have completed.

The distinction is made for two reasons: 1. Retirement of an instruction indicates completion of a data dependency. The data dependency is only resolved 1 time despite possible replays. 2. The ratio between issued and executed is a simple way to show opportunities to save warp scheduler issue cycles.

In Fermi and Kepler SM if a memory conflict was encountered then the instruction was replayed (re-issued) until all threads completed. This was performed by the warp scheduler. These replays consume issue cycles reducing the ability for the SM to issue instructions to math pipes. In this SM issued > executed indicates an opportunity for optimization especially if issued IPC is high.

In the Maxwell-Turing SM replays for vector accesses, address conflicts, and memory conflicts are replayed by the memory unit (shared memory, L1, etc.) and do not steal warp scheduler issue cycles. In this SM issued is very seldom more than a few % above executed.

EXAMPLE: A kernel loads a 32-bit value. All 32 threads in the warp are active and each thread accesses a unique cache line (stride = 128 bytes).

On Kepler (CC3.*) SM the instruction is issued 1 time then replayed 31 additional times as the Kepler L1 can only perform 1 tag lookup per request.

inst_executed = 1 inst_issued = 32

On Kepler the instruction has to be replayed again for each request that missed in the L1. If all threads miss in the L1 cache then

inst_executed = 1 inst_issued >= 64 = 32 request + 32 replays for misses

On Maxwell - Turing architecture the replay is performed by the SM memory system. The replays can limit memory throughput but will not block the warp scheduler from issuing instructions to the math pipe.

inst_executed = 1 inst_issued = 1

On Maxwell-Turing Nsight Compute/Perfworks expose throughput counters for each of the memory pipelines including number of cycles due to memory bank conflicts, serialization of atomics, address divergence, etc.