I am using the NVIDIA Visual Profiler (nvvp) to profile a cuBLAS kernel. The latency distribution result is at this link: Latency Distribution.
The documentation explains the term "instruction issued" this way: "Instruction Issued - Warp was issued", which confuses me. What does it actually mean?
First some background about the CUDA execution model.
A CUDA warp is the fundamental unit of scheduling and execution on a CUDA GPU. A warp is a fixed collection of 32 threads that execute together.
Therefore, in any given clock cycle, an instruction executed by one thread in the warp is also executed by all the other threads in the warp (although some of them may be predicated off or masked inactive, and ignoring Volta for this discussion).
The CUDA streaming multiprocessor (SM) has schedulers, which look at various threads of execution belonging to the available warps, and select instructions from those threads of execution which are ready, to schedule those instructions on various execution units within the SM.
"Instruction issued", then, means that the warp scheduler selected an instruction and issued it (scheduled it) onto a set of execution units for processing. Because of the CUDA execution model, saying an "instruction was issued" effectively means that the instruction was issued warp-wide: it was scheduled onto 32 relevant execution units, so as to service that instruction for all 32 threads in the warp. We could equally say "that warp was issued", meaning that the instruction was issued for all 32 threads in the warp.
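As a rough illustration of "issued warp-wide", here is a toy Python model. It is not how the hardware is implemented, and the names `issue_warp_wide`, `WARP_SIZE` and `active_mask` are made up for this sketch. One call stands for one issued instruction: it is applied across all 32 lanes of the warp, and lanes whose bit in the mask is clear are skipped, which loosely mimics threads that are predicated off or masked inactive.

```python
WARP_SIZE = 32  # a warp is a fixed collection of 32 threads

def issue_warp_wide(op, registers, active_mask=0xFFFFFFFF):
    """Toy model: apply one instruction (op) to every active lane of a warp.

    registers   -- list of 32 per-lane values (one per thread in the warp)
    active_mask -- hypothetical 32-bit mask; bit i set means lane i is active
    """
    assert len(registers) == WARP_SIZE
    result = []
    for lane in range(WARP_SIZE):
        if (active_mask >> lane) & 1:
            result.append(op(registers[lane]))  # lane executes the instruction
        else:
            result.append(registers[lane])      # lane is masked off: no effect
    return result

# One "issue" services all 32 threads; here only the even lanes are active.
regs = list(range(WARP_SIZE))
out = issue_warp_wide(lambda x: x + 100, regs, active_mask=0x55555555)
```

The point of the sketch is only that the unit being issued is the whole warp, not an individual thread; masked lanes still take part in the issue, they just produce no result.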
Now, regarding the distribution piechart, you will probably want to refer to here.
The profiler uses PC sampling to determine the warp state at each sample point, and then collects the sampled warp states into a pie chart that shows the percentage of samples in which each state was observed (an estimate of the fraction of time spent in that state).
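The aggregation step behind such a pie chart can be sketched in a few lines of Python. This is only an assumption about the general idea (count the sampled states, convert counts to percentages), not the profiler's actual code; `state_distribution` and the sample data are hypothetical.

```python
from collections import Counter

def state_distribution(samples):
    """Return {state: percentage of samples} for a list of sampled warp states."""
    counts = Counter(samples)
    total = len(samples)
    return {state: 100.0 * n / total for state, n in counts.items()}

# Hypothetical samples; the state names are only illustrative labels.
samples = ["issued", "stalled", "stalled", "not selected",
           "issued", "stalled", "stalled", "stalled"]
dist = state_distribution(samples)
# dist["issued"] is 25.0: the warp was seen in that state in 2 of 8 samples.
```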
A warp can be in a variety of states, and I'm not going to try to define and summarize them all. Many of those states are "stall" states, meaning that a warp in that state cannot have an instruction issued from it (for example, because the next instruction(s) have execution dependencies on previously issued instructions which have not completed yet). The "not stalled" state is "instruction issued". (The warp states are defined here. Technically, "not selected" is a "stall" state, but I will discuss it below.)
"Instruction issued" is probably the "best" state from the perspective of the warp. At the clock cycle in which the warp was sampled, it had an instruction ready to be scheduled, and in fact one or more instructions were actually issued from that warp.
By comparison, a "not selected" warp (technically also in a "stall" state) is one that is "ready" to be issued, but for some reason the warp scheduler chose to issue instruction(s) from another warp in the clock cycle that was sampled.
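The difference between the three kinds of state can be made concrete with a toy Python model of a single scheduler cycle. Everything here is an assumption for illustration: the function `classify_warps` is hypothetical, the scheduler is assumed to pick exactly one ready warp per cycle, and the round-robin choice is a stand-in, since the real selection policy is not described above.

```python
def classify_warps(ready, cycle):
    """Toy model of one scheduler cycle.

    ready -- list of booleans; ready[i] is True if warp i has an instruction
             ready to issue this cycle (False models any stall reason)
    cycle -- cycle number, used for a hypothetical round-robin choice
    Returns a list of state labels, one per warp.
    """
    states = ["stalled"] * len(ready)
    eligible = [i for i, r in enumerate(ready) if r]
    if eligible:
        chosen = eligible[cycle % len(eligible)]  # assumed policy, not NVIDIA's
        for i in eligible:
            # Exactly one ready warp is issued; the other ready warps lose out.
            states[i] = "issued" if i == chosen else "not selected"
    return states

# Warps 0, 2 and 3 are ready; warp 1 is waiting on something.
states = classify_warps([True, False, True, True], cycle=1)
# states == ["not selected", "stalled", "issued", "not selected"]
```

In this model a "not selected" warp is indistinguishable from an "issued" one in terms of readiness; the only difference is which warp the scheduler happened to pick in that cycle.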