I read the NVIDIA Fermi whitepaper and got confused when I counted the SP cores and schedulers.
According to the whitepaper, each SM has two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. An SM also has 32 SP cores, each with a fully pipelined ALU and FPU that executes one thread's instruction.
As we all know, a warp is made up of 32 threads. If we issued just one warp each cycle, all of its threads would occupy all 32 SP cores and finish execution in one cycle (assuming there are no stalls).
However, NVIDIA designed a dual scheduler, which selects two warps and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs.
NVIDIA says this design leads to peak hardware performance, perhaps because interleaving the execution of different instructions takes full advantage of the hardware resources.
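(For reference, here is a minimal kernel sketch, with illustrative names, of how that 32-thread warp grouping looks from the programmer's side; warpSize is the built-in CUDA constant, equal to 32 on Fermi.)

```
// Minimal sketch (illustrative names): how the 32-thread warp grouping shows up
// in a kernel. warpSize is the built-in CUDA constant, equal to 32 on Fermi.
__global__ void show_warp_layout(int *warp_id, int *lane_id)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    warp_id[tid] = threadIdx.x / warpSize;            // which warp within the block
    lane_id[tid] = threadIdx.x % warpSize;            // lane (0..31) inside the warp
}
```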
My questions are as follows (assume no memory stalls and all operands are available):
Does each warp need two cycles to finish execution, and are the 32 SP cores divided into two groups of sixteen, one group per warp scheduler?
Are the LD/ST and SFU units shared by all the warps (i.e., available uniformly to warps from both schedulers)?
If a warp is divided into two halves, which half is scheduled first? Is there a scheduler for that, or is one half just picked at random?
What is the advantage of this design? Is it just to maximize hardware utilization?
Does each warp need two cycles to finish execution, and are the 32 SP cores divided into two groups of sixteen, one group per warp scheduler?
Yes. Fermi, unlike later generations, has a "hotclock" (shader clock) which runs at 2x the "core" clock. Each single-precision floating-point instruction (for example) issues over 2 "hotclocks", but to the same group of 16 SP cores. The net effect is one issue per "core" clock per scheduler.
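As a back-of-the-envelope check (my numbers, not from the whitepaper; I'm assuming GTX 480-class clocks and SM count), here is how the dual schedulers add up to the advertised peak single-precision rate:

```
// Back-of-the-envelope peak throughput for Fermi (host-side arithmetic only).
// The GTX 480 figures below are my assumptions, not taken from the whitepaper.
#include <stdio.h>

int main(void)
{
    const int    schedulers_per_sm   = 2;     // dual warp schedulers per SM
    const int    lanes_per_scheduler = 16;    // each scheduler feeds a group of 16 SP cores
    const int    sm_count            = 15;    // GTX 480 has 15 SMs enabled
    const double hotclock_ghz        = 1.401; // GTX 480 shader ("hot") clock, ~2x core clock
    const int    flops_per_fma       = 2;     // one fused multiply-add = 2 flops

    // Each scheduler issues one warp instruction per core clock; its 16 lanes take
    // 2 hotclocks to cover the 32 threads, so an SM keeps 32 FMA lanes busy per hotclock.
    double fma_lanes_per_sm = schedulers_per_sm * lanes_per_scheduler;  // = 32
    double peak_gflops = sm_count * fma_lanes_per_sm * hotclock_ghz * flops_per_fma;

    printf("Peak single-precision: %.0f GFLOPS\n", peak_gflops);  // ~1345 GFLOPS
    return 0;
}
```

The ~1345 GFLOPS result is in line with the commonly quoted GTX 480 peak, which is why "one issue per core clock per scheduler" is enough to saturate the SP cores.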
Are the LD/ST and SFU units shared by all the warps (i.e., available uniformly to warps from both schedulers)?
I don't quite understand the question. All execution resources are shared/available for instructions coming from either scheduler.
If a warp is divided into two halves, which half is scheduled first? Is there a scheduler for that, or is one half just picked at random?
Why does this matter? The machine behaves as if two complete warp instructions are scheduled in one core clock, i.e. "dual issue". You don't have visibility into anything happening at the hotclock level anyway.
What is the advantage of this design? Is it just to maximize hardware utilization?
Yes, as stated in the Fermi whitepaper:
"Using this elegant model of dual-issue, Fermi achieves near peak hardware performance."