According to the GK110 whitepaper, each SMX has a maximum of 64 warps and a maximum thread capacity of 2048 threads.
My question is this: Does each SMX always operate at this maximum resident warp number of 64 (assuming no thread divergence and a block size that is a multiple of 64)?
I have reason to believe that if the number of threads on an SMX is less than 1024, you will only get a maximum of 32 warps per multiprocessor.
(I believe this because my similarly clocked Fermi card shows similar speeds to my Kepler card when running the same code with 1024 threads in a single block.)
64 warps per SMX is the maximum number of warps that can be available and ready to be scheduled. It does not mean that all 64 warps are executing simultaneously. The GK110 SMX has 4 warp schedulers, each of which can schedule 1 or 2 instructions from a warp. So in any instruction cycle/issue slot, at most 4 warps will be "scheduled" to have their instruction(s) begin in that slot.
Threads are scheduled in groups of 32 called warps, so 1024 threads correspond to exactly 32 warps. It follows directly that if you have 1024 or fewer threads in flight on an SMX, you have at most 32 warps in flight there.
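The thread-to-warp arithmetic can be sketched in a few lines of host-side C++. This is only an illustration: the helper name `warps_for_threads` and the constant `kWarpSize` are made up for this example and are not part of the CUDA API.

```cpp
#include <cassert>

// Warp width on Fermi and Kepler GPUs.
const int kWarpSize = 32;

// Number of warps needed to hold `threads` threads.
// A partially filled warp still occupies a whole warp slot, so round up.
inline int warps_for_threads(int threads) {
    assert(threads >= 0);
    return (threads + kWarpSize - 1) / kWarpSize;
}
```

For example, `warps_for_threads(1024)` gives 32, and it takes 2048 threads to reach the 64 warps quoted for an SMX.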
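Both Fermi and Kepler are limited to 1024 threads per block. So the Fermi limit of 1536 threads per SM and the Kepler limit of 2048 threads per SMX can only be reached by having multiple threadblocks resident simultaneously on a given SM/SMX. Schedulable warps can come from any resident threadblock on the SM/SMX. If you launch a single block of 1024 threads, that one block supplies only 32 warps to whichever SM/SMX it lands on, on either architecture.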
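To make the multiple-block point concrete, here is a rough host-side C++ sketch of how many warps end up resident on one SM/SMX. It rests on several assumptions: the warp limits are the thread limits above divided by 32 (48 for Fermi, 64 for Kepler); the resident-block limits (8 for Fermi, 16 for Kepler) are not from this post but from the compute capability tables as I recall them; and register and shared memory limits are ignored entirely. The names `SmLimits` and `resident_warps` are invented for this example.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits. Only the warp and block limits are modelled here;
// register and shared memory pressure are deliberately ignored.
struct SmLimits {
    int max_warps;   // maximum resident warps per SM/SMX
    int max_blocks;  // maximum resident blocks per SM/SMX (assumed values)
};

const int kThreadsPerWarp = 32;
const SmLimits kFermi  = {48, 8};   // 1536 threads / 32 = 48 warps
const SmLimits kKepler = {64, 16};  // 2048 threads / 32 = 64 warps

// Warps resident on one SM/SMX when `blocks_available` blocks of
// `threads_per_block` threads are available to be placed on it.
inline int resident_warps(const SmLimits& sm, int threads_per_block,
                          int blocks_available) {
    assert(threads_per_block > 0 && threads_per_block <= 1024);
    assert(blocks_available >= 0);
    int warps_per_block =
        (threads_per_block + kThreadsPerWarp - 1) / kThreadsPerWarp;
    // A block is resident as a whole, so only whole blocks count.
    int blocks_that_fit =
        std::min(sm.max_blocks, sm.max_warps / warps_per_block);
    int resident_blocks = std::min(blocks_that_fit, blocks_available);
    return resident_blocks * warps_per_block;
}
```

Under these assumptions, a single 1024-thread block yields 32 resident warps on both `kFermi` and `kKepler`, which would be consistent with the similar timings described in the question. Kepler reaches 64 warps only when a second 1024-thread block is available for the same SMX. Note also that with 64-thread blocks the assumed 16-block limit caps a Kepler SMX at 32 warps, so a block size that is a multiple of 64 does not by itself guarantee full occupancy.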