cuda gpgpu nvidia gpu kepler

Are Kepler CC 3.0 GPU processors not only a pipelined architecture, but also superscalar?


The CUDA 6.5 documentation states: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ixzz3PIXMTktb

5.2.3. Multiprocessor Level

...

  • 8L for devices of compute capability 3.x since a multiprocessor issues a pair of instructions per warp over one clock cycle for four warps at a time, as mentioned in Compute Capability 3.x.

Does this mean that Kepler CC 3.0 GPU processors are not only a pipelined architecture, but also superscalar?

  1. Pipelining - these two sequences execute in parallel (different operations at one time):

    • LOAD [addr1] -> ADD -> STORE [addr1] -> NOP
    • NOP -> LOAD [addr2] -> ADD -> STORE [addr2]
  2. Superscalar - these two sequences execute in parallel (the same operations at one time):

    • LOAD [reg1] -> ADD -> STORE [reg1]
    • LOAD [reg2] -> ADD -> STORE [reg2]

Solution

  • Yes, the warp schedulers in Kepler can schedule two instructions per clock, as long as:

    1. the instructions are independent
    2. the instructions come from the same warp
    3. there are sufficient execution resources in the SM for both instructions

    If that fits your definition of superscalar, then it is superscalar.
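The three conditions above can be written down as a small host-side check. This is only a toy model to make the rule concrete, not the real scheduler logic: `Instr`, `Unit`, `independent`, `canDualIssue` and the `freeUnits` counters are all hypothetical names invented for this sketch, and "independent" is assumed here to mean no read-after-write, write-after-write or write-after-read hazard between the two instructions.

```cpp
#include <array>
#include <cstddef>

// Toy model only: these types and names are hypothetical. They do not
// correspond to any NVIDIA API or to the actual Kepler scheduler.
enum class Unit { FP32 = 0, LoadStore = 1, SpecialFunction = 2 };

struct Instr {
    int warp;                // warp the instruction belongs to
    Unit unit;               // kind of execution unit it needs
    int dst;                 // destination register (-1 = none)
    std::array<int, 2> src;  // source registers (-1 = unused)
};

// Assumed meaning of "independent": `later` does not read or overwrite
// the result of `earlier`, and does not overwrite one of its inputs.
bool independent(const Instr& earlier, const Instr& later) {
    for (int s : later.src)
        if (s != -1 && s == earlier.dst) return false;            // RAW
    if (later.dst != -1 && later.dst == earlier.dst) return false; // WAW
    for (int s : earlier.src)
        if (s != -1 && s == later.dst) return false;              // WAR
    return true;
}

// freeUnits[i] = number of idle unit groups of kind i in this clock.
bool canDualIssue(const Instr& a, const Instr& b,
                  const std::array<int, 3>& freeUnits) {
    if (!independent(a, b)) return false;  // condition 1
    if (a.warp != b.warp) return false;    // condition 2
    // Condition 3: enough execution resources for both instructions.
    const auto ia = static_cast<std::size_t>(a.unit);
    const auto ib = static_cast<std::size_t>(b.unit);
    if (ia == ib) return freeUnits[ia] >= 2;
    return freeUnits[ia] >= 1 && freeUnits[ib] >= 1;
}
```

In this model, two multiplies from the same warp that touch disjoint registers can be paired when two FP32 groups are idle, while a second instruction that reads the first one's destination register cannot.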

    With respect to pipelining, I view pipelining differently. Various execution units in a Kepler SM are pipelined. Let's take a floating-point multiply as an example.

    In a given clock, a Kepler warp scheduler may schedule a floating-point multiply operation on a floating-point unit. The result of this operation may not appear until some number of clocks later (i.e. it is not available on the next clock cycle). Nevertheless, on the next clock cycle a new floating-point operation can be scheduled on the very same floating-point functional units, because the hardware (the floating-point units, in this case) is pipelined.

    clock    operation    pipeline stage   result
    0           MPY1   ->   PS1
    1                       PS2
    ...                     ...
    N-1                     PSN         ->  result1
    

    On the very next clock after clock 0, a new multiply instruction can be scheduled on the same hardware, and its result will appear on the cycle after result1 appears.

    Not sure if this is what you meant by "different operations at one time".
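The timing in the pipeline table can be checked with a little arithmetic. The sketch below follows the table's convention (an operation issued at clock 0 delivers its result at clock N-1) and assumes one new independent operation can be issued every clock. The function names are hypothetical, and `depth` stands for the number of pipeline stages N, which is left unspecified above.

```cpp
#include <cstdint>

// Hypothetical helpers; `depth` is the number of pipeline stages N.

// Clock on which the result of an operation issued at `issueClock`
// appears: issued at clock 0 -> result at clock N-1, as in the table.
constexpr std::int64_t resultClock(std::int64_t issueClock,
                                   std::int64_t depth) {
    return issueClock + depth - 1;
}

// Total clocks to complete `count` independent operations when a new
// one enters the pipeline every clock (assumed best case).
constexpr std::int64_t pipelinedClocks(std::int64_t count,
                                       std::int64_t depth) {
    return count == 0 ? 0 : depth + count - 1;
}

// Total clocks if each operation had to finish before the next started.
constexpr std::int64_t unpipelinedClocks(std::int64_t count,
                                         std::int64_t depth) {
    return count * depth;
}
```

With an assumed depth of 4, a multiply issued at clock 0 produces its result at clock 3 and one issued at clock 1 produces its result at clock 4, so eight back-to-back multiplies take 11 clocks in total instead of 32.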