Tags: cuda, gpu, kepler

Nvidia Maxwell: increased global memory instruction count


I ran an experiment on both a GTX 760 (Kepler) and a GTX 750 Ti (Maxwell) using benchmark suites (Parboil, Rodinia), and then analyzed the results using the Nvidia Visual Profiler. In most of the applications, the number of global instructions is enormously increased, up to 7-10 times, on the Maxwell architecture.

Specs for both graphics cards:

                 Memory clock   Memory    Bus width   Bandwidth
    GTX 760      6.0 Gbps       2048 MB   256-bit     192.2 GB/s
    GTX 750 Ti   5.4 Gbps       2048 MB   128-bit     86.4 GB/s

Ubuntu 14.04

CUDA driver 340.29

CUDA toolkit 6.5

I compiled the benchmark applications (no modifications) and then collected the results from NVVP (6.5). From the Analyze All > Kernel Memory > From L1/Shared Memory section, I collected the global load transaction counts.
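
For reference, the same counter can also be collected from the command line with nvprof (the binary name ./histo below is just a placeholder for one of the compiled benchmarks):

    nvprof --metrics gld_transactions ./histo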

I attached screenshots of the profiling results for histo run on Kepler (link) and Maxwell (link).

Does anyone know why the number of global instructions counted increases so much on the Maxwell architecture?

Thank you.


Solution

  • The counter gld_transactions is not comparable between the Kepler and Maxwell architectures. Furthermore, it is not equivalent to the count of global instructions executed.

    On Fermi/Kepler this counts the number of 128-byte requests from the SM to L1. It can increment by 0-32 per global/generic instruction executed.

    On Maxwell, all global operations go through the TEX (unified) cache. The TEX cache is completely different from the Fermi/Kepler L1 cache. On Maxwell, global transactions measure the number of 32-byte sectors accessed in the cache. This can increment by 0-32 per global/generic instruction executed.

    If we look at 4 different cases:

    CASE 1: Each thread in a warp accesses the same 32-bit offset.

    CASE 2: Each thread in a warp accesses a 32-bit offset with a 128 byte stride.

    CASE 3: Each thread in a warp accesses a unique 32-bit offset based upon its lane index.

    CASE 4: Each thread in a warp accesses a unique 32-bit offset in a 128 byte memory range that is 128-byte aligned.

    gld_transactions for each case listed above, by architecture (a minimal kernel reproducing these access patterns is sketched after the table):

                Kepler      Maxwell
    Case 1      1           4
    Case 2      32          32
    Case 3      1           8
    Case 4      1           4-16
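
    A minimal CUDA sketch of the four access patterns, assuming one warp of 32 threads and 32-bit (float) loads; the kernel and buffer names are hypothetical, not taken from the benchmarks above:

        #include <cuda_runtime.h>

        // One warp; each line below issues one 32-bit global load per thread.
        // Only the addressing pattern differs between the four cases.
        __global__ void access_patterns(const float* __restrict__ in, float* out)
        {
            int lane = threadIdx.x & 31;   // lane index within the warp
            float v = 0.0f;
            v += in[0];                    // CASE 1: every lane reads the same 32-bit word
            v += in[lane * 32];            // CASE 2: 128-byte stride between lanes
            v += in[lane];                 // CASE 3: unique consecutive 32-bit offsets
            v += in[(lane * 7) & 31];      // CASE 4: unique offsets permuted within one
                                           //         128-byte-aligned 128-byte range
            out[threadIdx.x] = v;
        }

        int main()
        {
            float *in, *out;
            cudaMalloc(&in, 32 * 32 * sizeof(float));  // cudaMalloc is at least 256-byte aligned
            cudaMalloc(&out, 32 * sizeof(float));
            cudaMemset(in, 0, 32 * 32 * sizeof(float));
            access_patterns<<<1, 32>>>(in, out);       // launch a single warp
            cudaDeviceSynchronize();
            cudaFree(in);
            cudaFree(out);
            return 0;
        }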
    

    My recommendation is to avoid looking at gld_transactions. A future version of the CUDA profilers should use different metrics that are more actionable and comparable to past architectures.

    I would recommend looking at l2_{read, write}_{transactions, throughput}.
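
    For example, these can be collected with nvprof as follows (./app is a placeholder for the application binary):

        nvprof --metrics l2_read_transactions,l2_write_transactions,l2_read_throughput,l2_write_throughput ./app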