I ran an experiment on both a GTX 760 (Kepler) and a GTX 750 Ti (Maxwell) using the Parboil and Rodinia benchmarks, and analyzed the results with the NVIDIA Visual Profiler. In most of the applications, the reported number of global instructions increases enormously, up to 7-10 times, on the Maxwell architecture.
Specs for both graphics cards:

             Memory speed  Memory size  Bus width  Bandwidth
GTX 760      6.0 Gbps      2048 MB      256-bit    192.2 GB/s
GTX 750 Ti   5.4 Gbps      2048 MB      128-bit    86.4 GB/s
Ubuntu 14.04
CUDA driver 340.29
CUDA toolkit 6.5
I compiled the benchmark applications (no modifications) and collected the results with NVVP 6.5. From the Analyze All > Kernel Memory > L1/Shared Memory section, I collected the global load transaction counts.
I have attached screenshots of our results for histo run on Kepler (link) and Maxwell (link).
Does anyone know why the global instruction count is so much higher on the Maxwell architecture?
Thank you.
The counter gld_transactions is not comparable between the Kepler and Maxwell architectures, and it is not equivalent to the count of global instructions executed.
On Fermi/Kepler, this counter counts the number of 128-byte requests from the SM to L1. It can increment by 0-32 per global/generic instruction executed.
On Maxwell, all global operations go through the TEX (unified) cache, which is completely different from the Fermi/Kepler L1 cache. There, global transactions measure the number of 32-byte sectors accessed in the cache. This can likewise increment by 0-32 per global/generic instruction executed.
If we look at four different cases (a minimal kernel sketch reproducing them follows the table below):
CASE 1: Each thread in a warp accesses the same 32-bit offset.
CASE 2: Each thread in a warp accesses a 32-bit offset with a 128-byte stride.
CASE 3: Each thread in a warp accesses a unique 32-bit offset based on its lane index.
CASE 4: Each thread in a warp accesses a unique 32-bit offset within a 128-byte-aligned 128-byte memory range.
gld_transactions for each of the cases above, by architecture:

          Kepler   Maxwell
Case 1    1        4
Case 2    32       32
Case 3    1        8
Case 4    1        4-16
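To make the cases concrete, here is a minimal sketch of the four access patterns (the kernel names, launch configuration, and the reversed-order permutation used for case 4 are my own choices, not from the measurements above). Compiling it with nvcc and profiling each kernel, for example with nvprof --metrics gld_transactions, should let you reproduce the per-architecture counts in the table.

    #include <cuda_runtime.h>

    __global__ void case1_same_offset(const float *in, float *out) {
        // CASE 1: every thread in the warp reads the same 32-bit word.
        out[threadIdx.x] = in[0];
    }

    __global__ void case2_stride_128B(const float *in, float *out) {
        // CASE 2: each thread reads a 32-bit word 128 bytes (32 floats) apart.
        out[threadIdx.x] = in[threadIdx.x * 32];
    }

    __global__ void case3_lane_index(const float *in, float *out) {
        // CASE 3: each thread reads the 32-bit word at its lane index (coalesced).
        out[threadIdx.x] = in[threadIdx.x];
    }

    __global__ void case4_permuted(const float *in, float *out) {
        // CASE 4: each thread reads a unique 32-bit word inside one 128-byte,
        // 128-byte-aligned range, but not in lane order (reversed here).
        out[threadIdx.x] = in[31 - threadIdx.x];
    }

    int main() {
        float *in, *out;
        cudaMalloc(&in,  32 * 32 * sizeof(float));   // large enough for the strided case
        cudaMalloc(&out, 32 * sizeof(float));
        cudaMemset(in, 0, 32 * 32 * sizeof(float));

        // One warp per launch, so each kernel executes a single warp-wide global load.
        case1_same_offset<<<1, 32>>>(in, out);
        case2_stride_128B<<<1, 32>>>(in, out);
        case3_lane_index <<<1, 32>>>(in, out);
        case4_permuted   <<<1, 32>>>(in, out);

        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(out);
        return 0;
    }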
My recommendation is to avoid looking at gld_transactions. A future version of the CUDA profilers should expose different metrics that are more actionable and more comparable across architectures.
I would recommend looking at l2_{read, write}_{transactions, throughput}.
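For example, assuming the metric names exposed by nvprof 6.5 (verify what your setup reports with nvprof --query-metrics), you can collect these with nvprof --metrics l2_read_transactions,l2_write_transactions,l2_read_throughput,l2_write_throughput ./app; the L2 traffic numbers are much easier to compare between the two cards than gld_transactions.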