These are my assumptions:
Assuming Compute Capability 7.5, these are my questions:
CC 6.x/7.x
In Nsight Compute the meaning of the term "requests" differs between CC 6.x and CC 7.x.
Answering your CC 7.5 Questions
- The third assumption seems to imply that L2->TEX Returns should always be a multiple of four for global cached loads, but that's not always the case. What is happening here?
The L1TEX unit will only fetch the missed 32B sectors in a cache line, so the number of sectors returned from L2 does not have to be a multiple of four.
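To make the sector arithmetic concrete, here is a minimal host-side sketch. It is a simplified model, not how the hardware or Nsight Compute is implemented: it assumes a 128B cache line made of four 32B sectors, and the function name `sectors_fetched_from_l2` and the bitmask encoding are my own for illustration.

```cpp
#include <bitset>
#include <cstdint>

// Assumption for this sketch: one 128-byte cache line = four 32-byte sectors.
constexpr unsigned kSectorsPerLine = 4;

// Hypothetical helper: number of sectors that must come from L2 for one line.
// `requested` and `resident` are 4-bit masks, one bit per sector (bit 0 = sector 0).
unsigned sectors_fetched_from_l2(std::uint8_t requested, std::uint8_t resident) {
    const unsigned line_mask = (1u << kSectorsPerLine) - 1u;
    // Only sectors that are requested AND not already in L1 are fetched.
    const unsigned missed = requested & ~static_cast<unsigned>(resident) & line_mask;
    return static_cast<unsigned>(std::bitset<8>(missed).count());
}
```

For example, `sectors_fetched_from_l2(0b1111, 0b0011)` gives 2 (two of the four sectors were already resident), and `sectors_fetched_from_l2(0b0100, 0b0000)` gives 1 (only one sector of the line was touched). Neither is a multiple of four.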
- Is there still a point in marking pointers with const and restrict qualifiers? That used to be a hint to the compiler that the data is read-only and therefore can be cached in L1/texture cache, but now all data is cached there, both read-only and not read-only.
The compiler can perform additional optimizations if the data is known to be read-only.
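As a rough illustration of where the qualifiers go, here is a plain C++ sketch. The function `scale` is hypothetical, and `__restrict__` is a compiler extension (accepted by GCC, Clang and nvcc) rather than standard C++. In CUDA the same qualifiers would sit on the pointer parameters of a `__global__` kernel. Which optimizations the compiler actually applies is up to the compiler, and this sketch does not show them.

```cpp
#include <cstddef>

// `in` is only read, and the restrict qualifiers promise that `in` and `out`
// do not alias, so the compiler is free to reorder or batch the loads.
void scale(const float* __restrict__ in, float* __restrict__ out,
           float factor, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = factor * in[i];
    }
}
```

The promise is on the caller: passing overlapping buffers for `in` and `out` breaks it and the result is undefined.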
- From my fourth assumption, I would think that whenever TEX->SM Returns is greater than L2->TEX Returns, the difference comes from cache hits. That's because when there's a cache hit, you get some sectors read from L1, but none from L2. Is this true?
The L1TEX to SM return bandwidth is 128B/cycle. The L2 to SM return bandwidth is in 32B sectors.
The Nsight Compute Memory Workload Analysis | L1/TEX Cache table shows