Tags: caching, cuda, gpgpu, nsight, compute-capability

Cache behaviour in Compute Capability 7.5


These are my assumptions:

  1. There are two types of loads, cached and uncached. For cached loads, traffic goes through both L1 and L2; for uncached loads, it goes through L2 only.
  2. The default behaviour in Compute Capability 6.x and 7.x is cached accesses.
  3. An L1 cache line is 128 bytes and an L2 cache line is 32 bytes, so every L1 transaction should generate four L2 transactions (one per 32-byte sector).
  4. In Nsight, an SM->TEX Request is a warp-level instruction merged from up to 32 threads. L2->TEX Returns and TEX->SM Returns measure how many sectors are transferred between the respective memory units.
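To make the assumptions concrete, here is a hypothetical kernel (not from a profiled run; the counter values in the comments follow from assumptions 3 and 4 above):

```cuda
// Hypothetical sketch: each of the 32 threads in a warp reads one
// consecutive 4-byte float, so the warp touches 128 contiguous bytes.
__global__ void coalesced_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // per warp: 1 SM->TEX request covering one
                         // 128-byte line = 4 x 32-byte sectors
}
```

Under assumption 3, profiling this load in Nsight Compute should show four L2 sectors for every warp-level request.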

Assuming Compute Capability 7.5, these are my questions:

  1. The third assumption seems to imply that L2->TEX Returns should always be a multiple of four for global cached loads, but that's not always the case. What is happening here?
  2. Is there still a point in marking pointers with const and __restrict__ qualifiers? That used to hint to the compiler that the data is read-only and could therefore be cached in the L1/texture cache, but now all data is cached there, read-only or not.
  3. From my fourth assumption, I would think that whenever TEX->SM Returns is greater than L2->TEX Returns, the difference comes from cache hits. That's because when there's a cache hit, you get some sectors read from L1, but none from L2. Is this true?

Solution

  • CC 6.x/7.x

    In Nsight Compute, the meaning of the term requests differs between CC 6.x and 7.x.

    Answering your CC 7.5 Questions

    1. The third assumption seems to imply that L2->TEX Returns should always be a multiple of four for global cached loads, but that's not always the case. What is happening here?

    The L1TEX unit will only fetch the missed 32-byte sectors of a cache line, not the full 128-byte line. A warp-level request therefore produces between 0 and 4 L2 sectors, depending on which sectors the warp actually touches and which are already resident, so L2->TEX Returns need not be a multiple of four.
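For example, when every thread of a warp reads the same word, the warp touches only one 32-byte sector of the line, so a miss fetches just that sector (a hypothetical sketch, not a profiled case):

```cuda
// Hypothetical sketch: all 32 threads of a warp read the same element,
// so the warp touches a single 32-byte sector of a 128-byte line.
// On a miss, L1TEX fetches only that sector from L2: 1 SM->TEX
// request, but only 1 L2->TEX sector, not 4.
__global__ void single_sector(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[blockIdx.x];  // broadcast load: one sector per warp
}
```

Summed over a kernel, such access patterns make the total L2->TEX Returns a non-multiple of four.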

    2. Is there still a point in marking pointers with const and __restrict__ qualifiers? That used to be a hint to the compiler that the data is read-only and therefore can be cached in L1/texture cache, but now all data is cached there, both read-only and not read-only.

    Yes. The compiler can perform additional optimizations if the data is known to be read-only and non-aliased, such as reordering loads or keeping values in registers, independent of where the data is cached.
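A sketch of the qualifiers in question; with both, the compiler can prove the input is read-only and does not alias the output, which is what enables those optimizations:

```cuda
// With const + __restrict__ the compiler knows x is read-only and
// does not alias y, so it may hoist or reorder loads, keep values
// in registers, and emit read-only (__ldg-style) load instructions.
__global__ void saxpy(float a,
                      const float * __restrict__ x,
                      float * __restrict__ y,
                      int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```

Without the qualifiers, the compiler must assume x and y may overlap and conservatively re-load x[i] around every store.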

    3. From my fourth assumption, I would think that whenever TEX->SM Returns is greater than L2->TEX Returns, the difference comes from cache hits. That's because when there's a cache hit, you get some sectors read from L1, but none from L2. Is this true?

    L1TEX-to-SM return bandwidth is 128 bytes/cycle, while L2-to-L1TEX returns are counted in 32-byte sectors. The two counters are in different units, so their difference is not directly a count of L1 hit sectors.

    The Nsight Compute Memory Workload Analysis | L1/TEX Cache table shows these counters broken down per access type, including the hit rate, which is the more direct way to attribute sectors to L1 hits versus L2 returns.