My intuition was that under light load, write-back should waste some performance: data is written to the cache first and only flushed to global memory later, so there is a certain lag before it lands. Write-through, by contrast, updates global memory directly, which I expected to perform better.
Yet my test results on a 1050 Ti and an A100 are the same (timings measured with Nsight Systems).
The following is the test kernel. The code is simple: just a plain write-bandwidth test (light load).
__global__ void cuda_peek(volatile uint *a)
{
    uint gidx = blockIdx.x * blockDim.x + threadIdx.x;
    a[gidx] = gidx + 10;
}
int block_num = 6 * 32;   // 6 is the number of SMs on the 1050 Ti
int loop = 100;
int block_size = 1024;

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

dim3 gridDim, blockDim;
gridDim.x = block_num;
gridDim.y = 1;
gridDim.z = 1;
blockDim.x = block_size;
blockDim.y = 1;
blockDim.z = 1;
cuda_peek<<<gridDim, blockDim>>>(A);   // warm-up launch
cudaEventRecord(start);
for (uint i = 0; i < loop; i++)
{
    cuda_peek<<<gridDim, blockDim>>>(A);
}
cudaEventRecord(stop);
cudaEventSynchronize...
cudaMemcpy...
Why does CUDA show similar performance etc.?
If for no other reason - because you're micro-benchmarking wrong.
Your code will yield no meaningful result because the overheads are very high relative to the work; the performance "noise" alone may be an order of magnitude larger than the effect you're trying to measure. Scheduling the kernel many times reduces that noise somewhat - but only relative to a mean which is itself meaningless (pun not intended).
To test the (performance) behavior of some sequence of instructions: