I have completed writing my CUDA kernel, and confirmed it runs as expected when I compile it using nvcc directly, by:
However, the results printed to the terminal while the application is being profiled with Nsight Compute differ from run to run. Is this difference a cause for concern, or is it expected behavior?
Note: The application also gives correct and consistent results when profiled by nvprof.
I resolved the issue by fixing my shared memory initialization. Since Nsight Compute runs a kernel multiple times, as @Jackson stated, the effects of uninitialized memory were amplified: I was performing atomicAdd on shared memory that had never been initialized.
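For anyone hitting the same symptom, here is a minimal sketch of the kind of fix involved. It is not my actual kernel: the histogram kernel, `kBins`, `kThreads`, and the input data are all hypothetical, chosen only to show the pattern. The point is that `__shared__` memory is not zeroed at kernel launch, so it has to be cleared, followed by a `__syncthreads()`, before any thread calls atomicAdd on it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kBins = 64;      // hypothetical histogram size
constexpr int kThreads = 256;  // hypothetical block size

// Hypothetical kernel: each block accumulates a histogram in shared memory,
// then merges it into the global result.
__global__ void histogram(const int* data, int n, unsigned int* out) {
    __shared__ unsigned int bins[kBins];

    // Shared memory holds arbitrary values at launch; clear it explicitly.
    for (int i = threadIdx.x; i < kBins; i += blockDim.x) {
        bins[i] = 0;
    }
    __syncthreads();  // every bin is zero before any thread accumulates

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        atomicAdd(&bins[data[idx] % kBins], 1u);
    }
    __syncthreads();  // all block-local updates are done before the merge

    for (int i = threadIdx.x; i < kBins; i += blockDim.x) {
        atomicAdd(&out[i], bins[i]);
    }
}

int main() {
    const int n = 1 << 16;
    int* data = nullptr;
    unsigned int* out = nullptr;
    cudaMallocManaged(&data, n * sizeof(int));
    cudaMallocManaged(&out, kBins * sizeof(unsigned int));

    for (int i = 0; i < n; ++i) data[i] = i;     // non-negative sample input
    for (int i = 0; i < kBins; ++i) out[i] = 0;  // global accumulators need zeroing too

    histogram<<<(n + kThreads - 1) / kThreads, kThreads>>>(data, n, out);
    cudaDeviceSynchronize();

    printf("bin 0 = %u\n", out[0]);
    cudaFree(data);
    cudaFree(out);
    return 0;
}
```

Without the clearing loop and the first `__syncthreads()`, the result depends on whatever happened to be in shared memory beforehand. That can look fine in a plain run and still change between runs under a profiler that executes the kernel more than once.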