I'd like to process an image with CUDA. Each pixel's new value is calculated from its two horizontally neighboring pixels in the same row. Would it make sense to use __shared__ memory for the pixel values, given that each value is read only twice? Tiling also seems like the wrong approach, since it doesn't suit the problem structure. My approach would be to run one thread per pixel and have each thread load the values of its two neighbors.
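For concreteness, here is a minimal sketch of what I have in mind (the kernel name and the averaging operation are just placeholders):

```
// One thread per pixel; each thread loads its left and right neighbors
// directly from global memory. Borders are clamped as one possible policy.
__global__ void filterRow(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    float left  = in[y * width + max(x - 1, 0)];
    float right = in[y * width + min(x + 1, width - 1)];
    out[y * width + x] = 0.5f * (left + right);  // placeholder operation
}
```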
All currently supported CUDA architectures have caches.
From compute capability 3.5 onward, these are particularly efficient for read-only data: read-write data is cached only in L2, so the L1/texture cache is reserved for read-only data. If you mark the pointer to the input data as const __restrict__, the compiler will most likely load it via the texture cache. You can also force this by explicitly using the __ldg() builtin.
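As a sketch, assuming float pixels and the same placeholder two-neighbor average as above, the read-only path could look like this (the kernel name is made up; __ldg() forces the loads through the read-only cache):

```
__global__ void filterRowLdg(const float * __restrict__ in, float *out,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    // __ldg() loads through the read-only (texture) cache path.
    float left  = __ldg(&in[y * width + max(x - 1, 0)]);
    float right = __ldg(&in[y * width + min(x + 1, width - 1)]);
    out[y * width + x] = 0.5f * (left + right);
}
```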
While it is possible to explicitly manage the reuse of data from neighboring pixels via shared memory, you will probably find that this provides no benefit over simply relying on the cache.
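For comparison, explicitly managing the reuse would look roughly like the sketch below (all names hypothetical): each block stages its row segment plus a one-pixel halo on each side in shared memory, at the cost of the extra staging code and a __syncthreads().

```
// Launch with grid = (ceil(width / blockDim.x), height), blockDim.y = 1,
// and (blockDim.x + 2) * sizeof(float) bytes of dynamic shared memory.
__global__ void filterRowShared(const float *in, float *out,
                                int width, int height)
{
    extern __shared__ float tile[];      // blockDim.x pixels + 2 halo pixels
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y;
    int t = threadIdx.x + 1;             // +1 leaves room for the left halo

    if (x < width) {
        tile[t] = in[y * width + x];
        if (threadIdx.x == 0)                        // left halo, clamped
            tile[0] = in[y * width + max(x - 1, 0)];
        if (t == blockDim.x || x == width - 1)       // right halo, clamped
            tile[t + 1] = in[y * width + min(x + 1, width - 1)];
    }
    __syncthreads();

    if (x < width)
        out[y * width + x] = 0.5f * (tile[t - 1] + tile[t + 1]);
}
```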
Of course, whether or not you use shared memory, you want to maximize the block size in the x-direction and use a blockDim.y of 1 for optimal access locality.
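A launch configuration along those lines might look like this (sizes are illustrative; d_in and d_out are assumed device pointers):

```
// Wide 1D blocks along x with blockDim.y = 1, so each warp reads a
// contiguous, coalesced segment of one row.
dim3 block(256, 1);
dim3 grid((width + block.x - 1) / block.x,  // ceil(width / block.x)
          height);
filterRowLdg<<<grid, block>>>(d_in, d_out, width, height);
```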