Making some, but not all, CUDA memory accesses uncached
I just noticed that it is even possible to have (CUDA kernel) memory accesses be uncached (see e.g. this answer here on SO).
Can this be done...
1. For a single kernel individually?
2. At run time rather than at compile time?
3. For writes only, rather than for both reads and writes?
Solution
Only if you compile that kernel in its own compilation unit, because this is an instruction-level feature enabled during code generation. Alternatively, you can use inline PTX assembly to issue ld.global.cg instructions for particular load operations within a kernel [see here for details].
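As an illustration only (the helper name load_cg and the kernel are mine, not from the linked answer), here is a minimal sketch of the inline-PTX approach; the whole-compilation-unit alternative is nvcc's -Xptxas -dlcm=cg switch:

```cuda
// Sketch: load_cg() is a hypothetical helper, not part of the CUDA API.
// It issues an ld.global.cg load, which caches in L2 but bypasses L1.
__device__ __forceinline__ float load_cg(const float* ptr)
{
    float value;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(value) : "l"(ptr));
    return value;
}

// Example kernel: only this one load bypasses L1; every other access in the
// kernel keeps the default caching behaviour.
__global__ void scale(const float* __restrict__ in, float* __restrict__ out,
                      float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = factor * load_cg(in + i);
}
```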
No, it is an instruction-level feature of PTX. You can JIT-compile a version of the code containing non-caching memory loads at run time, but that is still technically compilation. You could, however, use template tricks and separate compilation to have the runtime hold two versions of the same code, one built with caching and one without, and choose between them at run time. The same tricks can give you two versions of a given kernel, with or without inline PTX for uncached loads [see here for one possibility of achieving this].
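A minimal sketch of that template idea, reusing the hypothetical load_cg() helper above: the same kernel body is instantiated twice, with and without the L1-bypassing load, and the host picks an instantiation at run time.

```cuda
// Both instantiations are generated at compile time; the choice between
// them happens at run time on the host.
template <bool BYPASS_L1>
__global__ void copy_kernel(const float* __restrict__ in,
                            float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = BYPASS_L1 ? load_cg(in + i) : in[i];
}

// Host-side dispatch: select the cached or uncached instantiation.
void launch_copy(const float* in, float* out, int n, bool bypassL1)
{
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    if (bypassL1)
        copy_kernel<true><<<grid, block>>>(in, out, n);
    else
        copy_kernel<false><<<grid, block>>>(in, out, n);
}
```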
These non-caching instructions bypass the L1 cache and are serviced from L2 at a finer (32-byte) granularity than the 128-byte L1 cache line. They apply to loads only: writes always invalidate the corresponding L1 line and store to L2 anyway.