python-3.x · memory-management · cuda · numba · numba-pro

How fast or slow is the constant memory that Numba allows a device to allocate, compared to local and shared memory?


I can't find any clear information about the performance of the so-called constant memory referred to in the Numba documentation:

https://numba.pydata.org/numba-doc/dev/cuda/memory.html#constant-memory

I am curious about the size limits of this memory, how fast or slow it is compared to other memory types, and whether there are any pitfalls in using it.

Thank you!


Solution

  • This is more of a general question about constant memory on a CUDA-capable device. You can find information in the official CUDA programming guide and in the CUDA Best Practices Guide, which says:

    There is a total of 64 KB constant memory on a device. The constant memory space is cached. As a result, a read from constant memory costs one memory read from device memory only on a cache miss; otherwise, it just costs one read from the constant cache. Accesses to different addresses by threads within a warp are serialized, thus the cost scales linearly with the number of unique addresses read by all threads within a warp. As such, the constant cache is best when threads in the same warp access only a few distinct locations. If all threads of a warp access the same location, then constant memory can be as fast as a register access.

    Regarding how this compares to other memory types, here is my short answer; you may want to read this page for further details (two short Numba sketches follow at the end of this answer):

    These memories have different scopes, lifetimes, and usages. The Numba page that you mentioned in your question explains the basics, but the official CUDA programming guide has a lot more detail. At the end of the day, when to use each kind of memory is largely application-dependent.
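
    As a minimal sketch of the broadcast case described in the quote, this is how constant memory looks in Numba (the lookup table, kernel name, and launch configuration are made up for illustration). Note that cuda.const.array_like requires a host array that is known at compile time:

    ```python
    import numpy as np
    from numba import cuda

    # Hypothetical lookup table; it must be a compile-time constant
    # (a global), not a kernel argument.
    LUT = np.arange(1, 65, dtype=np.float32)

    @cuda.jit
    def scale_by_lut(x, out):
        # The table is baked into constant memory when the kernel compiles.
        lut = cuda.const.array_like(LUT)
        i = cuda.grid(1)
        if i < x.size:
            # Every thread in the warp reads lut[0]: one broadcast from
            # the constant cache, close to register speed on a cache hit.
            out[i] = x[i] * lut[0]

    x = np.ones(1024, dtype=np.float32)
    out = np.empty_like(x)
    scale_by_lut[4, 256](x, out)  # 4 blocks of 256 threads
    print(out[:4])                # [1. 1. 1. 1.] since LUT[0] == 1.0
    ```

    Had the threads of a warp indexed the table with, say, lut[i % 64], they would touch many distinct addresses and the reads would be serialized, which is exactly the pitfall the quoted passage warns about.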
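
    For contrast, here is a sketch of shared and local memory declarations in a Numba kernel (the block size and the kernel itself are again illustrative, and the kernel assumes x.size is a multiple of TPB):

    ```python
    import numpy as np
    from numba import cuda, float32

    TPB = 128  # threads per block; must be a compile-time constant here

    @cuda.jit
    def block_reverse(x, out):
        # Shared memory: on-chip, visible to every thread in the block,
        # and lives only as long as the block does.
        tile = cuda.shared.array(TPB, dtype=float32)

        # Local memory: private to one thread; despite the name it is
        # usually backed by off-chip memory, so it is not inherently fast.
        scratch = cuda.local.array(4, dtype=float32)

        i = cuda.grid(1)
        t = cuda.threadIdx.x
        if i < x.size:
            tile[t] = x[i]
        cuda.syncthreads()  # writes to tile must finish before any reads
        if i < x.size:
            scratch[0] = tile[TPB - 1 - t]  # read a neighbour's element
            out[i] = scratch[0]

    x = np.arange(256, dtype=np.float32)
    out = np.empty_like(x)
    block_reverse[2, TPB](x, out)
    print(out[:4])  # [127. 126. 125. 124.] -- first block reversed
    ```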