I can't find any clarity about the performance of the so-called constant memory referred to in the Numba documentation:
https://numba.pydata.org/numba-doc/dev/cuda/memory.html#constant-memory
I am curious about the size limits of this memory, how fast or slow it is compared to other memory types, and whether there are any pitfalls in using it.
Thank you!
This is more of a general question about constant memory on a CUDA-capable device. You can find information in the official CUDA programming guide and here, which says:
There is a total of 64 KB constant memory on a device. The constant memory space is cached. As a result, a read from constant memory costs one memory read from device memory only on a cache miss; otherwise, it just costs one read from the constant cache. Accesses to different addresses by threads within a warp are serialized, thus the cost scales linearly with the number of unique addresses read by all threads within a warp. As such, the constant cache is best when threads in the same warp access only a few distinct locations. If all threads of a warp access the same location, then constant memory can be as fast as a register access.
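For concreteness, here is a minimal sketch of how constant memory is used from Numba, with the `cuda.const.array_like` API from the page you linked. The array name `COEFFS`, the kernel name `smooth`, and the launch configuration are just illustrative choices for this example:

```python
import numpy as np
from numba import cuda

# Small, read-only coefficient table; copied into constant memory
# when the kernel is compiled (it must be a compile-time global).
COEFFS = np.array([0.25, 0.5, 0.25], dtype=np.float32)

@cuda.jit
def smooth(x, out):
    # cuda.const.array_like freezes the global COEFFS into constant memory.
    c = cuda.const.array_like(COEFFS)
    i = cuda.grid(1)
    if i > 0 and i < x.size - 1:
        # Every thread in the warp reads the same c[j] at each step: the
        # broadcast case where a constant read can be as fast as a register.
        out[i] = c[0] * x[i - 1] + c[1] * x[i] + c[2] * x[i + 1]

x = np.random.rand(1 << 20).astype(np.float32)
out = np.zeros_like(x)
smooth[(x.size + 255) // 256, 256](x, out)
```

Note the access pattern: all threads read `c[0]`, `c[1]`, `c[2]` in lockstep, which is exactly the "few distinct locations" case the quote above describes. If threads indexed `c` by thread ID instead, the reads would serialize and constant memory would lose its advantage.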
Regarding how this compares to other memory types, here is my short answer. You may want to read this page for further details:
Registers: Thread-private on-chip read + write memory, which can be considered the fastest memory space on a GPU.
Local memory: Thread-private off-chip read + write memory which, despite its misleading name, physically resides in the same location as global memory; hence its high latency.
Global memory: The largest memory space, off-chip, with high latency, global scope, and read + write access.
Constant memory: Off-chip cached read-only memory limited to 64 KB, which can be accessed by threads as fast as registers if all threads of a warp access the same location.
Shared memory: On-chip, low-latency, read + write memory with limited space per multiprocessor (48 KB to 164 KB depending on the compute capability of your device); see the sketch after this list.
Texture memory: Off-chip read-only memory, cached on-chip and optimized for 2D spatial locality, that supports unique features like hardware filtering.
Pinned (page-locked) memory: Not an explicit device memory. Host memory accessible directly by both CPU and GPU code; used to maximize and overlap data transfers between the CPU and GPU.
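To make a few of these concrete in Numba's own Python dialect, here is a minimal sketch of a block-wise sum that touches several of the spaces above: pinned host memory for the transfer, global memory for the input, and shared memory for the in-block reduction. The names (`TPB`, `block_sum`) and the block size of 256 are assumptions for this example:

```python
import numpy as np
from numba import cuda, float32

TPB = 256  # assumed threads per block; must be a power of two here

@cuda.jit
def block_sum(x, partial):
    # Shared memory: on-chip scratch space visible to all threads in a block.
    tmp = cuda.shared.array(TPB, float32)
    tid = cuda.threadIdx.x
    i = cuda.grid(1)
    tmp[tid] = x[i] if i < x.size else 0.0
    cuda.syncthreads()
    # Tree reduction within the block, entirely in the shared buffer.
    s = TPB // 2
    while s > 0:
        if tid < s:
            tmp[tid] += tmp[tid + s]
        cuda.syncthreads()
        s //= 2
    if tid == 0:
        partial[cuda.blockIdx.x] = tmp[0]

n = 1 << 20
# Pinned (page-locked) host memory speeds up host<->device copies.
x_host = cuda.pinned_array(n, dtype=np.float32)
x_host[:] = 1.0
blocks = (n + TPB - 1) // TPB
x_dev = cuda.to_device(x_host)                      # global memory on the device
partial_dev = cuda.device_array(blocks, dtype=np.float32)
block_sum[blocks, TPB](x_dev, partial_dev)
print(partial_dev.copy_to_host().sum())             # expect n
```

The shared buffer is what keeps the reduction on-chip; doing the same tree reduction through `partial` in global memory would still work, but every step would pay global-memory latency instead.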
These memories have different scopes, lifetimes, and usages. The Numba page that you mentioned in your question explains the basics, but the official CUDA programming guide has a lot more detail. At the end of the day, the answer to the question of when to use each memory is largely application-dependent.