NVIDIA Ampere GPUs support a new feature: asynchronous copy from global memory to shared memory that bypasses the L1 cache and the register file. The corresponding PTX instruction is cp.async.
Why do we need __cvta_generic_to_shared to convert a shared memory pointer (T*) to size_t?
// smem_a is a __shared__ array declared in the kernel; tid, bx, by, bz are
// the thread/block indices, so only one thread prints.
size_t smem_a_converted = __cvta_generic_to_shared(smem_a);
if (tid == 0 && bx == 0 && by == 0 && bz == 0) {
    printf("addr check: %x, %x \n", smem_a, smem_a_converted);
}
Output: addr check: 1c001100, 7f92
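Each %x conversion only consumes 32 bits, so the values shown above may not correspond to the full 64-bit quantities. A minimal sketch to print the complete values, assuming the same smem_a and thread-index variables:

size_t smem_a_converted = __cvta_generic_to_shared(smem_a);
if (tid == 0 && bx == 0 && by == 0 && bz == 0) {
    // %p prints the generic pointer; %llx prints the full 64-bit converted value
    printf("addr check: %p, %llx \n",
           (void *)smem_a, (unsigned long long)smem_a_converted);
}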
At the PTX or machine-code (SASS) level, CUDA GPUs have a system of state spaces, which is a partitioned addressing structure. .shared is one such state space.
An access to one of these state spaces can happen using either of two types of pointers: one that is decorated as appropriate for that space, or a "generic" pointer. A generic pointer carries no state or additional information declaring which state space it belongs to.
In CUDA C++, as in C++, pointers are just pointers. They have no additional decoration or meta-information.
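To make the distinction concrete, here is a minimal sketch (illustrative names, not taken from the question's code) that reads the same __shared__ location once through a .shared state-space address and once through a generic address:

// Illustrative sketch: the pointer parameter is assumed to point into
// __shared__ memory; compile for a 64-bit target.
__device__ unsigned load_two_ways(unsigned *smem_ptr)
{
    unsigned via_shared, via_generic;

    // Decorated access: ld.shared needs a .shared state-space address,
    // obtained here by converting the generic pointer and narrowing to 32 bits.
    unsigned smem_addr =
        static_cast<unsigned>(__cvta_generic_to_shared(smem_ptr));
    asm volatile("ld.shared.u32 %0, [%1];"
                 : "=r"(via_shared) : "r"(smem_addr));

    // Generic access: plain ld takes the ordinary 64-bit pointer and the
    // hardware determines at run time which space the address falls in.
    asm volatile("ld.u32 %0, [%1];"
                 : "=r"(via_generic) : "l"(smem_ptr));

    return via_shared + via_generic;
}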
AFAIK (although the conversion may be happening somewhere under the hood), the programmer need not worry about making the conversion explicitly when doing async copies using the C++-exposed methods.
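For example, a minimal sketch using the cooperative groups memcpy_async API (CUDA 11+; the kernel name, tile size, and pointer names are assumptions): both source and destination are passed as ordinary C++ pointers, with no explicit state-space conversion written by the programmer.

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

constexpr unsigned TILE = 256;   // illustrative tile size

__global__ void stage_tile(const float *gmem_src, float *gmem_dst)
{
    __shared__ float smem[TILE];
    cg::thread_block block = cg::this_thread_block();

    // Ordinary (generic) pointers on both sides; any state-space conversion
    // happens underneath in the library/compiler, not in user code.
    cg::memcpy_async(block, smem, gmem_src, sizeof(float) * TILE);
    cg::wait(block);   // wait for the asynchronous copy to complete

    auto t = block.thread_rank();
    if (t < TILE)
        gmem_dst[t] = smem[t] * 2.0f;
}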
However, the particular machine or PTX instructions that do this (async copy from global to shared) require a shared state-space pointer, not a generic pointer.
Therefore, if you extract the pointer from C++ but then issue the instruction via inline PTX, for example, it is necessary to convert the C++-style generic pointer to the form that the instruction needs. The __cvta_generic_to_shared() method is one possible way to make this conversion in that case, as in the sketch below.
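A minimal sketch of that path (hand-written; the function name and the fixed 16-byte copy size are assumptions, and it requires compiling for sm_80 or newer):

__device__ void copy16_async(void *smem_dst, const void *gmem_src)
{
    // Convert the generic shared-memory pointer to a .shared state-space
    // address. Shared addresses fit in 32 bits, so the value is commonly
    // narrowed and passed with an "r" constraint.
    unsigned smem_addr =
        static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));

    // cp.async.cg copies 16 bytes from global to shared memory; the
    // destination operand must be a .shared state-space address.
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :: "r"(smem_addr), "l"(gmem_src));

    // Commit the copy and wait for it to complete before the data is used.
    asm volatile("cp.async.commit_group;\n" ::);
    asm volatile("cp.async.wait_group 0;\n" ::);
}

Other threads in the block would still need to synchronize (e.g. with __syncthreads()) before reading the staged data.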
A state-space pointer in PTX or SASS has no real analog in C++, so the return type of size_t is used to hold the 64-bit quantity. One might hesitate to use a C++ pointer type for it, because that would place it logically back into the "generic" realm, and there is no logical sense in which that value could be dereferenced in C++ to do anything useful.