I got error from the following code when I intended to use ldmatrix
and mma
instruction. PTX Docu says that 'ldmatrix' is introduced in PTX 6.5. So I doubt the the PTX version could be one reason. I'd like to know how do we find out which PTX version are we using? What other possibilities could be the reason for those errors?
__device__ void
runldmatrix(typet & D, unsigned addr){
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 750))
int x, y, z, w;
asm volatile (
"ldmatrix.sync.aligned.x4.m8n8.shared.b16 {%0, %1, %2, %3}, [%4];"
: "=r"(x), "=r"(y), "=r"(z), "=r"(w)
: "r"(addr));
reinterpret_cast<int4 &>(D) = make_int4(x, y, z, w);
#else
assert(0);
#endif
}
__device__ void
runmma(typet & d, typet const & a,
typet const & b,typet const & c ){
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 750))
unsigned const *A = reinterpret_cast<unsigned const *>(&a);
unsigned const & B = reinterpret_cast<unsigned const &>(b);
unsigned const *C = reinterpret_cast<unsigned const *>(&c);
unsigned *D = reinterpret_cast<unsigned *>(&d);
asm volatile(
"mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 {%0,%1}, {%2,%3}, {%4}, {%5,%6};\n"
: "=r"(D[0]), "=r"(D[1])
: "r"(A[0]), "r"(A[1]), "r"(B), "r"(C[0]), "r"(C[1]));
#endif
}
/tmp/tmpxft_00002eb6_00000000-5_test_gemm.ptx, line 2637; error : Unknown modifier '.x4' ptxas
/tmp/tmpxft_00002eb6_00000000-5_test_gemm.ptx, line 2637; error : Unknown modifier '.m8n8' ptxas
/tmp/tmpxft_00002eb6_00000000-5_test_gemm.ptx, line 2637; error : Not a name of any known instruction: 'ldmatrix'
ptxas /tmp/tmpxft_000
02ee1_00000000-5_test_gemm.ptx, line 2611; error : Unknown modifier '.m16n8k8' ptxas /tmp/tmpxft_00002ee1_00000000-5_test_gemm.ptx, line 2611; error : Shape modifier required for instruction 'mma'
update:
I'm using 2080 Ti with CUDA 10.1, with the following cmake to ensure the computation ability of 7.5
cmake_minimum_required(VERSION 3.18)
project(Hello)
enable_language(CUDA)
add_executable(gunne test_gemm.cu)
target_include_directories(gunne PRIVATE include)
set_property(TARGET gunne PROPERTY CUDA_ARCHITECTURES 75)
Docu says that 'ldmatrix' is introduced in PTX 6.5. So I doubt the the PTX version could be one reason.
What other possibilities could be the reason for those errors?
Actually, that is the reason. CUDA 10.1 (the most recent version of it) included PTX version 6.4.
If you search the PTX manual that shipped with that version of CUDA, there is no ldmatrix
instruction.
Furthermore, if we look at the relevant section we observe that there is no m16n8k8
variant in that PTX version, for the mma.sync.aligned
instruction.
It seems that all of your compile errors collapse down to these issues. When I provide a definition for typet
(and #include <cassert>
) the code will compile for me on e.g. CUDA 11.4
I'd like to know how do we find out which PTX version are we using?
I can think of at least a couple ways, there are probably others.
An "offline" method is: Assuming you are using a CUDA Version of 8.0 or newer, go to the cuda docs page, select the PTX manual, then notice at the top the notation:
PTX ISA (PDF) - v11.5.1 (older)
Click the older link, and it will take you to a page where you can select the versioned online documentation that corresponds to your CUDA version. Then select the PTX manual there, and it will indicate what version it is for.
An alternative method would be to compile any CUDA code using your toolchain to PTX (e.g. nvcc my_kernel_code.cu --ptx
and study the resultant ptx file. Near the top it will have a notation like:
.version 7.4
This will tell you the PTX version your toolchain is generating.
I'm not suggesting your code is otherwise correct, merely that it can/will compile when using a suitable toolchain. You haven't provided a complete code, nor stated your intent, so I see no real point in going beyond this, but the usage of int
variables with the .b16
instruction doesn't make much sense to me. It appears to compile, however.