Error when compile cuda with ptx instruction 'ldmatrix' and 'mma'

I got error from the following code when I intended to use ldmatrix and mma instruction. PTX Docu says that 'ldmatrix' is introduced in PTX 6.5. So I doubt the the PTX version could be one reason. I'd like to know how do we find out which PTX version are we using? What other possibilities could be the reason for those errors?

    __device__ void
    runldmatrix(typet & D, unsigned addr){
      #if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 750))
        int x, y, z, w;            
        asm volatile (
          "ldmatrix.sync.aligned.x4.m8n8.shared.b16 {%0, %1, %2, %3}, [%4];" 
          : "=r"(x), "=r"(y), "=r"(z), "=r"(w) 
          : "r"(addr));             
        reinterpret_cast<int4 &>(D) = make_int4(x, y, z, w);
      #else
        assert(0);
      #endif
    }

    __device__  void
    runmma(typet & d, typet const & a, 
                typet  const & b,typet const & c ){
                  
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 750))                  
        unsigned const *A = reinterpret_cast<unsigned const *>(&a);
        unsigned const & B = reinterpret_cast<unsigned const &>(b);
        unsigned const *C = reinterpret_cast<unsigned const *>(&c);
        unsigned *D = reinterpret_cast<unsigned *>(&d);
        
          asm volatile(
              "mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 {%0,%1}, {%2,%3}, {%4}, {%5,%6};\n"
              : "=r"(D[0]), "=r"(D[1])
              : "r"(A[0]), "r"(A[1]), "r"(B), "r"(C[0]), "r"(C[1]));
#endif    
    }

/tmp/tmpxft_00002eb6_00000000-5_test_gemm.ptx, line 2637; error : Unknown modifier '.x4' ptxas

/tmp/tmpxft_00002eb6_00000000-5_test_gemm.ptx, line 2637; error : Unknown modifier '.m8n8' ptxas

/tmp/tmpxft_00002eb6_00000000-5_test_gemm.ptx, line 2637; error : Not a name of any known instruction: 'ldmatrix'

ptxas /tmp/tmpxft_000

02ee1_00000000-5_test_gemm.ptx, line 2611; error : Unknown modifier '.m16n8k8' ptxas /tmp/tmpxft_00002ee1_00000000-5_test_gemm.ptx, line 2611; error : Shape modifier required for instruction 'mma'

update:

I'm using 2080 Ti with CUDA 10.1, with the following cmake to ensure the computation ability of 7.5

cmake_minimum_required(VERSION 3.18)
project(Hello)
enable_language(CUDA)

add_executable(gunne test_gemm.cu)
target_include_directories(gunne PRIVATE include)
set_property(TARGET gunne PROPERTY CUDA_ARCHITECTURES 75)

Solution

Docu says that 'ldmatrix' is introduced in PTX 6.5. So I doubt the the PTX version could be one reason.

What other possibilities could be the reason for those errors?

Actually, that is the reason. CUDA 10.1 (the most recent version of it) included PTX version 6.4.

If you search the PTX manual that shipped with that version of CUDA, there is no ldmatrix instruction.

Furthermore, if we look at the relevant section we observe that there is no m16n8k8 variant in that PTX version, for the mma.sync.aligned instruction.

It seems that all of your compile errors collapse down to these issues. When I provide a definition for typet (and #include <cassert>) the code will compile for me on e.g. CUDA 11.4

I'd like to know how do we find out which PTX version are we using?

I can think of at least a couple ways, there are probably others.

An "offline" method is: Assuming you are using a CUDA Version of 8.0 or newer, go to the cuda docs page, select the PTX manual, then notice at the top the notation:

PTX ISA (PDF) - v11.5.1 (older)

Click the older link, and it will take you to a page where you can select the versioned online documentation that corresponds to your CUDA version. Then select the PTX manual there, and it will indicate what version it is for.

An alternative method would be to compile any CUDA code using your toolchain to PTX (e.g. nvcc my_kernel_code.cu --ptx and study the resultant ptx file. Near the top it will have a notation like:

.version 7.4

This will tell you the PTX version your toolchain is generating.

I'm not suggesting your code is otherwise correct, merely that it can/will compile when using a suitable toolchain. You haven't provided a complete code, nor stated your intent, so I see no real point in going beyond this, but the usage of int variables with the .b16 instruction doesn't make much sense to me. It appears to compile, however.