Tensor cores can be programmatically accessed through the WMMA interface in CUDA (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma and https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/). Recently, with the Ampere generation of cards, Nvidia announced the ability to accelerate operations on structured-sparse matrices with the tensor cores, as seen here: https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/
The format presented appears to take pairs of non-zero elements together with their positions within four-element segments (2-bit indices). However, looking at the WMMA documentation, I can't find any mention of this format, or of how to access those special tensor core operations. The announcement page doesn't illuminate this either, AFAICT.
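Concretely, my reading of the encoding would be something like:

```c
// Illustrative only -- my reading of the 2:4 format: each group of four
// elements keeps its two non-zeros plus their positions as 2-bit indices.
// dense group : { 0.0, 1.5, 0.0, 2.5 }  // 2:4 sparse: two non-zeros per four
// compressed  : { 1.5, 2.5 }            // the two kept values
// metadata    : indices 1 and 3, packed as two 2-bit fields
```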
How do I access the sparse tensor core functionality in CUDA?
The blog post in your question links to the following paper: *Accelerating Sparse Deep Neural Networks*, https://arxiv.org/pdf/2104.08378.pdf
In Section 3.2 it says:

> It is the application’s responsibility to ensure that the first operand is a matrix stored in the compressed 2:4 format. cuSPARSELt and other libraries provide APIs for compression and sparse math operations, while, starting in version 8.0, the TensorRT SDK performs these functions for 2:4 sparse weights automatically. NVIDIA libraries require that input dimensions of a sparse matrix multiplication be multiples of 16 and 32 for 16-bit (FP16/BF16) and 8b-integer formats, respectively.
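For a sense of what that flow looks like, here is a rough sketch based on NVIDIA's early cuSPARSELt sample code. Error checking is omitted, and several signatures (e.g. plan initialization and compression) have changed between cuSPARSELt releases, so treat this as an outline and check the headers of the version you have installed:

```cpp
#include <cusparseLt.h>   // cuSPARSELt ships separately from the CUDA toolkit
#include <cuda_fp16.h>

// Outline of D = alpha*A*B + beta*C with a 2:4-sparse FP16 matrix A.
// dA, dB, dC, dD are device pointers prepared by the caller; m, n, k are
// multiples of 16, as the paper requires for FP16.
void spmma_outline(__half* dA, __half* dB, __half* dC, __half* dD,
                   int m, int n, int k, cudaStream_t stream)
{
    cusparseLtHandle_t handle;
    cusparseLtInit(&handle);

    // A is declared 50% (2:4) structured-sparse; B and C are dense.
    cusparseLtMatDescriptor_t matA, matB, matC;
    unsigned alignment = 16;
    cusparseLtStructuredDescriptorInit(&handle, &matA, m, k, /*ld=*/k,
                                       alignment, CUDA_R_16F,
                                       CUSPARSE_ORDER_ROW,
                                       CUSPARSELT_SPARSITY_50_PERCENT);
    cusparseLtDenseDescriptorInit(&handle, &matB, k, n, /*ld=*/n,
                                  alignment, CUDA_R_16F, CUSPARSE_ORDER_ROW);
    cusparseLtDenseDescriptorInit(&handle, &matC, m, n, /*ld=*/n,
                                  alignment, CUDA_R_16F, CUSPARSE_ORDER_ROW);

    cusparseLtMatmulDescriptor_t matmul;
    cusparseLtMatmulDescriptorInit(&handle, &matmul,
                                   CUSPARSE_OPERATION_NON_TRANSPOSE,
                                   CUSPARSE_OPERATION_NON_TRANSPOSE,
                                   &matA, &matB, &matC, &matC,
                                   CUSPARSE_COMPUTE_16F);

    cusparseLtMatmulAlgSelection_t alg_sel;
    cusparseLtMatmulAlgSelectionInit(&handle, &alg_sel, &matmul,
                                     CUSPARSELT_MATMUL_ALG_DEFAULT);
    cusparseLtMatmulPlan_t plan;
    size_t workspace_size = 0;
    cusparseLtMatmulPlanInit(&handle, &plan, &matmul, &alg_sel,
                             workspace_size);

    // Prune A to the 2:4 pattern, then compress it into the packed
    // values-plus-2-bit-metadata representation the tensor cores consume.
    cusparseLtSpMMAPrune(&handle, &matmul, dA, dA,
                         CUSPARSELT_PRUNE_SPMMA_TILE, stream);
    size_t compressed_size;
    cusparseLtSpMMACompressedSize(&handle, &plan, &compressed_size);
    void* dA_compressed;
    cudaMalloc(&dA_compressed, compressed_size);
    cusparseLtSpMMACompress(&handle, &plan, dA, dA_compressed, stream);

    // The sparse-tensor-core matmul itself.
    float alpha = 1.0f, beta = 0.0f;
    cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB,
                     &beta, dC, dD, /*workspace=*/nullptr, &stream, 1);

    cusparseLtMatmulPlanDestroy(&plan);
    cudaFree(dA_compressed);
    cusparseLtDestroy(&handle);
}
```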
Alternatively, sparse tensor core operations can be performed manually using the PTX `mma.sp` instruction, which is explained in Section 9.7.13.5 of the PTX documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-sparse-mma
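A minimal sketch of what that looks like from CUDA via inline PTX, assuming the m16n8k16 shape with f16 inputs and f32 accumulation (requires sm_80 or newer). Note that the per-thread fragment layouts and the encoding of the 2-bit metadata and sparsity selector are specified in that section of the PTX ISA and must be honored when loading the registers:

```cpp
#include <cstdint>

// Warp-wide 2:4-sparse MMA: each of the 32 threads supplies its fragment
// of the operands; the compressed A fragment is half the size of the dense
// case, and `e` carries the 2-bit column indices of the kept elements.
// The trailing 0x0 is the sparsity selector (see the PTX docs).
__device__ void mma_sp_m16n8k16_f16_f32(float d[4],
                                        const uint32_t a[2],  // compressed A: 2 x .f16x2
                                        const uint32_t b[2],  // dense B:      2 x .f16x2
                                        const float    c[4],  // f32 accumulator fragment
                                        uint32_t e)           // 2:4 metadata indices
{
    asm volatile(
        "mma.sp.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6,%7}, {%8,%9,%10,%11}, %12, 0x0;\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]),
          "r"(e));
}
```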