[SOLVED] How am I able to run Tensor Core instructions without actually having Tensor Cores?


I'm using CUDA's WMMA API to multiply matrix fragments on a GTX 1660 Ti. This GPU doesn't have Tensor Cores, yet the SASS generated for my code contains HMMA.1688.F32 instructions, which are Tensor Core instructions. How can that happen?
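For reference, here is a minimal sketch of the kind of WMMA code in question. It is not the original program: the kernel name `wmma_tile`, the 16x16x16 tile shape, and the all-ones inputs are assumptions chosen for illustration. Compiled for `sm_75`, a half-input, float-accumulate `mma_sync` like this one is the sort of call that shows up as HMMA instructions in the SASS.

```cuda
// Build:        nvcc -arch=sm_75 wmma_tile.cu -o wmma_tile
// Inspect SASS: cuobjdump -sass wmma_tile | grep HMMA
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes a single 16x16 tile: C = A * B,
// with half-precision inputs and float accumulation.
__global__ void wmma_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

int main() {
    const int n = 16 * 16;
    half *a, *b;
    float *c;
    cudaMallocManaged(&a, n * sizeof(half));
    cudaMallocManaged(&b, n * sizeof(half));
    cudaMallocManaged(&c, n * sizeof(float));

    // All-ones inputs, so every element of C is a sum of 16 ones.
    for (int i = 0; i < n; ++i) {
        a[i] = __float2half(1.0f);
        b[i] = __float2half(1.0f);
        c[i] = 0.0f;
    }

    wmma_tile<<<1, 32>>>(a, b, c);           // exactly one warp
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("c[0] = %f\n", c[0]);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```

The WMMA API requires compute capability 7.0 or higher, so the file has to be built with `-arch=sm_70` or later; `sm_75` is the target that matches Turing.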

Relevant information:

NVIDIA confirming my card doesn't have Tensor Cores: https://www.nvidia.com/en-eu/geforce/10-series/ (Technology Features table comparing GTX 10, GTX 16 and RTX 20 Series).
HMMA.1688.F32 instructions linked to Tensor Core units:
- https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9839-discovering-the-turing-t4-gpu-architecture-with-microbenchmarks.pdf
- https://ieeexplore.ieee.org/document/9139835 (account required to access, but more detailed)

Solution

For binary compatibility, the "non-Tensor-Core" members of the Turing family have hardware in the SM that executes Tensor Core instructions, albeit at a much lower throughput than a real Tensor Core unit. This applies to any GPU variant (e.g. GeForce, Quadro) that is derived from or based on the TU116 or TU117 GPUs. The GTX 1660 Ti is a TU116 part, which is why the HMMA instructions run on it.
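One way to see the compatibility requirement from the software side: TU116 and TU117 report compute capability 7.5, the same as the Tensor-Core-equipped Turing GPUs, so code built for `sm_75` contains the same SASS regardless of which chip runs it. The short device query below is only an illustrative sketch that prints what the runtime reports for each installed GPU.

```cuda
// Build: nvcc device_cc.cu -o device_cc
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA device found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // The compute capability selects the SASS target; it does not say
        // whether dedicated Tensor Core units are physically present.
        printf("%d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

Because the compute capability alone does not distinguish the two kinds of Turing chip, a program that cares about actual Tensor Core throughput would have to measure it or check the device name.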