As I understand it, an Nvidia Tensor Core multiplies two 4x4 matrices and adds the result to a third matrix. Multiplying two 4x4 matrices produces a 4x4 matrix, and adding two 4x4 matrices also produces a 4x4 matrix. Yet Nvidia states that "Each Tensor Core provides a 4x4x4 matrix processing array".
Four multiply-accumulate operations are needed for each row-column pair, i.e. for each element of the result. I thought maybe the last x4 came from the intermediate results before the accumulation, but that doesn't quite fit the description on Nvidia's pages.
"The FP16 multiply results in a full precision result that is accumulated in FP32 operations with the other products in a given dot product for a 4x4x4 matrix multiply, as Figure 9 shows." https://developer.nvidia.com/blog/cuda-9-features-revealed/
A 4x4x4 matrix multiply? I thought matrices were two-dimensional by definition.
Can someone please explain where the last x4 comes from?
4x4x4 is just notation for the multiplication of one 4x4 matrix with another 4x4 matrix.
If you were to multiply a 4x8 matrix with an 8x4 matrix, you would have a 4x8x4 multiply. So if A is NxK and B is KxM, the product can be referred to as an NxKxM matrix multiply. The notation also counts the work: an NxKxM multiply takes N·K·M multiply-accumulate operations, so 4x4x4 corresponds to the 64 fused multiply-adds a Tensor Core performs per operation (see the sketch below).
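To make the counting concrete, here is a minimal C++ sketch of a generic NxKxM multiply-accumulate, D = A*B + C (the function name and test values are just illustrative, not anything from Nvidia's API). With N = K = M = 4 the innermost statement runs 4*4*4 = 64 times, which is exactly what the 4x4x4 label counts.

```cpp
#include <cstdio>

// D = A*B + C, where A is NxK, B is KxM, C and D are NxM.
// The innermost statement is one multiply-accumulate; it runs N*K*M times.
template <int N, int K, int M>
void matmul_acc(const float (&A)[N][K], const float (&B)[K][M],
                const float (&C)[N][M], float (&D)[N][M]) {
    for (int n = 0; n < N; ++n)
        for (int m = 0; m < M; ++m) {
            float acc = C[n][m];
            for (int k = 0; k < K; ++k)
                acc += A[n][k] * B[k][m];   // one FMA per (n, k, m) triple
            D[n][m] = acc;
        }
}

int main() {
    float A[4][4] = {}, B[4][4] = {}, C[4][4] = {}, D[4][4];
    for (int i = 0; i < 4; ++i) A[i][i] = B[i][i] = 1.0f;  // identity inputs
    matmul_acc(A, B, C, D);   // a 4x4x4 multiply: 64 multiply-accumulates
    printf("D[0][0] = %f\n", D[0][0]);  // prints 1.000000
    return 0;
}
```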
I had a quick look and found this paper, which uses this exact notation (e.g. in Section 4.6 on page 36): https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/153863/eth-6705-01.pdf