
D3D12 Cooperative Vectors: thread to vector correspondence


The cooperative vectors feature seems a bit magical to me. Before, each thread owned its local variables. Now vector<float, 8> MulRes seems to be shared across the threads (i.e. the first 8 threads own the first 8 elements).

What happens when there are multiple thread groups? Can threads in different thread groups even communicate with each other?

Perhaps I need to read the Nvidia-specific blog posts... 1 2


https://devblogs.microsoft.com/directx/cooperative-vector/

The “Cooperative” in Cooperative Vector refers to an implementation detail of the hardware acceleration, where individual vector-matrix multiply requests submitted by threads in a wave are combined into a matrix-matrix operation accelerated collectively for the wave. This name doesn’t appear in HLSL code itself, just vector types and operations like vector-matrix multiplication as shown in the examples below.

// Byte Address Buffers used to store vectors/matrices

ByteAddressBuffer InVectors; 
RWByteAddressBuffer OutVectors;
ByteAddressBuffer InMatrices;
RWByteAddressBuffer OutMatrices;

// System header containing Cooperative Vector types, enums, and functions.
#include <dx/linalg.h>

// Such elements are all under the linalg namespace.
using namespace dx::linalg;

// Hand-wavey utility function to generate the input vector for Mul and MulAdd.
template<typename T, uint N> vector<T,N> GenerateVector(...);

[numthreads(8,1,1)]
[shader("compute")]
void main() {

  // Matrix Vector Multiply Mul() Example

  // Matrix and vector to be multiplied together
  uint MatOffset = 0;
  MatrixRef<DATA_TYPE_FLOAT32, 8, 6, MATRIX_LAYOUT_ROW_MAJOR> MulMatrix = {
    InMatrices, MatOffset, /*stride*/6 * sizeof(float)};
  MatOffset += 8 * 6 * sizeof(float);
  
  vector<float, 6> MulVector = GenerateVector<float, 6>(...);

  vector<float, 8> MulRes = Mul<float>(MulMatrix,
                    MakeInterpretedVector<DATA_TYPE_FLOAT32>(MulVector));
}
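
The same dx/linalg.h header also has a MulAdd() variant that fuses a bias-vector add into the multiply, which is the building block for MLP layers. A condensed sketch continuing the body of main() above (hedged: I'm assuming the VectorRef spelling and the MulAdd signature from the header; the offsets and dimensions are my own):

  // Matrix Vector Multiply-Add MulAdd() Example (sketch, not from the post)

  // Reuse the 8x6 shape from above; MatOffset already points past MulMatrix.
  MatrixRef<DATA_TYPE_FLOAT32, 8, 6, MATRIX_LAYOUT_ROW_MAJOR> MulAddMatrix = {
    InMatrices, MatOffset, /*stride*/6 * sizeof(float)};

  // 8-element bias vector stored at the start of InVectors (offset is made up).
  VectorRef<DATA_TYPE_FLOAT32> BiasVector = {InVectors, /*offset*/0};

  vector<float, 6> MulAddVector = GenerateVector<float, 6>(...);

  // MulAddRes = MulAddMatrix * MulAddVector + BiasVector, one call per thread.
  vector<float, 8> MulAddRes = MulAdd<float>(MulAddMatrix,
                    MakeInterpretedVector<DATA_TYPE_FLOAT32>(MulAddVector),
                    BiasVector);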

Solution

  • I was wrong. Everything is per thread/lane

    https://developer.nvidia.com/blog/neural-rendering-in-nvidia-optix-using-cooperative-vectors/#why_cooperative_vectors

    Since evaluating an MLP is a series of vector-matrix multiplies, when all the threads in a warp evaluate the same MLP side by side, the cooperative vector API can treat the combined warp’s affine operation as a matrix-matrix multiply plus a bias. This is what cooperative means: threads band together to turn several vector-matrix operations into matrix-matrix operations.

    MatrixRef is shared by all threads; the input and output vectors are per-thread
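
    To make that concrete: an MLP evaluation is just the excerpt above, chained. The MatrixRef arguments are wave-uniform (every lane names the same weights, which is what lets the driver fuse the wave's requests into a matrix-matrix multiply), while the activation vectors are ordinary per-thread values. A hedged sketch of a two-layer MLP on top of the same header (the dimensions, buffer layout, and ReLU are my own illustration):

      // Each invocation evaluates the whole MLP for its own input;
      // nothing below is indexed by lane.
      vector<float, 8> EvaluateMLP(vector<float, 6> Input) {
        // Layer 0: 16x6 weights, wave-uniform reference into InMatrices
        MatrixRef<DATA_TYPE_FLOAT32, 16, 6, MATRIX_LAYOUT_ROW_MAJOR> W0 = {
          InMatrices, 0, /*stride*/6 * sizeof(float)};
        vector<float, 16> Hidden = Mul<float>(W0,
          MakeInterpretedVector<DATA_TYPE_FLOAT32>(Input));
        Hidden = max(Hidden, 0); // ReLU is plain per-thread vector math

        // Layer 1: 8x16 weights, stored right after layer 0
        MatrixRef<DATA_TYPE_FLOAT32, 8, 16, MATRIX_LAYOUT_ROW_MAJOR> W1 = {
          InMatrices, 16 * 6 * sizeof(float), /*stride*/16 * sizeof(float)};
        return Mul<float>(W1, MakeInterpretedVector<DATA_TYPE_FLOAT32>(Hidden));
      }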


    To understand Cooperative Vectors, the best place to start is the CUDA WMMA API:

    https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/#programmatic_access_to_tensor_cores_in_cuda_90

    https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-matrix-functions

    https://github.com/NVIDIA/cuda-samples/blob/8a9e2c830c8a336b4f48c79e5ed5837031eb8551/Samples/3_CUDA_Features/cudaTensorCoreGemm/cudaTensorCoreGemm.cu#L390

    The purpose is to gain programmatic access to Tensor Core operations. A warp/wave can execute a (16,16,16) matrix-matrix multiplication on a Tensor Core. WMMA has an API, wmma::load_matrix_sync, that gets the warp to cooperatively load matrix fragments (from shared memory, in the linked sample) into registers spread across the lanes; the accumulated (+=) result is written back to shared memory with wmma::store_matrix_sync.
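
    In other words, each warp-wide WMMA call computes one fixed-size tile of a GEMM,

      D (16x16) = A (16x16) * B (16x16) + C (16x16)

    with load_matrix_sync filling the A/B/C fragments, mma_sync doing the multiply, and store_matrix_sync writing D back out. Accumulation (the +=) is just feeding the previous D back in as C.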

    How this maps to DX12

    Support for matrix-matrix operations is planned for the future. For now only matrix-vector operations are supported.

    Each thread in the wave owns its input and output vector. A (16,16)x(16,16) matrix-matrix multiplication is equivalent to 16 (16,16)x(16,1) matrix-vector multiplications (when the matrix is uniform across all of the muls). The driver must compile arbitrarily sized matrix-vector operations down to the hardware's fixed-size matrix-matrix multiplications (like the (16,16,16) shape above).
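
    To see the equivalence, stack the 16 per-thread input vectors side by side as the columns of one matrix; the wave's 16 matrix-vector products are then exactly the columns of a single matrix-matrix product:

      M * [v0 | v1 | ... | v15] = [M*v0 | M*v1 | ... | M*v15]

    where M is the shared (16,16) matrix and vi is the vector owned by lane i.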


    The actual feature I was looking for has been delayed until Shader Model 6.9 😢

    https://github.com/microsoft/hlsl-specs/pull/61