The cooperative vectors feature seems a bit magical to me. Before, each thread owned its local variables. Now vector<float, 8> MulRes seems to be shared across the threads (i.e. the first 8 threads own the first 8 elements).
What happens when there are multiple thread groups? Can threads in different thread groups even communicate with each other?
Perhaps I need to read the Nvidia-specific blog posts... 1 2
https://devblogs.microsoft.com/directx/cooperative-vector/
The “Cooperative” in Cooperative Vector refers to an implementation detail of the hardware acceleration, where individual vector-matrix multiply requests submitted by threads in a wave are combined into a matrix-matrix operation accelerated collectively for the wave. This name doesn’t appear in HLSL code itself, just vector types and operations like vector-matrix multiplication as shown in the examples below.
// Byte Address Buffers used to store vectors/matrices
ByteAddressBuffer InVectors;
RWByteAddressBuffer OutVectors;
ByteAddressBuffer InMatrices;
RWByteAddressBuffer OutMatrices;

// System header containing Cooperative Vector types, enums, and functions.
#include <dx/linalg.h>
// Such elements are all under the linalg namespace.
using namespace dx::linalg;

// Hand-wavey utility function to generate the input vector for Mul and MulAdd.
template<typename T, uint N> vector<T,N> GenerateVector(...);

[numthreads(8,1,1)]
[shader("compute")]
void main() {
  // Matrix Vector Multiply Mul() Example

  // Matrix and vector to be multiplied together
  uint MatOffset = 0;
  MatrixRef<DATA_TYPE_FLOAT32, 8, 6, MATRIX_LAYOUT_ROW_MAJOR> MulMatrix = {
      InMatrices, MatOffset, /*stride*/ 6 * sizeof(float)};
  MatOffset += 8 * 6 * sizeof(float);
  vector<float, 6> MulVector = GenerateVector<float, 6>(...);
  vector<float, 8> MulRes = Mul<float>(MulMatrix,
      MakeInterpretedVector<DATA_TYPE_FLOAT32>(MulVector));
}
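The blog's example continues with a MulAdd variant that folds a bias vector into the same operation. A minimal sketch of that pattern, to sit inside main() above; the VectorRef bias type and MulAdd signature follow the linked post, but treat the exact sizes and offsets here as illustrative assumptions:

  // Matrix Vector Multiply-Add MulAdd() Example (sketch; sizes illustrative)
  MatrixRef<DATA_TYPE_FLOAT32, 8, 6, MATRIX_LAYOUT_ROW_MAJOR> MulAddMatrix = {
      InMatrices, MatOffset, /*stride*/ 6 * sizeof(float)};
  MatOffset += 8 * 6 * sizeof(float);
  // Assumed layout: the bias vector is stored right after its matrix
  VectorRef<DATA_TYPE_FLOAT32> BiasVector = {InMatrices, MatOffset};
  vector<float, 6> MulAddVector = GenerateVector<float, 6>(...);
  // MulAdd computes MulAddMatrix * MulAddVector + BiasVector in one call
  vector<float, 8> MulAddRes = MulAdd<float>(MulAddMatrix,
      MakeInterpretedVector<DATA_TYPE_FLOAT32>(MulAddVector), BiasVector);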
I was wrong. Everything is per thread/lane.
Since evaluating an MLP is a series of vector-matrix multiplies, when all the threads in a warp evaluate the same MLP side by side, the cooperative vector API can treat the combined warp’s affine operation as a matrix-matrix multiply plus a bias. This is what cooperative means: threads band together to turn several vector-matrix operations into matrix-matrix operations.
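Concretely, take a 32-lane warp as an example: if every lane $i$ evaluates the same affine layer $y_i = W x_i + b$, stacking the per-lane vectors as columns turns the warp's combined work into one matrix-matrix multiply plus a broadcast bias:

$$Y = \begin{bmatrix} y_0 & \cdots & y_{31} \end{bmatrix} = W \begin{bmatrix} x_0 & \cdots & x_{31} \end{bmatrix} + \begin{bmatrix} b & \cdots & b \end{bmatrix} = WX + B.$$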
MatrixRef is shared by all threads.
To understand Cooperative Vectors, the best place to start is the CUDA WMMA API:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-matrix-functions
The purpose is to gain programmatic access to Tensor Core operations. A warp/wave can execute a (16,16,16) matrix-matrix multiply-accumulate on a Tensor Core. WMMA has an API, wmma::load_matrix_sync, to get the warp to cooperatively load matrix tiles from (shared or global) memory into fragments; the accumulated (+=) result is written back to memory with wmma::store_matrix_sync.
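Putting shapes on that: the multiply-accumulate itself is wmma::mma_sync, and for the (16,16,16) fragment shape one warp-wide call computes

$$D_{16 \times 16} = A_{16 \times 16}\, B_{16 \times 16} + C_{16 \times 16},$$

where A, B, and C are the fragments loaded by load_matrix_sync and D is written back by store_matrix_sync.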
Support for matrix-matrix operations is planned for the future. For now only matrix-vector operations are supported.
Each thread in the wave owns its input and output vector. A (16,16)x(16,16) matrix-matrix multiplication is equivalent to 16 (16,16)x(16,1) matrix-vector multiplications (when the matrix is uniform across all the muls), which is just the stacking identity above read in reverse. The driver must compile arbitrarily sized matrix-vector operations into these fixed-size (N,N,N) multiplications.
The actual feature I was looking for has been delayed until Shader Model 6.9 😢