I'm new to SYCL. The following code compiles fine but gives the wrong result. It computes the product of two matrices (A, M x K, and B, K x N) into a result matrix (C, M x N).
#include <array>
#include <iostream>
#include <sycl/sycl.hpp>

constexpr int M = 17,
              K = 18,
              N = 19,
              tile_size = 9;

int main( void )
{
    // Initialize matrices to be zero and "1" in the diagonal.
    std::array< int, M * K > a;
    for( int i = 0; i < M * K; ++i ) a[i] = 0;
    for( int i = 0; i < (M < K ? M : K); ++i ) a[i * (M < K ? K + 1 : M + 1)] = 1;
    std::array< int, K * N > b;
    for( int i = 0; i < K * N; ++i ) b[i] = 0;
    for( int i = 0; i < (K < N ? K : N); ++i ) b[i * (K < N ? N + 1 : K + 1)] = 1;
    // Result matrix all zeros for now.
    std::array< int, M * N > c;
    for( int i = 0; i < M * N; ++i ) c[i] = 0;

    sycl::queue q;
    sycl::buffer a_buf{ a },
                 b_buf{ b },
                 c_buf{ c };
    q.submit( [&]( sycl::handler& h )
    {
        sycl::accessor matrix_a{ a_buf, h, sycl::read_only };
        sycl::accessor matrix_b{ b_buf, h, sycl::read_only };
        sycl::accessor matrix_c{ c_buf, h, sycl::write_only, sycl::no_init };
        h.parallel_for( sycl::nd_range< 2 >{ { M, N }, { 1, tile_size } },
            [=]( sycl::nd_item< 2 > item )
            {
                auto sg = item.get_sub_group();
                const int m = item.get_global_id()[0],
                          n = item.get_global_id()[1];
                int i = item.get_local_id()[1];
                int sum = 0;
                for( int k = 0; k < K; k += tile_size )
                {
                    int tile_a = matrix_a[m * K + k + i];
                    for( int t = 0; t < tile_size; ++t )
                        sum += sycl::group_broadcast( sg, tile_a, t ) *
                               matrix_b[(k + t) * N + n];
                }
                matrix_c[m * N + n] = sum;
            }
        );
    }
    );

    sycl::host_accessor host_c{ c_buf, sycl::read_only };
    for( int i = 0; i < N * M; ++i )
        c[i] = host_c[i];

    std::cout << "A=\n";
    for( int m = 0; m < M; ++m )
    {
        for( int k = 0; k < K; ++k )
            std::cout << a[m * K + k] << ' ';
        std::cout << '\n';
    }
    std::cout << "B=\n";
    for( int k = 0; k < K; ++k )
    {
        for( int n = 0; n < N; ++n )
            std::cout << b[k * N + n] << ' ';
        std::cout << '\n';
    }
    std::cout << "C=\n";
    for( int m = 0; m < M; ++m )
    {
        for( int n = 0; n < N; ++n )
            std::cout << c[m * N + n] << ' ';
        std::cout << '\n';
    }
    return 0;
}
It compiles fine with the command line icpx -fsycl matrix_sub_group.cpp -Wall -o matrix_sub_group
but gives the wrong result (all zeros).
Surprising (for me) is that the result comes out right if I choose other combinations of matrix dimensions. For example,
#include <array>
#include <sycl/sycl.hpp>

constexpr int M = 11,
              K = 12,
              N = 13,
              tile_size = 6;

int main( void ) { /* same as before */ }
gives the right result. So I think I'm missing something important.
EDIT
In the meantime, I experimented some more. The original code executed on the CPU. If I force it to run on the GPU (NVIDIA), I get the runtime error:
non-uniform work-groups are not supported by the target device.
The only way to avoid that error on the GPU is to select a tile_size that divides all of the matrix dimensions exactly.
You submit your command group function to a queue, but don't wait until it finishes execution. From the documentation for submit:

... Submit a command group function object to the queue for asynchronous execution. ...

So, submit immediately returns (and thus, in particular, you could immediately submit more tasks to be executed asynchronously). To wait until everything you submitted to the queue finishes, you can use the wait method:

Performs a blocking wait for the completion of all enqueued tasks in the queue. ...

(This is also reflected in their usage example.)
Since you don't do so, there's a data race. When the matrices are smaller, there's a higher chance that the computation finishes before you start reading c from the device (or that, at least, when you read each particular element, it has already been fully computed). When the matrices are bigger, this chance is lower. In your first example, all elements of the device's c_buf are still zeros when you retrieve them. In the second example, the matrices are small enough that, by the time you read it, c_buf contains correct results consistently (or seemingly consistently; there can be no guarantee here).

So, to solve your problem, just add q.wait() after your submit and before reading the result.
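Schematically, with the names from your code:

q.submit( [&]( sycl::handler& h )
{
    // ... accessors and parallel_for, unchanged ...
} );

q.wait(); // block until the submitted kernel has finished

sycl::host_accessor host_c{ c_buf, sycl::read_only };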
Note, however, that you don't actually have to manually copy the device buffer back to host memory; the buffer destructor can do that for you when the provided hostData doesn't point to const:
{
    sycl::buffer a_buf{ a },
                 b_buf{ b },
                 c_buf{ c };
    q.submit( [&]( sycl::handler& h )
    {
        // ...
    }
    );
    // q.wait(); // see below
}
// c contains the result already due to c_buf destruction
The buffer synchronization rules also mandate that the destructor takes care of blocking until all work in queues which use this buffer (i.e. have a corresponding accessor) is completed:

... When the buffer is destroyed, the destructor will block until all work in queues on the buffer have completed, then copy the contents of the buffer back to the host memory (if required) and then return. ...

So, you don't even need to use q.wait() with this approach. This is a more fine-grained blocking mechanism than a general wait, because it doesn't wait for completion of other submitted tasks, but in your case (only 1 task) there's no difference.
There are also problems with the logic. Your algorithm, as it is now, depends on several things to be correct:

1. Every work group is exactly one subgroup of size tile_size (i.e., the work group size doesn't exceed the subgroup size).
2. K and N (not necessarily M) are divisible by tile_size.

Condition 1 is fulfilled iff the max work group size (tile_size) is less or equal to (and thus fits into) the max subgroup size (which is typically a power of 2 and represents, roughly, the SIMT width into which work groups are attempted to be separated, with potential remainder subgroups). Note that in DPC++, if you don't want to leave this choice to the compiler, the max subgroup size can be specified by the [[intel::reqd_sub_group_size(x)]] attribute on the kernel, where the allowed sizes x for your device can be found via q by q.get_device().get_info<sycl::info::device::sub_group_sizes>(). In particular, if you set x >= tile_size, the condition is fulfilled (which in itself could make the larger of your cases, where workgroup fragmentation occurs, seem to work; see below).
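For illustration (not part of the original code), querying the supported sizes on the host could look like this, assuming the includes from your program plus <vector>:

// Print the subgroup sizes the device behind q supports,
// e.g. "8 16 32" (the actual set is device-specific).
const std::vector<std::size_t> sizes =
    q.get_device().get_info<sycl::info::device::sub_group_sizes>();
std::cout << "Supported subgroup sizes:";
for( std::size_t s : sizes )
    std::cout << ' ' << s;
std::cout << '\n';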
Both conditions can be overcome by slightly changing the kernel. But first:
When M = 11, K = 12, N = 13, tile_size = 6, assuming the max subgroup size is 8 or larger, your code works correctly by accident, because the last column of b is filled with zeros. Note that N isn't divisible by tile_size, and even though there's no out-of-bounds access anywhere, you have, in addition to 'normal' work groups of size 6 (having a single 'remainder' subgroup of size 6), 'remainder' work groups (and, thus, subgroups) of size 1. For them, the logic is incorrect, in particular because sycl::group_broadcast(sg, tile_a, t), when t > 0, asks for an element from a work item which doesn't exist in sg. According to the documentation, this is undefined behaviour. Though, as long as this UB is no worse than "return some garbage", you still have a correct result, because it is multiplied by 0 anyway (from the last column of b). I was curious and tested what the behaviour is in my case (Ryzen 3700X CPU through the Intel OpenCL driver), and it worked correctly. Also, group_broadcast consistently returned 0 for out-of-subgroup-bounds t (but see below) when checked using sycl::stream (doc) from within the kernel like this:
for (int t = 0; t < tile_size; ++t) {
    const int sg_size = sg.get_local_range()[0];
    if (sg_size != tile_size) // For t >= sg_size broadcast is undefined
        out << "SG size = " << sg_size << ", t = " << t
            << ", broadcast tile_a = " << sycl::group_broadcast(sg, tile_a, t) << sycl::endl;
    // ...
}
(out is defined as sycl::stream out(1024, 256, h); at the beginning of the CGF)
When M = 17, K = 18, N = 19, tile_size = 9, assuming a max subgroup size of exactly 8 (which it is for me and likely for you, otherwise the result would probably be as in the previous case), you have 2 subgroups (with sizes 8 and 1) per "usual" work group of size 9 (and 1 subgroup for the "remainder" work group of size 1), so the logic of the algorithm is fundamentally incorrect. But, for your rectangular-diagonal matrices, it could still form a correct result, because the addends to sum which aren't correctly included due to not having a corresponding work item would still be 0 (if calculated correctly for these matrices) and are 0 if group_broadcast returns 0 for out-of-bounds t. But in this case, t could be equal to 8, which is out of bounds of not just the subgroup size, but of the max subgroup size. For some reason, this causes the UB to "shine" and cause global effects (c being completely filled with 0). When I replaced the broadcast call with (t < 8 ? sycl::group_broadcast(sg, tile_a, t) : 0) to manually zeroize the expression for out-of-max-subgroup-bounds t, the result was correct. Other "fixes" were running the Debug build or including the aforementioned diagnostic stream output (observer effect for the quantum mechanics of optimization, right? :) ). The latter, btw, showed the broadcast result to still be zeroized even when t == 8, but when I modified the if condition to output to out only for t == 8, the program hung altogether (due to some issue/misoptimization manifesting within ScalarToStr, if pausing during debugging the Release build is to be believed). Weird, but UB is UB and has its reputation for a reason.
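For completeness, the manual zeroization mentioned above sits in the inner loop like this (the literal 8 being my device's max subgroup size, so only a workaround, not a real fix):

for( int t = 0; t < tile_size; ++t )
    // Zeroize terms whose broadcast index would exceed the max subgroup size (8 here);
    // broadcasting from a non-existent work item is UB.
    sum += (t < 8 ? sycl::group_broadcast( sg, tile_a, t ) : 0) *
           matrix_b[(k + t) * N + n];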
First, let's fix tile_size to be a supported max subgroup size, for example 8, and ensure that this max subgroup size is chosen by using [[intel::reqd_sub_group_size(tile_size)]]. Then condition 1 is fulfilled immediately.
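For illustration, the attribute could be attached to the kernel like this (a sketch; it addresses condition 1 only, and assumes 8 is among the device's supported subgroup sizes):

constexpr int tile_size = 8; // assumed to be in sub_group_sizes for the device

h.parallel_for( sycl::nd_range< 2 >{ { M, N }, { 1, tile_size } },
    [=]( sycl::nd_item< 2 > item ) [[intel::reqd_sub_group_size( tile_size )]]
    {
        // ... same kernel body as before ...
    } );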
Then, to fulfill condition 2, you can, if needed, pad your matrices before the computation so that K and N are multiples of tile_size, and then extract the appropriate submatrix from the result. I'd recommend doing this if you'd rarely (or never) have to do this padding (in particular, if you can work with padded matrices without padding/unpadding them often). However, if you want your algorithm to directly work for arbitrary matrix sizes, we need to slightly change the kernel, and thus get rid of condition 2, as described below.
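If you go the padding route, a hypothetical pair of helpers (round_up and pad_matrix are my names, not SYCL API) might look like this:

#include <vector>

// Round x up to the next multiple of m (m > 0).
constexpr int round_up( int x, int m ) { return (x + m - 1) / m * m; }

// Copy a rows x cols row-major matrix into a zero-filled
// padded_rows x padded_cols row-major buffer.
std::vector<int> pad_matrix( const int* src, int rows, int cols,
                             int padded_rows, int padded_cols )
{
    std::vector<int> dst( padded_rows * padded_cols, 0 );
    for( int r = 0; r < rows; ++r )
        for( int c = 0; c < cols; ++c )
            dst[r * padded_cols + c] = src[r * cols + c];
    return dst;
}

You would then run the kernel on, e.g., pad_matrix( b.data(), K, N, round_up( K, tile_size ), round_up( N, tile_size ) ) and copy the top-left M x N block out of the padded result.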
To make arbitrary K work (assuming N is still divisible by tile_size, so that the workgroup/subgroup sizes and max workgroup/subgroup sizes are all equal to tile_size), note that on the last iteration of the k-loop there may be (if K isn't divisible by tile_size) fewer than tile_size elements of a left, yet all work items in the subgroup (with their respective i) access this incomplete "tile". This results in an out-of-bounds access when calculating tile_a. Also, there are redundant t-loop iterations, since each of them corresponds to multiplying by one a element in the current tile, which is smaller than tile_size. We can avoid both like this:
for( int k = 0; k < K; k += tile_size )
{
    const int a_tile_size = std::min(tile_size, K - k);
    const int tile_a = i < a_tile_size ? matrix_a[m * K + k + i] : 0;
    for( int t = 0; t < a_tile_size; ++t )
        sum += sycl::group_broadcast( sg, tile_a, t ) *
               matrix_b[(k + t) * N + n];
}
matrix_c[m * N + n] = sum;
With this, to also make arbitrary N work, we have 2 options:

1. Round the nd_range global size up to be divisible by 1 x tile_size, i.e., sycl::nd_range< 2 >{ { M, (N + tile_size - 1) / tile_size * tile_size }, { 1, tile_size } }. Then some n in the kernel are out of bounds for matrices b and c, so we have to defend against that:
for( int k = 0; k < K; k += tile_size )
{
    const int a_tile_size = std::min(tile_size, K - k);
    const int tile_a = i < a_tile_size ? matrix_a[m * K + k + i] : 0;
    for( int t = 0; t < a_tile_size; ++t )
        sum += sycl::group_broadcast( sg, tile_a, t ) *
               (n < N ? matrix_b[(k + t) * N + n] : 0);
}
if (n < N)
    matrix_c[m * N + n] = sum;
2. Keep the nd_range as it is and allow the remainder workgroup size (and thus, equal to it, its sole subgroup size, denoted sg_size below) to be less than tile_size. Then, in these subgroups, we need to handle tiles of a sized sg_size and their remainders (that's why we made the fix for arbitrary K first, otherwise we'd need to rely on K being divisible by sg_size here):
const int sg_size = sg.get_local_range()[0];
for( int k = 0; k < K; k += sg_size )
{
    const int a_tile_size = std::min(sg_size, K - k);
    const int tile_a = i < a_tile_size ? matrix_a[m * K + k + i] : 0;
    for( int t = 0; t < a_tile_size; ++t )
        sum += sycl::group_broadcast( sg, tile_a, t ) * // No UB because t < sg_size
               matrix_b[(k + t) * N + n];
}
matrix_c[m * N + n] = sum;
The changes in these 2 variants are orthogonal, but there's no need to apply them both. For "remainder" columns of b (if any), the work items of the corresponding subgroups (there will be M such subgroups, each corresponding to multiplying 1 row of a by these columns, each column having a dedicated work item) will each execute K iterations of the inner t-loop. In variant 1, these iterations have an additional check for n < N, which will likely cause worse performance, while variant 2 doesn't suffer from this, but requires non-uniform workgroup sizes, which your GPU doesn't support.
Finally, note that if you choose variant 2, there's only 1 thing still making condition 1 a necessity. If we properly define i as the index within the subgroup (i.e., sg.get_local_linear_id()) rather than the index within the workgroup, the code will work for any workgroup and subgroup sizes, but it's best to make the max workgroup size a multiple of the max subgroup size to avoid unnecessary remainder subgroups.
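As a small illustration of that last change (a sketch, assuming the variant 2 kernel above):

// Use the index within the subgroup instead of within the work group,
// so the kernel no longer relies on each work group being a single subgroup.
const int i = sg.get_local_linear_id();
const int sg_size = sg.get_local_range()[0];
// ... k-loop from variant 2 unchanged ...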