parallel-processing gpu nvidia amd-gpu sycl

How to optimize a SYCL kernel


I'm studying SYCL at university and I have a question about the performance of some code. In particular, I have this C/C++ code:

    for (int i = 0; i < size; i++)
        dA[i] = dA[i] + 2;

I need to translate it into a SYCL kernel with parallelization, and I did this:

#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>

using namespace sycl;

constexpr int size = 131072; // 2^17

int main(int argc, char** argv) {
  // Create a vector with size elements and initialize them to 1
  std::vector<float> dA(size, 1.0f);
  try {
    queue gpuQueue{ gpu_selector{} };
    buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
    gpuQueue.submit([&](handler& cgh) {
      accessor inA{ bufA, cgh };
      cgh.parallel_for(range<1>(size),
                       [=](id<1> i) { inA[i] = inA[i] + 2; });
    });
    gpuQueue.wait_and_throw();
  }
  catch (std::exception& e) { throw e; }
}

So my question is about the value c. In this case I use the value 2 directly, but will this impact performance when I run the code? Do I need to create a variable, or is this way correct and the performance good?


Solution

  • Interesting question. In this case the value 2 will be a literal in the instruction in your SYCL kernel - this is as efficient as it gets, I think! There's the slight complication that you have an implicit conversion from int to float. My guess is that you'll end up with a float literal 2.0 in your device assembly. Your SYCL device won't have to fetch that 2 from memory or convert it at runtime or anything like that; it just lives in the instruction.
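    To convince yourself that the implicit conversion is free, here's a plain C++ sketch (no SYCL needed) showing that adding the int literal 2 and adding the float literal 2.0f are equivalent; the variable names are mine, made up for illustration:

    ```cpp
    #include <cassert>

    int main() {
        float x = 1.0f;

        // The int literal 2 is converted to float at compile time,
        // so both additions compile to the same float add.
        float viaIntLiteral   = x + 2;
        float viaFloatLiteral = x + 2.0f;

        assert(viaIntLiteral == viaFloatLiteral); // both are 3.0f
        return 0;
    }
    ```

    The conversion happens entirely at compile time, which is why neither form costs anything extra on the device.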

    Equally, if you had:

    constexpr int c = 2;
    // the rest of your code
    [=](id<1> i) { inA[i] = inA[i] + c; }
    // etc
    

    The compiler is almost certainly smart enough to propagate the constant value of c into the kernel code. So, again, the 2.0 literal ends up in the instruction.

    I compiled your example with DPC++ and extracted the LLVM IR, and found the following lines:

      %5 = load float, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
      %add.i = fadd float %5, 2.000000e+00
      store float %add.i, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
    

    This shows a float load and store to/from the same address, with an 'add 2.0' instruction in between. If I modify the code to use the variable c as I demonstrated, I get identical LLVM IR.
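    In case you want to reproduce this, here's roughly how I extract the device-side LLVM IR with DPC++; treat the exact flags and driver name as a sketch, since they can vary between compiler versions and installations:

    ```shell
    # Compile only the SYCL device code and emit textual LLVM IR.
    # -fsycl-device-only and -emit-llvm are DPC++/Clang flags; adjust
    # the driver name (clang++ vs icpx) for your installation.
    clang++ -fsycl -fsycl-device-only -S -emit-llvm main.cpp -o main.ll

    # Look for the addition in the kernel:
    grep fadd main.ll
    ```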

    Conclusion: you've already achieved maximum efficiency, and compilers are smart!