std::complex in CUDA kernels


CUDA allows constexpr host functions to be called from device code when compiling with --expt-relaxed-constexpr, which makes it possible to use std::complex<double> in CUDA kernels. However, when I do this I get incorrect results, even though everything compiles cleanly.

Below is an example that takes an input array, doubles the values, and prints the result. With std::complex<double> the result is incorrect (here, the values are copied without being doubled). Switching to cuda::std::complex<double> produces the correct result.

#include <complex>
#include <cuda/std/complex>
#include <vector>
#include <iostream>

//using C = cuda::std::complex<double>;
using C = std::complex<double>;

__global__ void kernel(const C* in, C* out) {
  const auto idx = blockIdx.x * blockDim.x + threadIdx.x;
  out[idx] = 2.0 * in[idx];
}

int main() {
  int num_values = 3;

  std::vector<C> in_host(num_values);
  std::vector<C> out_host(num_values);

  C *in_device;
  C *out_device;

  cudaMalloc(&in_device, sizeof(C) * num_values);
  cudaMalloc(&out_device, sizeof(C) * num_values);

  for (int i = 0; i < num_values; ++i)
    in_host[i] = C(i, -i);

  cudaMemcpy(in_device, in_host.data(), sizeof(C) * num_values, cudaMemcpyHostToDevice);

  kernel<<<1, num_values>>>(in_device, out_device);

  cudaMemcpy(out_host.data(), out_device, sizeof(C) * num_values, cudaMemcpyDeviceToHost);

  for (int i = 0; i < num_values; ++i) {
    std::cout << "Expected: " << (2.0 * in_host[i].real()) << ", " << (2.0 * in_host[i].imag())
              << ", actual: " << out_host[i].real() << ", " << out_host[i].imag()
              << std::endl;
  }

  cudaFree(in_device);
  cudaFree(out_device);
}

I compiled the code with nvcc, version 12.3:

nvcc -std=c++20 --expt-relaxed-constexpr complex_kernel.cu

Expected output (expected and actual should be the same):

Expected: 0, 0, actual: 0, 0
Expected: 2, -2, actual: 2, -2
Expected: 4, -4, actual: 4, -4

Actual output (the actual values are NOT doubled and so do not match the expected values):

Expected: 0, 0, actual: 0, 0
Expected: 2, -2, actual: 1, -1
Expected: 4, -4, actual: 2, -2

Now, I can of course switch to cuda::std::complex<double>. But this makes me wary of using --expt-relaxed-constexpr at all. Does anyone have an explanation for this?


Solution

  • All code restrictions applicable to a ``__device__`` function are also applicable to the ``constexpr host``-only function ``H`` that is called from device code. However, compiler may not emit any build time diagnostics for ``H`` for these restrictions.
    CUDA C++ Programming Guide - C++ Language Support - Relaxed Constexpr

    So there seems to be an operation in std::complex that causes a silent error. I think the problem is this snippet from the GNU C++ standard library (libstdc++):

    complex&
    operator*=(double __d)
    {
        _M_value *= __d;
        return *this;
    }
    

    libstdc++'s std::complex<double> specialization is implemented in terms of C complex (_Complex) numbers. Here's a condensed analogue of that specialization:

    template<> struct my_complex<double>
    {
        constexpr my_complex(
            const double __r = 0.0,
            const double __i = 0.0)
        {
            __real__ _M_val = __r;
            __imag__ _M_val = __i;
        }

        constexpr double real() const { return __real__ _M_val; }
        constexpr double imag() const { return __imag__ _M_val; }
        constexpr void real(double val) { __real__ _M_val = val; }
        constexpr void imag(double val) { __imag__ _M_val = val; }

        constexpr my_complex<double>&
        operator*=(double v)
        {
            _M_val *= v;
            return *this;
        }

    private:
        double _Complex _M_val;
    };
    
    template<typename T>
    constexpr inline my_complex<T> operator*(
        const my_complex<T>& lhs,
        const T& rhs)
    {
        my_complex<T> ret = lhs;
        ret *= rhs;
        return ret;
    }
    

    This compiles and, when used in your kernel, reproduces the same bad result. The cause appears to be that the _M_val *= v line is not compiled successfully for device code. If you swap that line for code that manually multiplies the components, it works. I also tested this by altering your kernel:

    __global__ void kernel(const C* in, C* out) {
      const auto idx = blockIdx.x * blockDim.x + threadIdx.x;
      out[idx] = in[idx] * 2.0;
    
      _Complex double my_double;
      __real__ my_double = double(idx);
      __imag__ my_double = -double(idx);
      my_double *= 25;
    
      printf("my_double(%x) = %f + i%f\n", idx, __real__ my_double, __imag__ my_double);
    }
    

    You will get the following error on the my_double *= 25 line:

    <source>(88): error: cannot use an entity undefined in device code
    

    While this is valid (GNU) C++ code outside of a device kernel, CUDA fails to compile it correctly for device code. In summary, the "experimental" in this compiler flag's name still seems well deserved.
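As a sketch of the workaround described above (the struct name and layout here are illustrative, not libstdc++'s actual code), one option is to avoid _Complex storage entirely and scale the components by hand:

```cpp
// Illustrative sketch: same interface as the condensed my_complex above,
// but storing two plain doubles and scaling the components manually, so
// no _Complex arithmetic is emitted at all.
struct my_complex_fixed {
    double re, im;

    constexpr my_complex_fixed(double r = 0.0, double i = 0.0)
        : re(r), im(i) {}

    constexpr double real() const { return re; }
    constexpr double imag() const { return im; }

    // Component-wise scaling instead of `_M_val *= v`.
    constexpr my_complex_fixed& operator*=(double v) {
        re *= v;
        im *= v;
        return *this;
    }
};

constexpr my_complex_fixed operator*(my_complex_fixed lhs, double rhs) {
    lhs *= rhs;  // reuses the component-wise operator*=
    return lhs;
}
```

Alternatively, the _Complex member can be kept and only operator*= rewritten to read and write the parts through __real__/__imag__, which is the minimal change described above.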

    Edit: Since you can get the correct result with explicit __device__ attributes, this is perhaps a compiler bug or a missing feature: if the construct is not supported, the compiler should emit an error in this case.

    Update: C99 _Complex is not standard C++ and is not supported by nvcc's device compiler, even though libstdc++'s std::complex relies on it (thanks @paleonix).
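A practical migration path is to keep std::complex on the host and use cuda::std::complex only in device code. This relies on the C++11 guarantee that std::complex<T> stores exactly a T[2] (real part first), so the bytes moved by cudaMemcpy can be viewed through a layout-compatible device-side complex type. The helper below is a hypothetical, host-only demonstration of that guarantee:

```cpp
#include <complex>
#include <cstring>

// Hypothetical helper: copy the raw storage of a std::complex<double>
// into a double[2]. Since C++11, std::complex<T> is guaranteed to be
// laid out as T[2] (real part at index 0, imaginary part at index 1),
// which is what makes it reasonable to cudaMemcpy host std::complex
// buffers and interpret them with a layout-compatible complex type
// such as cuda::std::complex on the device.
inline void complex_raw_parts(const std::complex<double>& c, double out[2]) {
    static_assert(sizeof(std::complex<double>) == 2 * sizeof(double),
                  "std::complex<double> stores exactly two doubles");
    std::memcpy(out, &c, sizeof(c));
}
```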