I wrote a toy program to test some ideas:
#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/fill.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <iostream>
#include <iterator>
#include <array>
#include <vector>

#define N 20

struct func {
    __host__ __device__
    float operator()(float x) { return x * 2; }
};

template <typename S>
struct O {
    const std::array<float, 2> a;
    O(std::array<float, 2> a) : a(a) {}
    S f;
    __host__ __device__
    float operator()(float &v) {
        std::array<int, 3> b = {2, 3, 4};
        int tmp;
        for (int i = 0; i < 3; i++) {
            tmp = thrust::reduce(thrust::device, b.begin(), b.end(), 0);
            printf("%d", tmp);
        }
        return a[0] * v + a[1] + f(a[0]);
    }
};

int main(void) {
    thrust::host_vector<float> _v1(N);
    thrust::device_vector<float> v1 = _v1, v2;
    thrust::fill(v1.begin(), v1.end(), 12);
    v2.resize(N);
    std::array<float, 2> a{1, 2};
    auto c_itor = thrust::make_counting_iterator(0); // unused here
    thrust::transform(v1.begin(), v1.end(), v2.begin(), O<func>(a));
    thrust::copy(v2.begin(), v2.end(), std::ostream_iterator<float>(std::cout, " "));
}
This code compiles and runs perfectly with nvcc --expt-relaxed-constexpr -std=c++17. As you can see, std containers like std::array appear in several places inside a __host__ __device__ functor. What I want to know is whether this is safe and supported, and what the performance implications are.
The special case of using std::array with C++17 or higher and --expt-relaxed-constexpr works because std::array is a very thin wrapper around a C-style array, and with C++17 all the member functions you used are constexpr. I believe all member functions except std::array::fill and std::array::swap are constexpr as of C++17; those two got the constexpr treatment with C++20.
So for performance considerations your code should perform the same as when using float a[2] and int b[3]. This means that the values are stored in registers if possible (for b this depends on loop unrolling, and generally on register pressure). This is fine as long as you don't go overboard with the size of the arrays. See e.g. this answer for a deeper discussion of arrays, registers and local memory.
For other STL containers using dynamic memory you probably won't be as lucky in terms of member functions being constexpr. The HPC nvc++ compiler (formerly the PGI C++ compiler) does not need __device__ markers, so in theory one can use a lot more STL functionality in device code, but in most cases that is a bad idea in terms of performance. STL functions must also still conform to CUDA's C++ Language Restrictions.
Nvidia is developing its own C++ standard library implementation, with its own device extensions, in libcu++. By now it has a cuda::std::array that can be used in device code without special compiler flags, but no other containers yet. For hash tables there is the cuCollections library (WIP).