Tags: c++, c++11, stdvector, memcpy

std::copy() vs memcpy() with std::vectors and fftw_malloc


I know that when dealing with std::vector it's best to use std::copy(), but I was wondering how to copy an array allocated with fftw_malloc into a std::vector. Should I use std::copy() or memcpy()?

Example code:

static const int nx = 8;
static const int ny = 8;
static const int nyk = ny/2 + 1;
static const int Mx = 3*nx/2, My= 3*ny/2;

double M[2] = {Mx,My};

fftw_complex *array1;
array1 = (fftw_complex*) fftw_malloc(nx*nyk*sizeof(fftw_complex));
memset(array1, 42, nx*nyk*sizeof(fftw_complex));

std::vector<std::complex<double>> array2(Mx*My,0.0); 

std::copy( array1, array1+(nx*nyk) , array2); //std::copy( src, src + size, dest );

I am trying to copy array1 into array2, where array2 is larger than array1, and then I would like to use std::vector operations to append values, and so on.

This doesn't really work, and I am not sure whether my vector syntax is off or std::copy() is simply not the way to go. Should I use memcpy() instead?


Solution

  • These days, it doesn't make much difference whether you use fftw_malloc or regularly allocated memory for FFTW operations (see timing results at the end).

    Having said that, the problem in your code is two-fold. First, the last argument of std::copy needs to be an iterator (e.g. array2.begin()). Second, the value types are incompatible; you are copying fftw_complex into std::complex<double>, which are different types (albeit with the same bitwise representation).

    So you have several options:

    std::copy(reinterpret_cast<std::complex<double>*>(array1), reinterpret_cast<std::complex<double>*>(array1 + nx*nyk), array2.begin());
    

    Which, in turn, can be simplified to

    std::copy_n( reinterpret_cast<std::complex<double>*>(array1), nx*nyk, array2.begin());
    

    If you'd rather not reinterpret_cast, you can transform the data instead:

    std::transform(array1, array1 + nx*nyk, array2.begin(), [](auto const& z) { return std::complex<double>(z[0], z[1]); });
    

    (there is no std::transform_n).

    You can also skip the two-step process and construct the vector directly:

    std::vector<std::complex<double>> array2(
        reinterpret_cast<std::complex<double>*>(array1),
        reinterpret_cast<std::complex<double>*>(array1 + nx*nyk)
    );
    

    Honestly, I would try to make everything work with normal allocation. Besides, what do you gain by allocating with fftw_malloc only to copy immediately into a regular std::vector, possibly losing the alignment properties? (I hope you are copying into the vector after the FFTW operation.)
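    For illustration, here is a minimal sketch of that order of operations: plan and execute on the fftw_malloc'ed buffers first, and copy into a regular std::vector only afterwards. The buffer names and the size n are made up here, and like the rest of the answer this is untested:

    #include <fftw3.h>
    #include <algorithm>
    #include <complex>
    #include <vector>

    int main() {
        int const n = 64;

        // FFTW-allocated (aligned) buffers, used for the transform itself
        fftw_complex* in  = static_cast<fftw_complex*>(fftw_malloc(n*sizeof(fftw_complex)));
        fftw_complex* out = static_cast<fftw_complex*>(fftw_malloc(n*sizeof(fftw_complex)));

        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
        // ... fill `in` here ...
        fftw_execute(p);

        // only now copy the result into a regular std::vector
        std::vector<std::complex<double>> result(n);
        std::copy_n(reinterpret_cast<std::complex<double>*>(out), n, result.begin());

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
    }
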

    The most elegant solution is to use an FFTW-aware allocator: https://gitlab.com/correaa/boost-multi/-/blob/master/include/multi/adaptors/fftw/memory.hpp?ref_type=heads#L27

    using Z = std::complex<double>;
    std::vector<Z, multi::fftw::allocator<Z>> array2(...);
    

    In this way, you don't have to manage any memory, and you keep the benefits of FFTW allocation, if any.
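    For example (a hypothetical usage sketch, assuming the multi::fftw::allocator from the link above and relying on the bitwise compatibility of std::complex<double> and fftw_complex):

    using Z = std::complex<double>;
    std::vector<Z, multi::fftw::allocator<Z>> in (nx*nyk);
    std::vector<Z, multi::fftw::allocator<Z>> out(nx*nyk);

    // FFTW can operate directly on the vectors' (aligned) storage
    fftw_plan p = fftw_plan_dft_1d(
        static_cast<int>(in.size()),
        reinterpret_cast<fftw_complex*>(in.data()),
        reinterpret_cast<fftw_complex*>(out.data()),
        FFTW_FORWARD, FFTW_ESTIMATE
    );
    fftw_execute(p);
    fftw_destroy_plan(p);  // the vectors free their own memory
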


    Finally, std::vector is not even the best container to hold data subject to FFT operations, if nothing else because most interesting FFTs are multidimensional.

    In my array library, FFTW is adapted and can be used in this way:

    #include <multi/adaptors/fftw.hpp>
    #include <multi/array.hpp>
    
    namespace multi = boost::multi;  // from https://gitlab.com/correaa/boost-multi
    
    int main() {
      multi::array<std::complex<double>, 2, multi::fftw::allocator<> > in({Mx, My});
      ...
      multi::array<std::complex<double>, 2, multi::fftw::allocator<> > out({Mx, My});
      multi::fftw::dft_forward({true, true}, in, out);
      ...
    }
    

    (code in this answer is not tested)


    Timing results, fftw_malloc vs normal allocation:

    I measured, and I don't see any benefit in using fftw_malloc vs. new/std::allocator.

    https://gitlab.com/correaa/boost-multi/-/blob/master/include/multi/adaptors/fftw/benchmark/memory.cpp?ref_type=heads#L94

    These tests are for 1D and 2D data, run on an Intel® Core™ i7-9750H × 12 (48 GB) under Ubuntu 23.04, with FFTW version libfftw3-mpi3/lunar 3.3.10-1ubuntu1 amd64 and g++ 13 with -O3 -DNDEBUG.
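    For reference, the comparison boils down to timing the same plan/execute loop over differently allocated buffers. A minimal sketch of that idea (the actual benchmark code is at the link above; the sizes, repetition count, and helper name here are made up):

    #include <fftw3.h>
    #include <algorithm>
    #include <chrono>
    #include <complex>
    #include <iostream>
    #include <vector>

    // time `reps` executions of an n-point forward DFT over the given buffers
    double time_dft(int n, fftw_complex* in, fftw_complex* out, int reps) {
        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);
        // planning with FFTW_MEASURE clobbers the buffers, so initialize the input after
        std::fill_n(reinterpret_cast<std::complex<double>*>(in), n, std::complex<double>{1.0, 0.0});
        auto t0 = std::chrono::steady_clock::now();
        for(int i = 0; i != reps; ++i) fftw_execute(p);
        auto t1 = std::chrono::steady_clock::now();
        fftw_destroy_plan(p);
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        int const n = 1 << 20, reps = 100;

        // fftw_malloc'ed (SIMD-aligned) buffers
        auto* in1  = static_cast<fftw_complex*>(fftw_malloc(n*sizeof(fftw_complex)));
        auto* out1 = static_cast<fftw_complex*>(fftw_malloc(n*sizeof(fftw_complex)));
        std::cout << "fftw_malloc : " << time_dft(n, in1, out1, reps) << " s\n";
        fftw_free(in1); fftw_free(out1);

        // regular std::allocator-backed buffers
        std::vector<std::complex<double>> in2(n), out2(n);
        std::cout << "std::vector : "
                  << time_dft(n, reinterpret_cast<fftw_complex*>(in2.data()),
                                 reinterpret_cast<fftw_complex*>(out2.data()), reps) << " s\n";
    }
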

    [benchmark plot: fftw_malloc vs. standard allocation timings]