c makefile cuda dynamic-parallelism

Compile multiple CUDA files (that use dynamic parallelism) and MPI code


I have a bunch of .cu files that use dynamic parallelism (a.cu, b.cu, c.cu, ..., e.cu, f.cu), and a main.c file that uses MPI to call functions from a.cu on multiple nodes. I'm trying to write a Makefile to build the executable, but I keep getting the following errors:

cudafiles.o: In function `__cudaRegisterLinkedBinary_66_tmpxft_00001a84_00000000_17_cuda_device_runtime_compute_61_cpp1_ii_8b1a5d37':
link.stub:(.text+0x1fb): undefined reference to `__fatbinwrap_66_tmpxft_00001a84_00000000_17_cuda_device_runtime_compute_61_cpp1_ii_8b1a5d37'

Here is my makefile:

INCFILES=-I/usr/local/cuda-8.0/include -I/opt/mpi/mvapich2-gnu/2.2/include -I./
LIBFILES=-L/usr/local/cuda-8.0/lib64 -L/opt/mpi/mvapich2-gnu/2.2/lib
LIBS=-lcudart -lcudadevrt -lcublas_device -lmpi 
ARCH=-gencode arch=compute_60,code=sm_60
NVCC=nvcc -ccbin g++


default: all

all: clean final.o

io.o: io.cpp
        g++ -c -std=c++11  io.cpp 


final.o: io.o a.cu b.cu c.cu d.cu e.cu f.cu main.cpp
        $(NVCC) -std=c++11 $(INCFILES) $(LIBFILES) $(LIBS) -g -G -Xptxas -v -dc $(ARCH) a.cu b.cu c.cu d.cu e.cu f.cu
        $(NVCC) -std=c++11 $(ARCH) $(INCFILES) $(LIBFILES) $(LIBS) -rdc=true -dlink a.o b.o c.o d.o e.o f.o io.o -o cudafiles.o
        mpicxx -O3 $(INCFILES) $(LIBFILES) -c main.cpp -o main.o
        mpicxx $(INCFILES) $(LIBFILES) $(LIBS) cudafiles.o a.o b.o c.o d.o e.o f.o io.o main.o -o exec

clean:
        rm -rf *.o exec

Solution

    1. The original problem reported was an undefined reference to main. This was arising from this line in the Makefile:

      $(NVCC) -std=c++11 $(ARCH) $(INCFILES) $(LIBFILES) $(LIBS) -rdc=true a.o b.o c.o d.o e.o f.o io.o -o cudafiles.o
      

      As written, this tells nvcc to perform the full, final link. The intent of the line, however, was to perform only the device-link step, which is required when compiling with -rdc=true or -dc and not doing the final link with nvcc. Here the final link is done by mpicc/mpicxx. Without -dlink, nvcc attempts the final link and fails because none of the supplied objects contains a main function. Since no final link is intended at this point, the fix is to add the -dlink switch to that line.

    2. I also suggested converting everything to C++-style linkage, since nvcc links that way. It might be possible to mix C-style and C++-style linkage, but that seems like unnecessary trouble. I therefore suggested converting the only .c file (main.c) to a .cpp file, and switching from mpicc to mpicxx.

    3. The next problem that arose was undefined references to, e.g., cudaSetDevice() and cudaFree(). These are part of the CUDA runtime API library ("libcudart"). When nvcc performs the final link, this library is linked automatically. But since the final link is being performed by mpicxx (basically a wrapper around g++), it is necessary to link against that library explicitly with -lcudart.

    4. Finally, the remaining problem was one of link order. In a nutshell, link dependencies are resolved from left to right on the linker command line, so a library must appear after the objects that reference it. Different linkers are more or less picky about this. The final changes were to list the libraries in the correct order and to place them at the end of the link command line, so that any dependencies on them from objects to their left are satisfied.
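
Putting the four points together, a Makefile along these lines is one way to arrange the build. It is only a sketch: the paths and file names are copied from the question, the variables CUDA_HOME, MPI_HOME and CUOBJS are introduced here for illustration, and the library order shown (each library before the ones it depends on) is an assumption that may need adjusting for your toolkit. Recipe lines must start with a tab character.

```makefile
# Paths and file names come from the question; adjust for your system.
CUDA_HOME = /usr/local/cuda-8.0
MPI_HOME  = /opt/mpi/mvapich2-gnu/2.2
INCFILES  = -I$(CUDA_HOME)/include -I$(MPI_HOME)/include -I./
LIBFILES  = -L$(CUDA_HOME)/lib64 -L$(MPI_HOME)/lib
# Assumed order: each library precedes the ones it depends on (point 4).
# -lmpi is omitted because the mpicxx wrapper adds the MPI libraries itself.
LIBS      = -lcublas_device -lcudadevrt -lcudart
ARCH      = -gencode arch=compute_60,code=sm_60
NVCC      = nvcc -ccbin g++
CUOBJS    = a.o b.o c.o d.o e.o f.o

.PHONY: all clean
all: exec

# Host-only C++ sources; main.cpp replaces main.c (point 2).
io.o: io.cpp
	g++ -std=c++11 -c io.cpp -o io.o

main.o: main.cpp
	mpicxx -O3 -std=c++11 $(INCFILES) -c main.cpp -o main.o

# Relocatable device code (-dc), needed for dynamic parallelism.
%.o: %.cu
	$(NVCC) -std=c++11 $(ARCH) $(INCFILES) -dc $< -o $@

# Device-link step only (-dlink), so nvcc does not look for main (point 1).
cudafiles.o: $(CUOBJS)
	$(NVCC) $(ARCH) $(LIBFILES) -dlink $(CUOBJS) -lcublas_device -lcudadevrt -o cudafiles.o

# Final link by mpicxx: objects first, libraries last (points 3 and 4).
exec: cudafiles.o $(CUOBJS) io.o main.o
	mpicxx cudafiles.o $(CUOBJS) io.o main.o $(LIBFILES) $(LIBS) -o exec

clean:
	rm -rf *.o exec
```

Running make builds exec; make clean removes the objects and the executable. Splitting the build into one rule per object, rather than one large final.o recipe, also lets make rebuild only what changed.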