Recently I began extending a very boost-dependent project to use CUDA for its innermost loop. I've been seeing some odd behaviour that seems worth posting about: simply including certain boost headers will cause my first CUDA call to generate a large number of kernel launches.
If I compile and debug the following code (simplestCase.cu):
#include <boost/thread.hpp>

int main(int argc, char **argv){
    int *myInt;
    cudaMalloc(&myInt, sizeof(int));
    return 0;
}
I get the following debug messages as soon as cudaMalloc executes (the same behaviour occurs if I launch a kernel I've defined myself; it seems like anything that triggers context creation will trigger this):
[Launch of CUDA Kernel 0 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 1 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 2 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 3 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 4 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 5 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 6 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 7 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 8 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
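For reference, here is an error-checked variant of the same repro (the CUDA_CHECK macro is just an illustrative helper I'm adding here, not something from my project) that should make it easy to confirm the allocation itself succeeds and no error is reported while those launches appear in the debugger:

#include <boost/thread.hpp>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative helper: abort with a message if a CUDA API call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main(int argc, char **argv){
    int *myInt;
    CUDA_CHECK(cudaMalloc(&myInt, sizeof(int)));  // triggers context creation
    CUDA_CHECK(cudaDeviceSynchronize());          // flush any pending work
    CUDA_CHECK(cudaFree(myInt));
    return 0;
}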
So far I have identified two headers that cause the problem: boost/thread.hpp and boost/mpi.hpp.
Here's a bit of info that may be useful in replicating the problem:
project settings:
I think that's everything.
Edit:
Thank you for bringing to my attention that I hadn't actually asked a question. I knew I was forgetting something critical. My question is this:
It seems odd to me that these very specific includes generate peripheral kernel launches on their own, particularly since I don't use anything from those headers and I don't see how they could affect my interaction with CUDA. Should CUDA be launching this many extra kernels for code I'm not even using? In the project I'm working on now, I see over 100 kernels launched when the only CUDA-related code I have is a single cudaMalloc at the program's entry point.
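My working assumption is that context creation, not cudaMalloc specifically, is the trigger. A minimal sketch of how one could test that assumption (hypothetical; it relies on the common cudaFree(0) idiom for forcing context creation, not on anything from my project):

#include <boost/thread.hpp>
#include <cuda_runtime.h>

int main(){
    // cudaFree(0) is a common idiom to force CUDA context creation
    // without allocating anything. If the memset32_post launches still
    // show up in the debugger here, the trigger is context creation
    // itself rather than cudaMalloc.
    cudaFree(0);
    return 0;
}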
Edit2:
It also happens on a Tesla K20 (a Kepler architecture card, whereas I think the GTX 580 is Fermi).
Edit3:
Updated the CUDA driver to version 319.23. No change in the behaviour mentioned above, but this did fix the debugger issues I was having in larger programs.
Well, still no actual issues have arisen from this, so I suppose it's simply something that happens in the background.