When you print from parallel code (CUDA), all of the threads fire at once.
This makes it very difficult to trace a variable through printed output, because the output arrives essentially out of order.
Instead of:
You get:
You can imagine how difficult this is when you have dozens of threads running tens of lines of code and pushing out literally thousands of lines of output.
So far I've just been printing all of a kernel's variables in one line.
I've already run into problems doing this, which can be seen here.
The only other solution I can think of is to tag each line, output to a text file, and write a program to re-order the data. This would be incredibly time-consuming and processing-heavy.
Is there any better way to print data to trace variables when the prints are firing in parallel? I am using NVIDIA's CUDA toolkit in plain C.
Make sure you include a '\n' at the end of every print statement, and tag every line with the thread info.
I use the following code to do printing.
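// Tag each line with thread (T), warp (W), block (B), and source line.
// ##__VA_ARGS__ is a GNU extension (assumed to be supported by the host
// compiler); it lets println be called with no extra arguments.
#define println(format, ...) \
    printf("T:%02u W:%02u B:%02u Line:%4i " format "\n", \
           threadIdx.x, threadIdx.x / 32, blockIdx.x, __LINE__, ##__VA_ARGS__)
// Note: no trailing ; so that println("x=%d", x); reads as a normal statement.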
If you do your printing in a single printf statement with a trailing \n, CUDA will always print that line separately from the lines of other threads and never intermingle prints.
Do not split one logical line across multiple printf calls, because output from other threads can then be interspersed between the pieces.
If you find that you need more arguments than the 32 - 4 = 28 that you have, then you can format part of the output into a substring first and use a %s argument to paste it into your printf. Or, even better, just print more lines.
This can make your output more verbose, but you can always post-process it. The labeling of thread info is vital, because otherwise the output is useless for debugging.
That said, I would advise against reordering the output in post-processing, because you lose the ordering info (which threads run in what order), which can be very useful when debugging issues.
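One way to post-process without reordering is to filter the captured output down to a single thread while keeping lines in the order they arrived. The following is only a minimal host-side sketch in plain C. It assumes every line starts with the `T:.. W:.. B:.. Line:..` tag produced by the macro above and that no line is longer than the buffer. The names `trace_tag`, `parse_tag`, and `filter_trace` are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Tag written at the start of each line: "T:.. W:.. B:.. Line:.. " */
typedef struct {
    unsigned thread; /* threadIdx.x */
    unsigned warp;   /* threadIdx.x / 32 */
    unsigned block;  /* blockIdx.x */
    int line;        /* __LINE__ of the print call */
} trace_tag;

/* Parse the tag at the start of one output line.
   Returns 1 on success, 0 if the line does not carry a tag. */
int parse_tag(const char *text, trace_tag *tag)
{
    assert(text != NULL && tag != NULL);
    return sscanf(text, "T:%u W:%u B:%u Line:%d",
                  &tag->thread, &tag->warp, &tag->block, &tag->line) == 4;
}

/* Copy only the lines of one (block, thread) pair from in to out,
   preserving arrival order. Returns the number of lines copied. */
long filter_trace(FILE *in, FILE *out, unsigned block, unsigned thread)
{
    char buf[1024]; /* assumption: no trace line is longer than this */
    long kept = 0;
    trace_tag tag;

    while (fgets(buf, sizeof buf, in) != NULL) {
        if (parse_tag(buf, &tag) && tag.block == block && tag.thread == thread) {
            fputs(buf, out);
            ++kept;
        }
    }
    return kept;
}
```

Calling `filter_trace(stdin, stdout, 0, 3)` from a small `main` would print only what thread 3 of block 0 wrote. Because lines are dropped rather than sorted, the relative order of the lines that remain is still the order in which they were emitted.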