Parallel printing makes it very difficult to trace variables because of the order in which threads fire.
For this reason, I print many variables in one line for testing.
I'm trying to print 45 indices from a bool array in printf like so:
printf( "[%hu%hu%hu%hu%hu][%hu%hu%hu%hu%hu][%hu%hu%hu%hu%hu]\n"
"[%hu%hu%hu%hu%hu][%hu%hu%hu%hu%hu][%hu%hu%hu%hu%hu]\n"
"[%hu%hu%hu%hu%hu][%hu%hu%hu%hu%hu][%hu%hu%hu%hu%hu]\n\n",
        u[0],  u[1],  u[2],  u[3],  u[4],
        u[5],  u[6],  u[7],  u[8],  u[9],
        u[10], u[11], u[12], u[13], u[14],
        u[15], u[16], u[17], u[18], u[19],
        u[20], u[21], u[22], u[23], u[24],
        u[25], u[26], u[27], u[28], u[29],
        u[30], u[31], u[32], u[33], u[34],
        u[35], u[36], u[37], u[38], u[39],
        u[40], u[41], u[42], u[43], u[44]);
This is obviously running inside a kernel, and it works, but the output comes out like this:
[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]
[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]
[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]
[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]
[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]
[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]
[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]
[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]
[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]
Each thread's output here is separated by the blank line (from the trailing \n\n).
What I've tried so far:
size_t sz = 1048576ULL * 10000;   // ~10 GB; the ULL suffix avoids 32-bit int overflow
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, sz);
No clue why I'm getting garbage values at the end -- my best guess is the stack itself has run out of memory?
Does anyone know how I can printf 45 arguments inside a cuda kernel?
NOTE: I'm only using the plain C variety, compiling with nvcc and Visual Studio's cl.exe directly
As others have mentioned, there is a limit of 32 arguments for a CUDA device printf.

AFAICT, the only reason you're doing a printf with so many args is that you want a given thread's printf output on one line, and to be able to separate the output of the various threads.
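One way around the argument cap itself, since device code has no sprintf, is to format the digits into a char buffer by hand and emit everything with a single one-argument printf. A minimal sketch, assuming your 45 flags are laid out as three rows of three 5-flag groups (the print_flags helper name is hypothetical):

__device__ void print_flags(const bool *u)
{
    char buf[80];                   // 3 rows x 22 chars, plus terminator and margin
    int p = 0;
    for (int row = 0; row < 3; ++row) {
        for (int grp = 0; grp < 3; ++grp) {
            buf[p++] = '[';
            for (int i = 0; i < 5; ++i)
                buf[p++] = '0' + u[row * 15 + grp * 5 + i];
            buf[p++] = ']';
        }
        buf[p++] = '\n';
    }
    buf[p] = '\0';
    printf("%s\n", buf);            // single argument, one contiguous block of output
}

Because the whole block goes out in one printf call, it also stays contiguous per thread, which is the other half of what you're after.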
There is a workaround of sorts ...
For debug printing in a multiprocessing environment, I've found it helps to associate a message with a timestamp and a thread identifier, and, optionally, a function name and line number.
Caveat: I'm only slightly familiar with cuda in general, but I've done similar debug in the past ...
For CUDA devices, clock64() returns the timestamp counter value, and threadIdx.x is a thread identifier of sorts.
AFAICT, because CUDA doesn't have a usable vprintf for wrapping device-side printf, we should use a macro. Loosely:
Edit: Adjusted per paleonix
#if MY_CUDA_DEBUG
// Prefix each message with [timestamp/block.thread] so output can be
// regrouped per thread later. Standard __VA_ARGS__ (instead of the GNU
// named-args extension) keeps this compilable with cl.exe; note it
// requires at least one argument after the format string.
#define dbgprt(_fmt, ...) \
    printf("[%llu/%u.%u] " _fmt, \
        clock64(), blockIdx.x, threadIdx.x, __VA_ARGS__)
#else
#define dbgprt(_fmt, ...) \
    do { } while (0)
#endif
dbgprt("hello %s\n","world");
The preprocessor output is:
printf("[%llu/%u.%u] " "hello %s\n", clock64(),blockIdx.x,threadIdx.x,"world");
We divert all printf output into a file. Then, after a run, on the host we can use a post-processing script to separate the various messages into per-thread logfiles by parsing the per-line prefix information.
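As a rough illustration (plain host-side C; the thread_%u_%u.log naming is made up, and it assumes the exact [ts/block.thread] prefix the macro produces):

#include <stdio.h>

int main(void)
{
    char line[1024];

    // read the combined log on stdin; route each line to a per-thread file
    while (fgets(line, sizeof(line), stdin) != NULL) {
        unsigned long long ts;
        unsigned blk, tid;

        if (sscanf(line, "[%llu/%u.%u]", &ts, &blk, &tid) == 3) {
            char name[64];
            snprintf(name, sizeof(name), "thread_%u_%u.log", blk, tid);

            FILE *out = fopen(name, "a");   // append preserves per-thread order
            if (out != NULL) {
                fputs(line, out);
                fclose(out);
            }
        }
    }
    return 0;
}

Opening and closing the file on every line is slow but keeps the sketch simple; a real script would cache the open handles.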
Of course, this is only a simple example. Adjust what appears in the [] part of the message as needed.