cuda printf

Printf() 45 arguments in CUDA


Parallel printing makes it very difficult to trace variables because of the order in which threads fire.

For this reason, I print many variables in one line for testing.

I'm trying to print 45 elements of a bool array with a single printf, like so:

printf( "[%hu%hu%hu%hu%hu][%hu%hu%hu%hu%hu][%hu%hu%hu%hu%hu]\n"
        "[%hu%hu%hu%hu%hu][%hu%hu%hu%hu%hu][%hu%hu%hu%hu%hu]\n"
        "[%hu%hu%hu%hu%hu][%hu%hu%hu%hu%hu][%hu%hu%hu%hu%hu]\n\n",
u[0],
u[1],
u[2],
u[3],
u[4],
u[5],
u[6],
u[7],
u[8],
u[9],
u[10],
u[11],
u[12],
u[13],
u[14],
u[15],
u[16],
u[17],
u[18],
u[19],
u[20],
u[21],
u[22],
u[23],
u[24],
u[25],
u[26],
u[27],
u[28],
u[29],
u[30],
u[31],
u[32],
u[33],
u[34],
u[35],
u[36],
u[37],
u[38],
u[39],
u[40],
u[41],
u[42],
u[43],
u[44]);

This is obviously running inside a kernel, and it works, but the output comes out like this:

[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]

[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]

[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]

[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]

[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]

[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]

[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]

[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]

[11111][11111][11111]
[11111][11111][11111]
[1195633005623389][266612392595893005626661][95899589300563005626661]

Each thread's output is separated from the next by a blank line.

What I've tried so far:

    size_t sz = 1048576 * 10000;
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, sz);

No clue why I'm getting garbage values at the end -- my best guess is the stack itself has run out of memory?

Does anyone know how I can printf 45 arguments inside a CUDA kernel?

NOTE: I'm only using the plain C variety, compiling with nvcc and Visual Studio's cl.exe directly.


Solution

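  • As others have mentioned, a CUDA device printf accepts at most 32 arguments in addition to the format string. That matches your output: the third row prints u[30] and u[31] correctly (arguments 31 and 32) and turns to garbage from the 33rd argument on.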

    AFAICT, the only reason you're doing a printf with so many args is that you want a single line per thread, so that you can tell the output of different threads apart.

    There is a workaround of sorts ...

    For debug printing in a multiprocessing environment, I've found it helps to tag each message with a timestamp and a thread identifier and, optionally, a function name and line number.

    Caveat: I'm only slightly familiar with CUDA in general, but I've done similar debugging in the past ...

    On CUDA devices, clock64() returns a per-multiprocessor clock counter that can serve as a timestamp, and blockIdx.x together with threadIdx.x identifies the thread.

    AFAICT, because CUDA device code has no vprintf that takes a va_list, we can't wrap printf in a function, so we should use a macro. Loosely:

    Edit: Adjusted per paleonix

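    #if MY_CUDA_DEBUG
    /* Standard variadic macro (the GNU "_av..." form is not accepted by cl.exe).
       At least one argument must follow _fmt. */
    #define dbgprt(_fmt,...) \
        printf("[%llu/%u.%u] " _fmt, \
            clock64(),blockIdx.x,threadIdx.x,__VA_ARGS__)
    #else
    #define dbgprt(_fmt,...) \
        do { } while (0)
    #endif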
    
    dbgprt("hello %s\n","world");
    

    The preprocessor output is:

    printf("[%llu/%u.%u] " "hello %s\n", clock64(),blockIdx.x,threadIdx.x,"world");
    

    We divert all printf output into a file. Then, after a run, on the host we can use a post-processing script to separate the various messages into per-thread logfiles by parsing the per-line prefix information.
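
    The post-processing step could look roughly like the sketch below, written in C to match the rest of the thread. It assumes every line starts with the "[clock/block.thread] " prefix shown above. The names parse_prefix and split_log and the "thread_B_T.log" file-name pattern are made up for illustration, not part of any CUDA tooling.

    ```c
    #include <stdio.h>

    /* Parse the "[clock/block.thread] " prefix written by dbgprt.
       Returns the offset of the message text, or -1 if the prefix is absent. */
    int parse_prefix(const char *line, unsigned long long *clk,
                     unsigned *block, unsigned *thread)
    {
        int off = -1;
        if (sscanf(line, "[%llu/%u.%u] %n", clk, block, thread, &off) != 3 || off < 0)
            return -1;
        return off;
    }

    /* Append every prefixed line read from `in` to a per-thread file.
       The file-name pattern is an arbitrary choice for this sketch. */
    void split_log(FILE *in)
    {
        char line[1024], name[64];
        unsigned long long clk;
        unsigned block, thread;

        while (fgets(line, sizeof line, in)) {
            if (parse_prefix(line, &clk, &block, &thread) < 0)
                continue;                   /* skip lines without a prefix */
            snprintf(name, sizeof name, "thread_%u_%u.log", block, thread);
            FILE *out = fopen(name, "a");   /* reopened per line: simple, not fast */
            if (!out)
                continue;
            fputs(line, out);
            fclose(out);
        }
    }
    ```

    Calling split_log(stdin) on the redirected output would leave one file per (block, thread) pair, with lines in the order they appeared in the log. Since the clock64() counter is per multiprocessor, timestamps are probably only safe to compare between threads that ran on the same multiprocessor.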

    Of course, this is only a simple example. Adjust what appears in the [] part of the message as needed.
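
    If the goal is simply to keep the original one-line-per-thread layout, another option is to stay under the 32-argument limit by packing each group of five flags into a single number whose decimal digits are the flags, and printing it zero-padded. This is only a sketch: pack5 and print_flags are hypothetical helper names, and it assumes the array holds at least 45 bool elements.

    ```c
    #include <stdio.h>
    #include <stdbool.h>

    /* Let the helpers build as device code under nvcc and as plain C elsewhere. */
    #ifdef __CUDACC__
    #define HOST_DEVICE __host__ __device__
    #else
    #define HOST_DEVICE
    #endif

    /* Pack five flags into a number whose decimal digits are the flags,
       e.g. {1,0,1,1,0} -> 10110. */
    HOST_DEVICE unsigned pack5(const bool *b)
    {
        unsigned v = 0;
        for (int i = 0; i < 5; ++i)
            v = v * 10u + (b[i] ? 1u : 0u);
        return v;
    }

    /* Print 45 flags in the original 3x3 layout using only 9 arguments. */
    HOST_DEVICE void print_flags(const bool *u)
    {
        printf("[%05u][%05u][%05u]\n"
               "[%05u][%05u][%05u]\n"
               "[%05u][%05u][%05u]\n\n",
               pack5(u),      pack5(u + 5),  pack5(u + 10),
               pack5(u + 15), pack5(u + 20), pack5(u + 25),
               pack5(u + 30), pack5(u + 35), pack5(u + 40));
    }
    ```

    Inside the kernel, print_flags(u) would then replace the 45-argument call. The %05u width keeps leading zeros, so a group such as {0,0,1,0,1} still prints as five digits.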