c++cudagpu-cooperative-groups

Should thread_block type be passed by reference?


Question

When passing thread_group type objects to a device function, is there a preference for passing by reference vs passing by value?

Examples

Similar examples in the programming-guide and the developer blog seem to handle this differently.

Programming Guide

__device__
int sum(const thread_block& g, int *x, int n) {
    // ...
    g.sync()
    return total;
}

Developer Blog

__device__
int sum(thread_block block, int *x, int n) {
    ...
    block.sync();
    ...
    return total;
}

Additional Info

The programming-guide also has this to say about constructing implicit groups:

Although you can create an implicit group anywhere in the code, it is dangerous to do so. Creating a handle for an implicit group is a collective operation—all threads in the group must participate. If the group was created in a conditional branch that not all threads reach, this can lead to deadlocks or data corruption. For this reason, it is recommended that you create a handle for the implicit group upfront (as early as possible, before any branching has occurred) and use that handle throughout the kernel. Group handles must be initialized at declaration time (there is no default constructor) for the same reason and copy-constructing them is discouraged.

Which would lead me to believe passing them by reference is preferred, but I will admit there is more than enough detail underlying the various cooperative groups that it's likely I've missed some nuance. Would passing by value be considered "copy-constructing" and therefor be discouraged?

I have not noticed any performance or result difference using either one, but I may have just not tested the correct edge case; or the "undefined-behavior" may just be working out in a way that doesn't cause a problem.


Solution

  • A few observations first:

    1. The blog post you refer to is from 2017 when the feature was previewed, the documentation is current. On that basis alone you should favour the const pass-by-reference idiom because the source is newer.
    2. As you have proved yourself, because CUDA uses a highly stripped back implementation of the C++ object model, and the compiler loves inline function expansion for performance, it is very unlikely that you would find real world cases where the compiler would generate different code for the two cases.

    As I result, I would opine that the const pass-by-reference version is what you should use, both from a C++ language correctness POV, and because the current documentation suggests you should. There are probably corner cases where someone, somewhere, sometime, got burned by copy construction in the pass-by-value version, but I suspect you would have to try very hard for that to happen. Caveat emptor and all of that….