My program have lots of 4-byte-string, like "aaaa" "bbbb" "cccc"... I need to collect particular strings that passes a crc checking.
Because there's very little chance that a string can pass the crc checking, so I don't want to use a very BIG buffer to hold all results. I prefer the result concated one by one, just like the input. For example, if the input is "aaaabbbbcccc" and "bbbb" doesn't pass the crc checking, the output string should be "aaaacccc" and output_count should be 2.
The code looks like:
__device__
bool is_crc_correct(char* str, int len) {
return true; // for simplicity, just return 'true';
}
// arguments:
// input: a sequence of 4-bytes-string, eg: aaaabbbbccccdddd....
__global__
void func(char* input, int* output, int* output_count) {
unsigned int index = blockDim.x*blockIdx.x + threadIdx.x;
if(is_crc_correct(input + 4*index)) {
// copy the string
memcpy(output + (*output_count)*4,
input + 4*index,
4);
// increase the counter
(*output_count)++;
}
}
Obviously the memory copy is not thread safe, I know atomicAdd function can be used for the ++ operation, but how to make both output and output_count thread safe?
What you are looking for is a lock-free linear allocator. The usual way of doing this is by having an atomically-increased accumulator that is used to index into a buffer. For example, in your case, the following should work:
__device__
char* allocate(char* buffer, int* elements) {
// Here, the size of the allocated segment is always 4.
// In a more general use case you would atomicAdd the requested size.
return buffer + atomicInc(elements) * 4;
}
Which can then be used as such:
__global__
void func(char* input, int* output, int* output_count) {
unsigned int index = blockDim.x*blockIdx.x + threadIdx.x;
if(is_crc_correct(input + 4*index)) {
// Reserve the output buffer.
char* dst = allocate(output, output_count);
memcpy(dst, input + 4 * index, 4);
}
}
While this is perfectly thread safe, it is not guaranteed to preserve the input order. For example, "ccccaaaa" would be a valid output.
As Drop has mentioned in their comment, what you are trying to do is effectively a stream compaction (and Thrust already likely already provides what you need).
The code I posted above could be further optimized by first aggregating the output string by warp rather than directly allocating into the global buffer. This would reduce global atomic contention and likely lead to better performance. For an explanation on how to do this, I invite you to read the following article: CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics.