I have some kernel code which calls memcpy(my_dst, my_src, my_num_bytes)
- and sometimes I have my_num_bytes
equal to 0. Strangely enough, some sporadic experimentation (with Titan X, CUDA 7.5, driver 358.16) suggests data does get written into the destination when I use such a call.
memcpy()
doesn't say.This would appear to be a bug in the (current, i.e. CUDA 7.5) implementation of device-side memcpy()
.
A kernel like this:
__global__ void kernel(char* source, char* dst, int len, int sz)
{
int i = threadIdx.x * len;
memcpy(source+i, dst+i, sz);
}
leads the toolchain to emit PTX like this:
// .globl _Z6kernelPcS_ii
.visible .entry _Z6kernelPcS_ii(
.param .u64 _Z6kernelPcS_ii_param_0,
.param .u64 _Z6kernelPcS_ii_param_1,
.param .u32 _Z6kernelPcS_ii_param_2,
.param .u32 _Z6kernelPcS_ii_param_3
)
{
.reg .pred %p<2>;
.reg .b16 %rs<2>;
.reg .b32 %r<4>;
.reg .b64 %rd<15>;
ld.param.u64 %rd7, [_Z6kernelPcS_ii_param_0];
ld.param.u64 %rd8, [_Z6kernelPcS_ii_param_1];
ld.param.u32 %r1, [_Z6kernelPcS_ii_param_2];
cvta.to.global.u64 %rd9, %rd8;
cvta.to.global.u64 %rd10, %rd7;
mov.u32 %r2, %tid.x;
mul.lo.s32 %r3, %r2, %r1;
cvt.s64.s32 %rd11, %r3;
add.s64 %rd1, %rd10, %rd11;
add.s64 %rd2, %rd9, %rd11;
mov.u64 %rd14, 0;
ld.param.s32 %rd3, [_Z6kernelPcS_ii_param_3];
BB6_1:
add.s64 %rd12, %rd2, %rd14;
ld.global.u8 %rs1, [%rd12];
add.s64 %rd13, %rd1, %rd14;
st.global.u8 [%rd13], %rs1;
add.s64 %rd14, %rd14, 1;
setp.lt.u64 %p1, %rd14, %rd3;
@%p1 bra BB6_1;
ret;
}
My reading is that this code will always copy at least one byte because the value of the length argument isn't tested until after the byte copy. Something like this:
BB6_1:
setp.ge.u64 %p1, %rd14, %rd3;
@%p1 bra Done;
add.s64 %rd12, %rd2, %rd14;
ld.global.u8 %rs1, [%rd12];
add.s64 %rd13, %rd1, %rd14;
st.global.u8 [%rd13], %rs1;
add.s64 %rd14, %rd14, 1;
bra BB6_1;
Done:
would probably work as expected.