Why doesn't cuobjdump emit the PTX mul instruction below? Has nvcc optimized the cubin output itself? Is the result calculated at compile time? If so, for this simplest case nvcc could reasonably optimize further and generate no instructions on the device side at all.
mul.cu
#include <cuda_runtime.h>
#include <stdio.h>
__global__ void mul(float *res) {
float x = 11.1, y = 22.2;
*res = x * y;
}
int main() {
float *res;
cudaMallocManaged(&res, sizeof(float));
mul<<<1, 1>>>(res);
cudaDeviceSynchronize();
printf("11.1 * 22.2 = %f\n", *res);
}
Problem
$ nvcc mul.cu -o mul
$ ./mul
11.1 * 22.2 = 246.420013
$ cuobjdump -fun mul -ptx ./mul
...
.visible .entry _Z3mulPf(
.param .u64 _Z3mulPf_param_0
)
{
.reg .b32 %r<2>;
.reg .b64 %rd<3>;
ld.param.u64 %rd1, [_Z3mulPf_param_0];
cvta.to.global.u64 %rd2, %rd1;
mov.u32 %r1, 1131834246;
st.global.u32 [%rd2], %r1;
ret;
}
Is the result calculated at compile-time?
Yes.
The compiler can observe that the result will always be 11.1 × 22.2, so it simply stores that constant (whose float bit pattern, viewed as a decimal integer, is 1131834246) into the result location.
If you want to see the mul instruction, make the multiplication input values be kernel arguments:
#include <cuda_runtime.h>
#include <stdio.h>
__global__ void mul(float *res, float x, float y) {
*res = x * y;
}
int main() {
float *res;
cudaMallocManaged(&res, sizeof(float));
mul<<<1, 1>>>(res, 11.1, 22.2);
cudaDeviceSynchronize();
printf("11.1 * 22.2 = %f\n", *res);
}
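Recompiling this version and running cuobjdump again should show the multiply surviving into PTX. The fragment below is a rough sketch of the expected shape, not verified output; register numbering and surrounding instructions will vary by toolkit version:

```
.visible .entry _Z3mulPfff(...)
{
    ...
    ld.param.f32    %f1, [_Z3mulPfff_param_1];
    ld.param.f32    %f2, [_Z3mulPfff_param_2];
    mul.f32         %f3, %f1, %f2;
    st.global.f32   [%rd2], %f3;
    ret;
}
```

Because the operands now arrive at kernel launch time, the compiler can no longer fold the product into a constant.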
If so, for this simplest case nvcc could reasonably optimize further and generate no instructions on the device side at all.
Any proper optimization must still produce the same result in global state. So in this case it would probably be acceptable to replace the kernel launch with a cudaMemcpy-type operation (or, since it is managed memory, perhaps simply a memcpy or some other memory-setting operation), but I don't think the compiler ever attempts that sort of optimization.