My matrix addition example:
__global__ void matrix_add(float *a, float *b, float *c, int N)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    // Skip threads that fall outside the N x N matrix.
    if (Row < N && Col < N) {
        int index = Row * N + Col; // N is the order of the square matrix
        c[index] = a[index] + b[index];
    }
}
Can I use printf or a similar function in the above kernel, so that I don't need to transfer data from device to host memory (i.e. cudaMemcpyDeviceToHost)? If yes, how? If not, why not?
You can call printf(...) inside a kernel, but only on devices with compute capability 2.x or higher.
You can read more about this in Appendix B.16 of the CUDA programming guide.
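As a rough sketch of how that could look for the kernel in the question: the version below adds a device-side printf after the addition. The kernel name matrix_add_print, the 4 x 4 size, the 2 x 2 block shape, and the input values are all made up for illustration. Note that device printf writes into a fixed-size buffer on the GPU, and that buffer is only flushed to the host's stdout at certain points, such as a call to cudaDeviceSynchronize(). So the explicit cudaMemcpyDeviceToHost for the result goes away, but the text still travels to the host behind the scenes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same addition as in the question, plus a device-side printf per element.
// (matrix_add_print is a hypothetical name used only for this sketch.)
__global__ void matrix_add_print(const float *a, const float *b, float *c, int N)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (Row < N && Col < N) {
        int index = Row * N + Col;
        c[index] = a[index] + b[index];
        // Formatted into a device-side buffer; not shown on the host yet.
        printf("c[%d][%d] = %f\n", Row, Col, c[index]);
    }
}

int main()
{
    const int N = 4;                        // assumed small size for the demo
    const size_t bytes = N * N * sizeof(float);

    float ha[N * N], hb[N * N];
    for (int i = 0; i < N * N; ++i) {       // arbitrary test inputs
        ha[i] = static_cast<float>(i);
        hb[i] = static_cast<float>(2 * i);
    }

    float *da = NULL, *db = NULL, *dc = NULL;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);

    // The inputs still have to be copied to the device.
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    dim3 block(2, 2);                       // assumed block shape
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matrix_add_print<<<grid, block>>>(da, db, dc, N);

    // Wait for the kernel; this also flushes the device printf buffer.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return 0;
}
```

The order of the printed lines across threads is not guaranteed, and if the threads print more than the buffer can hold, older output is overwritten (the buffer size can be raised with cudaDeviceSetLimit and cudaLimitPrintfFifoSize). For that reason this is better suited to debugging small cases than to replacing cudaMemcpy for real results.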