I'm trying to start using Cupy for some Cuda Programming. I need to write my own kernels. However, I'm struggling with 2D kernels. It seems that Cupy does not work the way I expected. Here is a very simple example of a 2D kernel in Numba Cuda:
import cupy as cp
from numba import cuda
@cuda.jit
def nb_add_arrs(x1, x2, y):
i, j = cuda.grid(2)
if i < y.shape[0] and j < y.shape[1]:
y[i, j] = x1[i, j] + x2[i, j]
x1 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
x2 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
y = cp.zeros((5, 5), dtype=cp.int32)
# Grid and block sizes
tpb = (16, 16)
bpg = (x1.shape[0] // tpb[0] + 1, x1.shape[1] // tpb[0] + 1)
# Call kernel
nb_add_arrs[bpg, tpb](x1, x2, y)
The result is, as expected:
y
[[2 2 2 2 2]
[2 2 2 2 2]
[2 2 2 2 2]
[2 2 2 2 2]
[2 2 2 2 2]]
However, when I try to do this simple kernel in Cupy, I don't get the same.
cp_add_arrs = cp.RawKernel(r'''
extern "C" __global__
void add_arrs(const float* x1, const float* x2, float* y, int N){
int i = blockDim.x * blockIdx.x + threadIdx.x;
int j = blockDim.y * blockIdx.y + threadIdx.y;
if(i < N && j < N){
y[i, j] = x1[i, j] + x2[i, j];
}
}
''', 'add_arrs')
x1 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
x2 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
y = cp.zeros((5, 5), dtype=cp.int32)
N = x1.shape[0]
# Grid and block sizes
tpb = (16, 16)
bpg = (x1.shape[0] // tpb[0] + 1, x1.shape[1] // tpb[0] + 1)
# Call kernel
cp_add_arrs(bpg, tpb, (x1, x2, y, cp.int32(N)))
y
[[0 0 0 0 0]
[0 0 0 0 0]
[0 0 0 0 0]
[0 0 0 0 0]
[0 0 0 0 0]]
Can someone help me figure out why?
Memory in C is stored in a row-major-order. So, we need to index following this order. Also, since I'm passing int arrays, I changed the argument types of my kernel. Here is the code:
cp_add_arrs = cp.RawKernel(r'''
extern "C" __global__
void add_arrs(int* x1, int* x2, int* y, int N){
int i = blockDim.x * blockIdx.x + threadIdx.x;
int j = blockDim.y * blockIdx.y + threadIdx.y;
if(i < N && j < N){
y[j + i*N] = x1[j + i*N] + x2[j + i*N];
}
}
''', 'add_arrs')
x1 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
x2 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
y = cp.zeros((5, 5), dtype=cp.int32)
N = x1.shape[0]
# Grid and block sizes
tpb = (16, 16)
bpg = (x1.shape[0] // tpb[0] + 1, x1.shape[1] // tpb[0] + 1)
# Call kernel
cp_add_arrs(bpg, tpb, (x1, x2, y, cp.int32(N)))
y
array([[2, 2, 2, 2, 2],
[2, 2, 2, 2, 2],
[2, 2, 2, 2, 2],
[2, 2, 2, 2, 2],
[2, 2, 2, 2, 2]], dtype=int32)