I have this code for matrix multiplication using PyOpenCL. My problem is that the result is wrong for some matrices, and I don't understand why. After some research I think it's related to the global size or something like that, but I don't understand how to set those values.
For example, with NumPy matrices of dtype float32:
matrix 1:
[[ 0.99114645 0.09327769 0.90075564 0.8913309 ]
[ 0.59739089 0.13906649 0.94246316 0.65673178]
[ 0.24535166 0.68942326 0.41361505 0.5789603 ]
[ 0.31962237 0.17714553 0.49025267 0.21861202]]
matrix 2:
[[ 0.41509482 0.82779616 0.74143827 0.37681136]
[ 0.88058949 0.01039944 0.4342753 0.45752665]
[ 0.60375261 0.21243185 0.88312167 0.97394323]
[ 0.60855824 0.69482827 0.61627114 0.57155776]]
expected result:
[[ 1.57981943 1.63210835 2.12016045 1.80288424]
[ 1.3391085 1.15248911 1.7403561 1.58199609]
[ 1.31099532 0.70041376 1.20338154 1.14162762]
[ 0.71769556 0.52246746 0.88158722 0.8039138 ]]
script result:
[[ 1.20828819 0.73175305 1.64546931 1.42526579]
[ 1.13179159 0.46403384 1.20692348 1.14317513]
[ 1.25328159 0.86723316 1.58679342 1.40186214]
[ 1.35214019 0.6795128 1.73811913 1.48048854]]
script:
def openCL_multiplication(matrix1, matrix2, res):
    import datetime
    import pyopencl as cl
    import numpy as np
    import numpy.linalg as la

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=matrix1)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=matrix2)
    dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, matrix1.nbytes)

    prg = cl.Program(ctx, """
        __kernel void multiplymatrices(const unsigned int size, __global float * matrix1, __global float * matrix2, __global float * res) {
            int i = get_global_id(1);
            int j = get_global_id(0);

            res[i + size * j] = 0;
            for (int k = 0; k < size; k++)
            {
                res[i + size * j] += matrix1[i + size * k] * matrix2[k + size * j];
            }
        }
        """).build()

    t0 = datetime.datetime.now()
    prg.multiplymatrices(queue, matrix1.shape, None, np.int32(len(matrix1)), a_buf, b_buf, dest_buf)

    final_matrix = np.empty_like(matrix1)
    cl.enqueue_copy(queue, final_matrix, dest_buf)
    print(final_matrix)

    delta_t = datetime.datetime.now() - t0
    print('OpenCL Multiplication: ' + str(delta_t))
    return final_matrix
Thank you!
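Well, I think the kernel itself is fine. I can even call the script result correct: it all depends on how you lay out your matrices in memory :-) The kernel indexes the buffers as column-major (element (row, col) lives at row + size * col), but NumPy arrays are row-major by default, so each buffer is effectively read as its transpose. What you get back is matrix2 · matrix1 instead of matrix1 · matrix2; you can check that the first entry of your script result, 1.20828819, is row 0 of matrix 2 times column 0 of matrix 1. If you want your expected result, I'd change this:
    res[i + size * j] += matrix1[i + size * k] * matrix2[k + size * j];
to this:
    res[i + size * j] += matrix1[k + size * j] * matrix2[i + size * k];
That way matrix1 is walked along row j, matrix2 along column i, and the sum lands in row j, column i of the row-major result.
Hope this helps.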
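If you want to check which product a given indexing scheme computes without touching the GPU, you can emulate the flat-buffer arithmetic on the host. The sketch below is only an illustration: `emulate_kernel` is a hypothetical helper (not part of PyOpenCL), it assumes square matrices stored as row-major NumPy arrays, and it uses random test data rather than the matrices from the question. It compares the indexing from the question's kernel with a variant that reads `matrix1[k + size * j]` and `matrix2[i + size * k]`.

```python
import numpy as np


def emulate_kernel(matrix1, matrix2, fixed=False):
    """Replay the kernel's index arithmetic on flat row-major buffers.

    Hypothetical helper for illustration; assumes square float32 matrices.
    """
    size = len(matrix1)
    # ravel() of a C-contiguous array gives the same flat layout that
    # COPY_HOST_PTR would copy into the device buffer.
    a = np.ascontiguousarray(matrix1, dtype=np.float32).ravel()
    b = np.ascontiguousarray(matrix2, dtype=np.float32).ravel()
    res = np.zeros(size * size, dtype=np.float32)

    for i in range(size):          # plays the role of get_global_id(1)
        for j in range(size):      # plays the role of get_global_id(0)
            for k in range(size):
                if fixed:
                    # row j of matrix1 times column i of matrix2
                    res[i + size * j] += a[k + size * j] * b[i + size * k]
                else:
                    # indexing from the question's kernel
                    res[i + size * j] += a[i + size * k] * b[k + size * j]
    return res.reshape(size, size)


rng = np.random.default_rng(0)
m1 = rng.random((4, 4)).astype(np.float32)
m2 = rng.random((4, 4)).astype(np.float32)

original = emulate_kernel(m1, m2)
corrected = emulate_kernel(m1, m2, fixed=True)

print(np.allclose(original, m2.dot(m1)))   # question's indexing vs. matrix2 . matrix1
print(np.allclose(corrected, m1.dot(m2)))  # variant indexing vs. matrix1 . matrix2
```

Both comparisons should come out true: the question's indexing reproduces `matrix2.dot(matrix1)`, and the variant reproduces `matrix1.dot(matrix2)`. The triple loop is slow and only meant for small sanity checks like this one.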