Using multiple GPUs in OpenCL

I have a loop within which I am launching multiple kernels onto a GPU. Below is the snippet:

for (int idx = start; idx <= end ;idx ++) {

            ret = clEnqueueNDRangeKernel(command_queue, memset_kernel, 1, NULL,
                                            &global_item_size_memset, &local_item_size, 0, NULL, NULL);
            ASSERT_CL(ret, "Error after launching 1st memset_kernel !");


            ret = clEnqueueNDRangeKernel(command_queue, cholesky_kernel, 1, NULL,
                                                    &global_item_size_cholesky, &local_item_size, 0, NULL, NULL);
            ASSERT_CL(ret, "Error after launching 1st cholesky_kernel !");


            ret = clEnqueueNDRangeKernel(command_queue, ckf_kernel1, 1, NULL,
                                            &global_item_size_kernel1, &local_item_size, 0, NULL,  NULL);
            ASSERT_CL(ret, "Error after launching ckf_kernel1[i] !");



            clFinish(command_queue);
            ret = clEnqueueNDRangeKernel(command_queue, memset_kernel, 1, NULL,
                                            &global_item_size_memset, &local_item_size, 0, NULL, NULL);
            ASSERT_CL(ret, "Error after launching 2nd memset_kernel !");


            ret = clEnqueueNDRangeKernel(command_queue, cholesky_kernel, 1, NULL,
                                                    &global_item_size_cholesky, &local_item_size, 0, NULL, NULL);
            ASSERT_CL(ret, "Error after launching 2nd cholesky_kernel !");


            ret = clSetKernelArg(ckf_kernel2, 4, sizeof(idx), (void *)&idx);
            ASSERT_CL(ret, "Error setting the idx argument of ckf_kernel2 !");

            ret = clEnqueueNDRangeKernel(command_queue, ckf_kernel2, 1, NULL,
                                            &global_item_size_kernel2, &local_item_size, 0, NULL, NULL);
            ASSERT_CL(ret, "Error after launching ckf_kernel2 !");
}

Now I want to run this code on a system that has multiple GPUs. So far I have completed the following steps:

Created a single context for all the GPUs.
Created one command queue per device.
Created separate kernels for each device (code snippet below, assuming two GPUs).
Allocated separate device buffers for each device.

cl_kernel ckf_kernel1[2];
cl_kernel ckf_kernel2[2];
cl_kernel cholesky_kernel[2];
cl_kernel memset_kernel[2];

// Create one kernel object of each type per device.
ckf_kernel1[0] = clCreateKernel(program, "ckf_kernel1", &ret);
ASSERT_CL(ret, "Cannot load ckf_kernel1[i]!");
ckf_kernel2[0] = clCreateKernel(program, "ckf_kernel2", &ret);
ASSERT_CL(ret, "Cannot load ckf_kernel2!");
memset_kernel[0] = clCreateKernel(program, "memset_zero", &ret);
ASSERT_CL(ret, "Cannot load memset_kernel!");
cholesky_kernel[0] = clCreateKernel(program, "cholesky_kernel", &ret);
ASSERT_CL(ret, "Cannot load cholesky_kernel!");

ckf_kernel1[1] = clCreateKernel(program, "ckf_kernel1", &ret);
ASSERT_CL(ret, "Cannot load ckf_kernel1[i]!");
ckf_kernel2[1] = clCreateKernel(program, "ckf_kernel2", &ret);
ASSERT_CL(ret, "Cannot load ckf_kernel2!");
memset_kernel[1] = clCreateKernel(program, "memset_zero", &ret);
ASSERT_CL(ret, "Cannot load memset_kernel!");
cholesky_kernel[1] = clCreateKernel(program, "cholesky_kernel", &ret);
ASSERT_CL(ret, "Cannot load cholesky_kernel!");

Now, I am not sure how to launch the kernels onto the different devices within the loop. How do I get them to execute in parallel? Please note that there is a clFinish call within the loop above.

Another question: is it standard practice to use multiple threads/processes on the host where each thread/process is responsible for launching kernels on a single GPU?

Solution

You need not create separate contexts for all the devices. You only need to do that if they are from different platforms.
You need not create separate kernels either. You can compile your kernels for multiple devices at the same time (clBuildProgram supports multi-device compilation), and when you launch a kernel on a device, the runtime knows whether the kernel object holds a device binary valid for that device. Argument values are captured at the moment a kernel is enqueued, so one kernel object can serve several devices as long as you set its arguments before each enqueue.
The easiest approach is: create a context, fetch all the devices that you need, place them in an array, use that array for building your program, and create one command queue for every device in it.
clEnqueueNDRangeKernel is non-blocking. The only reason your for loop doesn't dash through is the clFinish() statement. You are most likely using an in-order queue, which means that the single-device case would work fine without clFinish() too.

The general idea for making the best use of multiple GPUs in OpenCL is to create the context, kernels and queues the way I described, and to make the queues out-of-order. That way commands are allowed to execute in parallel if they have no unmet dependencies: e.g. if the input of command2 is not the output of command1, then command2 is free to start executing in parallel with command1. If you use this method, however, you HAVE to use the final few parameters of clEnqueueNDRangeKernel, because you have to build this chain of dependencies using cl_events. Every clEnqueue* call can wait on an array of events that originate from other commands. Execution of a command in the queue will only start once all its dependencies are met.

There is one issue that you have not touched upon, and that is the idea of buffers. If you want to get multi-GPU running, you need to explicitly create buffers for your devices separately, and partition your data. It is not valid to have the same buffer set as an argument on 2 devices while both of them are trying to write to it. At best, the runtime will serialize your work, and the 2 devices will not work in parallel. This is because buffers are handles to memory, and the runtime is responsible for moving the contents of a buffer to the devices that need it. (This can happen implicitly (lazy memory movement), or explicitly if you call clEnqueueMigrateMemObjects.) The runtime is not allowed to give the same buffer with CL_MEM_READ_WRITE or CL_MEM_WRITE_ONLY flags to 2 devices simultaneously.

Even though you know, as the programmer, that the 2 devices might not be writing the same part of the buffer, the runtime does not. You have to tell it. The elegant way is to create 2 sub-buffers (clCreateSubBuffer) that are part of the larger, original buffer; the less elegant way is to simply create 2 buffers. The first approach is better because it makes it easier to collect data from multiple devices back to the host: you only need to fetch the large buffer, and the runtime will know which sub-buffers have been modified on which devices and will take care of collecting the data.
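To make the partitioning step concrete, here is a minimal sketch in plain C of how the parent buffer could be cut into one region per device. It only does the arithmetic and makes no OpenCL calls. The names buffer_region and partition_buffer are hypothetical, invented for this illustration. The struct simply mirrors the origin and size fields of OpenCL's cl_buffer_region so that the sketch compiles without the OpenCL headers, and the alignment value is assumed to be supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's cl_buffer_region (same two fields),
 * redefined here so the sketch compiles without the OpenCL headers. */
typedef struct {
    size_t origin; /* byte offset of the slice inside the parent buffer */
    size_t size;   /* slice length in bytes */
} buffer_region;

/* Split total_bytes into num_devices contiguous, non-overlapping regions.
 * Every non-empty region starts at a multiple of align_bytes. Trailing
 * regions may be shorter than the others, or empty (size 0) when the
 * buffer is small. Returns the number of non-empty regions. */
size_t partition_buffer(size_t total_bytes, size_t num_devices,
                        size_t align_bytes, buffer_region *out)
{
    assert(num_devices > 0 && align_bytes > 0 && out != NULL);

    /* Ceiling division gives an even share; round it up to the alignment. */
    size_t chunk = (total_bytes + num_devices - 1) / num_devices;
    chunk = (chunk + align_bytes - 1) / align_bytes * align_bytes;

    size_t used = 0;
    size_t offset = 0;
    for (size_t d = 0; d < num_devices; ++d) {
        size_t remaining = total_bytes - offset;
        size_t size = remaining < chunk ? remaining : chunk;

        out[d].origin = offset;
        out[d].size = size;
        offset += size;
        if (size > 0) {
            ++used;
        }
    }
    return used;
}
```

With real devices, align_bytes would presumably come from querying CL_DEVICE_MEM_BASE_ADDR_ALIGN through clGetDeviceInfo (the value is reported in bits, so divide by 8) and taking the largest result across the devices in the context, assuming the values are powers of two. Each non-empty region could then be passed to clCreateSubBuffer with CL_BUFFER_CREATE_TYPE_REGION, and an empty region should simply be skipped.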

If I saw your clSetKernelArg calls and the buffers you are using, I could see what the dependencies between your kernels are and write out what you need to do, but I think this is a fairly good start for you in getting multi-device running. Ultimately, it's all about the data. (And start using out-of-order queues, because they have the potential to be faster, and they force you to start using events, which make it explicit to you and anyone reading the code which kernels are allowed to run in parallel.)