opencl, gpu, multiple-gpu

OpenCL kernel START delays when using multiple GPUs?


I have an application that I designed to run on AMD GPUs with OpenCL. I finally got the app running and bug-free (haha) yesterday, targeted at a single GPU. Now that the app works, it's time to scale it to multiple GPUs.

I've read a lot about how to set it up. We're using the single-context, multiple-queue method.

I pull the list of devices, choose 2 of the GPUs, and create a context containing them, then do a single clBuildProgram against both devices as well. Then I create two separate queues.

Pseudocode of the original, working app, now converted to handle 2 GPUs:

context = clCreateContext(0, 2, device_list, NULL, NULL, &ret);
for(x = 0; x < 2; x++)
  queue[x] = clCreateCommandQueue(context, device_list[x], CL_QUEUE_PROFILING_ENABLE, &ret);
clBuildProgram(program, 2, device_list, options, NULL, NULL);

create kernels..

run...
for(outer_loop = 0; outer_loop < 10; outer_loop++) {
  clEnqueueNDRangeKernel(queue[0], kernel_init, 1, offset,           global_size, NULL, 0, NULL, &event[0]);
  clEnqueueNDRangeKernel(queue[1], kernel_init, 1, different_offset, global_size, NULL, 0, NULL, &event[1]);
  clFinish(queue[0]);
  clFinish(queue[1]);

  // get profiling data and printf results
}

That's basically how the code looks. Arguments are set and writes are done BEFORE the loop - the init kernel does not rely on input to start working. After it runs, it DOES do an async_work_group_copy of its generated data to a global buffer.
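For reference, the tail end of a kernel doing that kind of copy looks roughly like this sketch (the buffer name and sizes here are made up, not the actual kernel):

__kernel void kernel_init(__global float *out_buf)
{
    __local float scratch[64];

    /* ... each work-item writes its results into scratch ... */

    barrier(CLK_LOCAL_MEM_FENCE);   // make sure scratch is fully written

    // copy the work-group's results out to the global buffer
    event_t ev = async_work_group_copy(&out_buf[get_group_id(0) * 64],
                                       scratch, 64, 0);
    wait_group_events(1, &ev);
}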

Now, before I modified the code for 2 GPUs, the kernel ran in 27ms (for each loop iteration).

After I modified the code, if I comment out ONE or the OTHER of the 2 kernel runs (the clEnqueueNDRangeKernel and the associated clFinish), they will both run in 27ms.

If I run the code on both GPUs in parallel, I get very odd behavior.

On the first run of the loop, they both execute in about 37-42ms individually. I'm OK with a slight slowdown, as I'm getting twice the work done. But after the first run, one or the other kernel will randomly have a 4-5 SECOND delay between being queued and starting.

Here's the output of my profiling/timing for it. All numbers are in ms.

Q0: til sub:  8.8542  til start: 9.8594 til fin: 47.3749
Q1: til sub:  0.0132  til start: 13.4089 til fin: 39.2364

Q0: til sub:  0.0072  til start: 0.2310 til fin: 37.1187
Q1: til sub:  0.0122  til start: 4152.4638 til fin: 4727.1146

Q0: til sub:  0.0302  til start: 488.6218 til fin: 5049.7233
Q1: til sub:  0.0179  til start: 5023.9310 til fin: 5049.7762

Q0: til sub:  0.0190  til start: 2.0987 til fin: 39.4356
Q1: til sub:  0.0164  til start: 3996.2654 til fin: 4571.5866

Q0: til sub:  0.0284  til start: 488.5751 til fin: 5046.3555
Q1: til sub:  0.0176  til start: 5020.5919 til fin: 5046.4382
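(For reference: the three columns are the deltas from queue time to submit, start, and end, pulled from the event profiling counters - something along these lines, not necessarily my exact code:)

cl_ulong queued, submit, start, end;
clGetEventProfilingInfo(event[0], CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &queued, NULL);
clGetEventProfilingInfo(event[0], CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &submit, NULL);
clGetEventProfilingInfo(event[0], CL_PROFILING_COMMAND_START,  sizeof(cl_ulong), &start,  NULL);
clGetEventProfilingInfo(event[0], CL_PROFILING_COMMAND_END,    sizeof(cl_ulong), &end,    NULL);
// counters are in nanoseconds; convert the deltas to ms
printf("Q0: til sub: %.4f  til start: %.4f til fin: %.4f\n",
       (submit - queued) / 1e6, (start - queued) / 1e6, (end - queued) / 1e6);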

The machine I'm running this on has 5 GPUs in it. Regardless of which two I use, one of the two GPUs (it's not always the same one) gets a 4-5 second delay on starting. Use a single GPU - no delay.

What could be causing this? Any idea? I'm not blocking - the clFinish is only to get profiling info. Even if it were blocking it wouldn't be a 5 second delay.

Also - I thought maybe the writes to global that the kernel was doing might have been part of the delay. I commented the writes out. Nope. No change.

In fact, I added a return; as the first line of the kernel - so it does absolutely nothing. The 40ms dropped to 0.25ms, but the 5 second delay was still there.


Solution

  • The OpenCL driver does not care about what happens inside the kernel: whether it writes/reads, is a null kernel, or only writes to one section of the buffer. It cares about the buffer parameter flags, and it ensures the data is consistent across GPUs, blocking a kernel if it has ANY dependency on another kernel. GPU-to-GPU transfers occur transparently and can be very costly.

    When using multiple GPUs, the hidden data copies and synchronization have to be taken seriously, since they are usually the main bottleneck.

    If your kernels can run in parallel (because GPU1 works on different data than the one on GPU2, and so on...), then you should create different buffers for each GPU. Or, if the data is the same, set the buffer flags (CL_MEM_READ_ONLY/CL_MEM_WRITE_ONLY) properly, to ensure correct OpenCL behaviour with the minimum of copy/consistency operations.
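    At buffer-creation time that means something like this sketch (only A, C and E of the example below are shown; size and host pointers are assumptions, not from the question):

    // shared input that no kernel writes: safe for the driver to replicate to both GPUs
    cl_mem A = clCreateBuffer(context, CL_MEM_READ_ONLY,  size, NULL, &ret);
    // per-GPU outputs: each queue gets its own buffer, so there is no shared state to synchronize
    cl_mem C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, size, NULL, &ret);
    cl_mem E = clCreateBuffer(context, CL_MEM_WRITE_ONLY, size, NULL, &ret);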


    For example, take these two kernels:

    kernel Sum(read_only A, read_only B, write_only C);
    kernel Sum_bad(read_write A, read_write B, write_only C);
    

    If you use a single GPU, both will behave exactly the same, because all the memory resides on the same GPU. But using multiple GPUs can cause terrible problems, for example:

    Queue 1/GPU 1: Sum_bad(A,B,C);
    Queue 2/GPU 2: Sum_bad(A,D,E);
    

    The events will occur as follows:

    1. Memory A,B will be copied to GPU1 memory (if it was not there already). C memory allocated in GPU1.
    2. GPU 1 will run kernel.
    3. Memory A will be copied from GPU1 to GPU2. Memory D will be copied to GPU2. Memory E allocated.
    4. GPU2 will run kernel.

    As you can see, GPU2 has to wait for the first kernel to finish, and additionally wait for all its parameters to be copied over. (Can that be 5s? Maybe, depending on the sizes.)


    However, using the correct approach:

    Queue 1/GPU 1: Sum(A,B,C);
    Queue 2/GPU 2: Sum(A,D,E);
    

    The events will occur as follows:

    1. Memory A,B will be copied to GPU1 memory (if it was not there already). C memory allocated in GPU1.
    2. GPU 1 will run kernel.

    In parallel (because there is no dependency):

    1. Memory A,D will be copied to GPU2 (if it was not there already). Memory E allocated.
    2. GPU2 will run kernel.
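    Applied to the question's init kernel, the fix amounts to something like this host-side sketch (out_gpu0, out_gpu1 and global_size are hypothetical names; each GPU writes only to its own buffer):

    // point the kernel at GPU0's own output buffer, then enqueue on queue 0
    clSetKernelArg(kernel_init, 0, sizeof(cl_mem), &out_gpu0);
    clEnqueueNDRangeKernel(queue[0], kernel_init, 1, offset, global_size, NULL, 0, NULL, &event[0]);

    // same kernel, but GPU1 gets its own output buffer, so no shared writable state
    clSetKernelArg(kernel_init, 0, sizeof(cl_mem), &out_gpu1);
    clEnqueueNDRangeKernel(queue[1], kernel_init, 1, different_offset, global_size, NULL, 0, NULL, &event[1]);

    // with no dependency to enforce, the driver can start both kernels immediately;
    // wait on the events only to read back the profiling data
    clWaitForEvents(2, event);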