I am teaching myself OpenCL in Java using the JogAmp JOCL libraries. One of my tests is creating a Mandelbrot map. I have four implementations: simple serial, parallel using the Java executor interface, OpenCL for a single device, and OpenCL for multiple devices. The first three are fine, the last one is not. When I compare the (correct) output of the single device version with the incorrect output of the multiple device version, I notice that the colors are about the same but that the output of the latter is garbled. I think I understand where the problem resides, but I can't solve it.
The trouble is (imho) that OpenCL uses vector buffers and that I have to translate the output into a matrix. I think this translation is incorrect. I parallelize the code by dividing the Mandelbrot map into rectangles, where the width (xSize) is divided by the number of tasks and the height (ySize) is preserved. I think I am able to transmit that info correctly into the kernel, but translating it back fails.
CLMultiContext mc = CLMultiContext.create (deviceList);
try
{
   CLSimpleContextFactory factory = CLQueueContextFactory.createSimple (programSource);
   CLCommandQueuePool<CLSimpleQueueContext> pool = CLCommandQueuePool.create (factory, mc);
   IntBuffer dataC = Buffers.newDirectIntBuffer (xSize * ySize);
   IntBuffer subBufferC = null;
   int tasksPerQueue = 16;
   int taskCount = pool.getSize () * tasksPerQueue;
   int sliceWidth = xSize / taskCount;
   int sliceSize = sliceWidth * ySize;
   int bufferSize = sliceSize * taskCount;
   double sliceX = (pXMax - pXMin) / (double) taskCount;
   String kernelName = "Mandelbrot";

   out.println ("sliceSize: " + sliceSize);
   out.println ("sliceWidth: " + sliceWidth);
   out.println ("sS*h:" + sliceWidth * ySize);

   List<CLTestTask> tasks = new ArrayList<CLTestTask> (taskCount);
   for (int i = 0; i < taskCount; i++)
   {
      subBufferC = Buffers.slice (dataC, i * sliceSize, sliceSize);
      tasks.add (new CLTestTask (kernelName, i, sliceWidth, xSize, ySize, maxIterations,
                                 pXMin + i * sliceX, pYMin, xStep, yStep, subBufferC));
   } // for
   pool.invokeAll (tasks);

   // submit blocking immediately
   for (CLTestTask task: tasks) pool.submit (task).get ();

   // Ready, read the buffer into the frequencies matrix
   // according to me this is the part that goes wrong
   int w = taskCount * sliceWidth;
   for (int tc = 0; tc < taskCount; tc++)
   {
      int offset = tc * sliceWidth;
      for (int y = 0; y < ySize; y++)
      {
         for (int x = offset; x < offset + sliceWidth; x++)
         {
            frequencies [y][x] = dataC.get (y * w + x);
         } // for
      } // for
   } // for
   pool.release();
The last loop is the culprit, meaning that there is (I think) a mismatch between the kernel encoding and the host translation. The kernel:
kernel void Mandelbrot
(
   const int width,
   const int height,
   const int maxIterations,
   const double x0,
   const double y0,
   const double stepX,
   const double stepY,
   global int *output
)
{
   unsigned ix = get_global_id (0);
   unsigned iy = get_global_id (1);

   if (ix >= width) return;
   if (iy >= height) return;

   double r = x0 + ix * stepX;
   double i = y0 + iy * stepY;
   double x = 0;
   double y = 0;
   double magnitudeSquared = 0;
   int iteration = 0;

   while (magnitudeSquared < 4 && iteration < maxIterations)
   {
      double x2 = x*x;
      double y2 = y*y;
      y = 2 * x * y + i;
      x = x2 - y2 + r;
      magnitudeSquared = x2+y2;
      iteration++;
   }
   output [iy * width + ix] = iteration;
}
The last statement encodes the information into the vector. This kernel is also used by the single device version. The only difference is that in the multi device version I changed width and x0. As you can see in the Java code, I pass xSize / number_of_tasks as width and pXMin + i * sliceX as x0 (instead of pXMin).
I have been working on this for several days now and have removed quite a few bugs, but I can no longer see what I am doing wrong. Help is greatly appreciated.
Edit 1
@Huseyin asked for an image. The first screenshot was computed by the OpenCL single device version.
The second screenshot is from the multi device version, computed with exactly the same parameters.
Edit 2
There was a question about how I enqueue the buffers. As you can see in the code above, I have a List<CLTestTask>
to which I add the tasks and in which the buffer is enqueued. CLTestTask is an inner class; you can find its code below.
final class CLTestTask implements CLTask
{
   CLBuffer clBufferC = null;
   Buffer bufferSliceC;
   String kernelName;
   int index;
   int sliceWidth;
   int width;
   int height;
   int maxIterations;
   double pXMin;
   double pYMin;
   double x_step;
   double y_step;

   public CLTestTask
   (
      String kernelName,
      int index,
      int sliceWidth,
      int width,
      int height,
      int maxIterations,
      double pXMin,
      double pYMin,
      double x_step,
      double y_step,
      Buffer bufferSliceC
   )
   {
      this.index = index;
      this.sliceWidth = sliceWidth;
      this.width = width;
      this.height = height;
      this.maxIterations = maxIterations;
      this.pXMin = pXMin;
      this.pYMin = pYMin;
      this.x_step = x_step;
      this.y_step = y_step;
      this.kernelName = kernelName;
      this.bufferSliceC = bufferSliceC;
   } /*** CLTestTask ***/

   public Buffer execute (final CLSimpleQueueContext qc)
   {
      final CLCommandQueue queue = qc.getQueue ();
      final CLContext context = qc.getCLContext ();
      final CLKernel kernel = qc.getKernel (kernelName);

      clBufferC = context.createBuffer (bufferSliceC);
      out.println (pXMin + " " + sliceWidth);
      kernel
         .putArg (sliceWidth)
         .putArg (height)
         .putArg (maxIterations)
         .putArg (pXMin) // + index * x_step)
         .putArg (pYMin)
         .putArg (x_step)
         .putArg (y_step)
         .putArg (clBufferC)
         .rewind ();
      queue
         .put2DRangeKernel (kernel, 0, 0, sliceWidth, height, 0, 0)
         .putReadBuffer (clBufferC, true);
      return clBufferC.getBuffer ();
   } /*** execute ***/
} /*** Inner Class: CLTestTask ***/
You are creating sub-buffers with

subBufferC = Buffers.slice (dataC, i * sliceSize, sliceSize);

Do they get their memory data laid out as

1  2  3 10 11 12 19 20 21 28 29 30
4  5  6 13 14 15 22 23 24 31 32 33
7  8  9 16 17 18 25 26 27 34 35 36

by using the rectangular copy commands of OpenCL? If so, then you are accessing them out of bounds with

output [iy * width + ix] = iteration;

because width is bigger than sliceWidth, and the kernel writes out of bounds.
If you are not doing rectangular copies or sub-buffers, and are simply taking an offset into the original buffer, then it has a memory layout like

 1  2  3  4  5  6  7  8  9|10 11 12
13 14 15 16 17 18|19 20 21 22 23 24
25 26 27|28 29 30 31 32 33 34 35 36

so the arrays are accessed/interpreted skewed, i.e. indexed against the wrong layout.
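If that second layout is what you have (each slice stored contiguously, in row-major order, inside its own sliceSize region of dataC), then the host read-back loop has to index slice by slice instead of across the full image width. A minimal sketch of such a loop, reusing the names from the question (not tested against your code):

// Hedged sketch: de-interleave the slice-ordered buffer into the full-width matrix.
// Slice tc occupies dataC [tc * sliceSize .. (tc + 1) * sliceSize), stored as
// ySize rows of sliceWidth values each (iy * sliceWidth + ix inside the slice).
for (int tc = 0; tc < taskCount; tc++)
{
   int xOffset = tc * sliceWidth;   // where this slice starts in the full image
   for (int y = 0; y < ySize; y++)
   {
      for (int x = 0; x < sliceWidth; x++)
      {
         frequencies [y][xOffset + x] = dataC.get (tc * sliceSize + y * sliceWidth + x);
      } // for
   } // for
} // for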
You are passing the offset as a kernel parameter, but you could also supply it through the kernel enqueue parameters (the global work offset). Then i and j would start from their true values instead of zero, and you wouldn't need to add x0 or y0 to them in the kernel for every thread; see the sketch below.
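For example, with the put2DRangeKernel overload already used in the question's execute () method, the slice offset can be supplied as the global work offset. This is only a sketch of the enqueue side; it assumes the kernel is then given the full width and pXMin, and that output spans the whole image (otherwise get_global_id (0) has to be rebased before indexing the per-slice buffer):

// Hedged sketch: let OpenCL provide the slice offset via the global work offset.
// Arguments: kernel, globalOffsetX, globalOffsetY, globalSizeX, globalSizeY, localX, localY
queue
   .put2DRangeKernel (kernel, index * sliceWidth, 0, sliceWidth, height, 0, 0)
   .putReadBuffer (clBufferC, true);
// Inside the kernel, get_global_id (0) now starts at index * sliceWidth, so
// "double r = x0 + ix * stepX;" already yields the right coordinate with x0 = pXMin.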
I've written a multi-device API before. It uses multiple buffers, one for each device, all equal in size to the main buffer. They just copy the necessary parts (their own territory) to/from the main (host) buffer, so the kernel calculations stay exactly the same on all devices, using proper global range offsets. The downside is that the main buffer is literally duplicated on every device: if you have 4 GPUs and 1 GB of data, you need 4 GB of buffer area in total. But this way the kernel ingredients are much easier to read, no matter how many devices are being used.
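To make that concrete, here is a rough host-side sketch of the scheme under a few assumptions: context, devices and program are already set up (e.g. via CLContext.create (), context.getDevices () and context.createProgram (programSource).build ()), the work is split 1-D over the height, and the question's kernel and parameter names are reused. It illustrates the idea, it is not drop-in code:

// Hedged sketch of the "full-size buffer per device" scheme described above.
// Every device owns a buffer as large as the whole image; the global work offset
// restricts each device to its own rows; the host copies back only each territory.
int deviceCount = devices.length;
int rowsPerDevice = ySize / deviceCount;          // 1-D split over the height
int [][] frequencies = new int [ySize][xSize];

for (int d = 0; d < deviceCount; d++)
{
   IntBuffer hostCopy = Buffers.newDirectIntBuffer (xSize * ySize);   // duplicated per device
   CLBuffer<IntBuffer> clOutput = context.createBuffer (hostCopy);
   CLCommandQueue queue = devices [d].createCommandQueue ();
   CLKernel kernel = program.createCLKernel ("Mandelbrot");

   // same kernel, same arguments on every device
   kernel
      .putArg (xSize).putArg (ySize).putArg (maxIterations)
      .putArg (pXMin).putArg (pYMin).putArg (xStep).putArg (yStep)
      .putArg (clOutput)
      .rewind ();

   // only the global work offset differs per device
   // (for real overlap, enqueue on all queues first and block/read afterwards)
   queue
      .put2DRangeKernel (kernel, 0, d * rowsPerDevice, xSize, rowsPerDevice, 0, 0)
      .putReadBuffer (clOutput, true);

   // copy only this device's territory into the shared host matrix
   for (int y = d * rowsPerDevice; y < (d + 1) * rowsPerDevice; y++)
      for (int x = 0; x < xSize; x++)
         frequencies [y][x] = hostCopy.get (y * xSize + x);
} // for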
If you allocate only 1/N-sized buffers per device (out of N devices), then you need to copy from the 0th address of each sub-buffer to i * sliceHeight of the main buffer, where i is the device index. Since the arrays are flat, that needs the rectangular buffer copy command of the OpenCL API (clEnqueueCopyBufferRect) for each device. I suspect you are using flat arrays too, together with rectangular copies, and overflowing out of bounds in the kernel. Then I suggest:
If the whole data can't fit in a device, you can try mapping/unmapping the buffer so it doesn't allocate much in the background. Its specification page says:
Multiple command-queues can map a region or overlapping regions of a memory object for reading (i.e. map_flags = CL_MAP_READ). The contents of the regions of a memory object mapped for reading can also be read by kernels executing on a device(s). The behavior of writes by a kernel executing on a device to a mapped region of a memory object is undefined. Mapping (and unmapping) overlapped regions of a buffer or image memory object for writing is undefined.
It doesn't say "non-overlapped mappings for read/write are undefined", so you should be okay having a mapping on each device for concurrent reads/writes on the target buffer. But when used with the USE_HOST_PTR flag (for maximum streaming performance), each sub-buffer may need an aligned starting pointer, which could make it harder to split the area into proper chunks. I use the same whole data array for all devices, so it is not a problem to divide the work, since I can map/unmap any address within an aligned buffer.
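A minimal host-side sketch of that idea, assuming JOCL's CLCommandQueue exposes putMapBuffer/putUnmapMemory and that CLMemory.Mem.USE_BUFFER corresponds to CL_MEM_USE_HOST_PTR (check the JOCL javadoc for the exact overloads before relying on this):

// Hedged sketch: one shared host-backed buffer, mapped per queue instead of copied.
IntBuffer hostData = Buffers.newDirectIntBuffer (xSize * ySize);
CLBuffer<IntBuffer> shared = context.createBuffer (hostData, CLMemory.Mem.USE_BUFFER); // CL_MEM_USE_HOST_PTR

// ... enqueue the kernel for this queue's own region of the image ...

// map for reading once the kernel has finished; each queue maps a non-overlapping region
ByteBuffer mapped = queue.putMapBuffer (shared, CLMemory.Map.READ, true);
// ... copy this device's territory out of 'mapped' (or out of hostData) ...
queue.putUnmapMemory (shared, mapped);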
Here is the 2-device result with a 1-D division (upper part computed by the CPU, lower part by the GPU), and this is the inside of the kernel:
unsigned ix = get_global_id (0)%w2;
unsigned iy = get_global_id (0)/w2;
if (ix >= w2) return;
if (iy >= h2) return;
double r = ix * 0.001;
double i = iy * 0.001;
double x = 0;
double y = 0;
double magnitudeSquared = 0;
int iteration = 0;
while (magnitudeSquared < 4 && iteration < 255)
{
   double x2 = x*x;
   double y2 = y*y;
   y = 2 * x * y + i;
   x = x2 - y2 + r;
   magnitudeSquared = x2+y2;
   iteration++;
}
b [(iy * w2 + ix)] = (uchar4) (iteration/5.0, iteration/5.0, iteration/5.0, 244);
It took 17 ms with an FX8150 (7 cores at 3.7 GHz) + an R7 240 at 700 MHz for a 512x512 image (8 bits per channel + alpha).
Also, having sub-buffers equal in size to the host buffer makes it faster (no re-allocations) to use dynamic ranges rather than static ones (in case of a heterogeneous setup, dynamic turbo frequencies and hiccups/throttles), which helps dynamic load balancing. Combined with the advantage of "same code, same parameters", it doesn't incur a performance penalty. For example, c[i] = a[i] + b[i] would need to become c[i + i0] = a[i + i0] + b[i + i0] to work on multiple devices if all kernels started from zero, which adds extra cycles (apart from the memory bottleneck, and the readability and weirdness of distributing c = a + b that way).