macos opengl opencl nsopenglview

OpenGL / OpenCL Interop Performance in glBindTexture(), glBegin()


I'm working on an OS X app in a multi-GPU setup (Mac Pro late-2013) that uses OpenCL (on the secondary GPU) to generate a texture which is later drawn to the screen with OpenGL (on the primary GPU). The app is CPU-bound due to calls to glBindTexture() and glBegin(), both of which are spending basically all of their time in:

_platform_memmove$VARIANT$Ivybridge

which is a part of the video driver:

AMDRadeonX4000GLDriver

Setup: creates the OpenGL texture (glPixelBuffer) and then its OpenCL counterpart (clPixelBuffer).

cl_int clerror = 0;
GLuint glPixelBuffer = 0;
cl_mem clPixelBuffer = 0;

glGenTextures(1, &glPixelBuffer);
glBindTexture(GL_TEXTURE_2D, glPixelBuffer);
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 2048, 2048, 0, GL_RGBA, GL_FLOAT, NULL);
glBindTexture(GL_TEXTURE_2D, 0);

clPixelBuffer = clCreateFromGLTexture(_clShareGroupContext, CL_MEM_WRITE_ONLY, GL_TEXTURE_2D, 0, glPixelBuffer, &clerror);

Drawing code: maps the OpenGL texture onto the viewport. The entire NSOpenGLView is just this one texture.

glClear(GL_COLOR_BUFFER_BIT);

glBindTexture(GL_TEXTURE_2D, _glPixelBuffer);  // <- spends cpu time here,
glBegin(GL_QUADS);                             // <- and here
glTexCoord2f(0., 0.); glVertex3f(-1.f,  1.f, 0.f);
glTexCoord2f(0., hr); glVertex3f(-1.f, -1.f, 0.f);
glTexCoord2f(wr, hr); glVertex3f( 1.f, -1.f, 0.f);
glTexCoord2f(wr, 0.); glVertex3f( 1.f,  1.f, 0.f);
glEnd();
glBindTexture(GL_TEXTURE_2D, 0);

glFlush();

After gaining control of the texture memory (via clEnqueueAcquireGLObjects()), the OpenCL kernel writes data to the texture and then releases control of it (via clEnqueueReleaseGLObjects()). The texture data should never exist in main memory (if I understand all of this correctly).

My question is: is it expected that so much CPU time is spent in memmove()? Is it indicative of a problem in my code? Or a bug in the driver, perhaps? My (unfounded) suspicion is that the texture data is moving via: GPUx -> CPU/RAM -> GPUy, which I'd like to avoid.


Solution

  • Before I touch on the memory transfer, my first observation is that you're using glBegin(), which is not going to be your best friend because:

    1) Immediate-mode drawing does not work well with the driver, because the vertex data is resubmitted from the CPU on every frame. Use VBOs instead so this data can live on the GPU.

    2) On OS X it means you're in the old compatibility context rather than the new core context. As I understand it, the new context is a complete rewrite, so that is where future optimizations will end up, while the context you're using is (probably) simply being maintained.
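As a rough sketch of point 1, the four corners from the glBegin(GL_QUADS) block can be packed once into an interleaved array and uploaded to a VBO, instead of being resubmitted every frame. The helper name fill_quad_vertices and the x, y, u, v layout are assumptions made for illustration; wr and hr are the texture-coordinate extents from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: 4 vertices, each stored as x, y, u, v. */
enum { QUAD_VERTEX_COUNT = 4, FLOATS_PER_VERTEX = 4 };

/*
 * Fill `out` with the same corners the immediate-mode code emits, reordered
 * for a triangle strip (top-left, bottom-left, top-right, bottom-right),
 * because GL_QUADS is not available in a core profile.
 */
void fill_quad_vertices(float wr, float hr,
                        float out[QUAD_VERTEX_COUNT * FLOATS_PER_VERTEX])
{
    assert(out != NULL);

    const float verts[QUAD_VERTEX_COUNT * FLOATS_PER_VERTEX] = {
        /*  x     y     u    v  */
        -1.f,  1.f, 0.f, 0.f,   /* top-left     */
        -1.f, -1.f, 0.f, hr,    /* bottom-left  */
         1.f,  1.f, wr,  0.f,   /* top-right    */
         1.f, -1.f, wr,  hr,    /* bottom-right */
    };

    for (size_t i = 0; i < QUAD_VERTEX_COUNT * FLOATS_PER_VERTEX; ++i) {
        out[i] = verts[i];
    }
}
```

The array would be uploaded once with glBufferData(GL_ARRAY_BUFFER, ...) using GL_STATIC_DRAW and drawn each frame with glDrawArrays(GL_TRIANGLE_STRIP, 0, 4). In a core profile, a vertex array object and a small shader that samples the texture would also be needed.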

    So, to the memory transfer: on the GL side, are you creating a sync object with glCreateSyncFromCLeventARB() and waiting on it with glWaitSync()? There should be no need for the glFlush() I see in your code. Once you've got rid of the immediate-mode drawing (as mentioned above) and are using sync objects between the two APIs, your host code should be doing nothing except asking the driver to tell the GPU to do things. This will give you your best chance of having speedy buffer copies.

    Yes, copies :( Because your CL texture physically lives in a different GPU's memory from the GL texture, there has to be a copy over the PCIe bus, which will be slow(er). This is what you're seeing in your profiling. What's actually happening is that the CPU is mapping GPU memory A and GPU memory B into pinned host memory and then copying between them, hopefully with DMA. I doubt the data actually touches system memory, so the move is GPUx -> GPUy.
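To get a feel for the cost of that copy, here is a back-of-the-envelope sketch. The texture dimensions come from the glTexImage2D() call in the question. The bytes-per-channel value depends on which internal format the driver picks for the unsized GL_RGBA, and any bandwidth figure passed to transfer_ms() is an assumed number, not a measurement. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of one uncompressed 2D texture level. */
uint64_t texture_bytes(uint32_t width, uint32_t height,
                       uint32_t channels, uint32_t bytes_per_channel)
{
    assert(width > 0 && height > 0 && channels > 0 && bytes_per_channel > 0);
    return (uint64_t)width * height * channels * bytes_per_channel;
}

/* Milliseconds needed to move `bytes` at an assumed sustained bandwidth
 * given in bytes per second. */
double transfer_ms(uint64_t bytes, double bytes_per_second)
{
    assert(bytes_per_second > 0.0);
    return (double)bytes / bytes_per_second * 1000.0;
}
```

For the 2048 x 2048 RGBA texture, texture_bytes(2048, 2048, 4, 1) gives 16 MiB if the driver stores 8 bits per channel, and texture_bytes(2048, 2048, 4, 4) gives 64 MiB if it stores 32-bit floats. If the explanation above is right, that much data has to cross the bus each time the texture is updated.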

    Try putting your CL and GL contexts on the same GPU and I think you'll see your transfer time disappear.

    Final thought: if your CL compute is being dwarfed by the transfer time, it's probably best to stick the contexts on the same GPU. You've got the classic CPU/GPU task split problem.