Bind CUDA output array/surface to GL texture in ManagedCUDA

I'm trying to connect the output of a CUDA program to a GL_TEXTURE_2D for use in rendering. I'm not particular about the output type on the CUDA side (array or surface); I can adapt the program either way.

So the question is, how would I do that? My current code copies the output array to system memory and then uploads it to the GPU again with GL.TexImage2D, which is obviously highly inefficient: when I disable those two pieces of code, it goes from approximately 300 kernel executions per second to a whopping 400.

I already have a little bit of test code, to at least bind a GL texture to CUDA, but I'm not even able to get the device pointer from it...

ctx = CudaContext.CreateOpenGLContext(CudaContext.GetMaxGflopsDeviceId(), CUCtxFlags.SchedAuto);

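uint textureID = (uint)GL.GenTexture(); //create a texture in GL
GL.BindTexture(TextureTarget.Texture2D, textureID); //bind it so the calls below apply to this texture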
GL.TexParameter(TextureTarget.Texture2D, TextureParameterName.TextureMinFilter, (int)TextureMinFilter.Linear);
GL.TexParameter(TextureTarget.Texture2D, TextureParameterName.TextureMagFilter, (int)TextureMagFilter.Linear);
GL.TexImage2D(TextureTarget.Texture2D, 0, PixelInternalFormat.Rgba, width, height, 0, OpenTK.Graphics.OpenGL.PixelFormat.Rgba, PixelType.UnsignedByte, IntPtr.Zero); //allocate memory for the texture in GL without uploading anything

CudaOpenGLImageInteropResource resultImage = new CudaOpenGLImageInteropResource(textureID, CUGraphicsRegisterFlags.WriteDiscard, CudaOpenGLImageInteropResource.OpenGLImageTarget.GL_TEXTURE_2D, CUGraphicsMapResourceFlags.WriteDiscard); //using writediscard because the CUDA kernel will only write to this texture

//then, as far as I understood the ManagedCuda example, I have to do the following when I call my kernel
//(done without a CudaGraphicsInteropResourceCollection because I only have one item)
resultImage.Map();
var ptr = resultImage.GetMappedPointer(); //this crashes
kernelSample.Run(ptr); //pass the pointer to the kernel so it knows where to write
resultImage.UnMap();

The following exception is thrown when attempting to get the pointer:

ErrorNotMappedAsPointer: This indicates that a mapped resource is not available for access as a pointer.

What do I need to do to fix this?

And even once this exception is resolved, there is the other part of my question: how do I work with the acquired pointer in my kernel? Can I use a surface for that, or access it as an arbitrary array (pointer arithmetic)?

Edit: Looking at this example, apparently I don't even need to map the resource every time I call the kernel and the render function. But how would this translate to ManagedCUDA?

Solution

Thanks to the example I found, I was able to translate it to ManagedCUDA (after browsing the source code and fiddling around), and I'm happy to announce that this really does improve my samples per second from about 300 to 400 :)

Apparently a 3D array is required (I haven't seen any overloads in ManagedCUDA that use 2D arrays), but that doesn't really matter: I just use a 3D array/texture that is exactly 1 deep.

id = GL.GenTexture();
GL.BindTexture(TextureTarget.Texture3D, id);
GL.TexParameter(TextureTarget.Texture3D, TextureParameterName.TextureMinFilter, (int)TextureMinFilter.Linear);
GL.TexParameter(TextureTarget.Texture3D, TextureParameterName.TextureMagFilter, (int)TextureMagFilter.Linear);
GL.TexImage3D(TextureTarget.Texture3D, 0, PixelInternalFormat.Rgba, width, height, 1, 0, OpenTK.Graphics.OpenGL.PixelFormat.Bgra, PixelType.UnsignedByte, IntPtr.Zero); //allocate memory for the texture but do not upload anything

CudaOpenGLImageInteropResource resultImage = new CudaOpenGLImageInteropResource((uint)id, CUGraphicsRegisterFlags.SurfaceLDST, CudaOpenGLImageInteropResource.OpenGLImageTarget.GL_TEXTURE_3D, CUGraphicsMapResourceFlags.WriteDiscard);
resultImage.Map();
CudaArray3D mappedArray = resultImage.GetMappedArray3D(0, 0);
resultImage.UnMap();

CudaSurface surfaceResult = new CudaSurface(kernelSample, "outputSurface", CUSurfRefSetFlags.None, mappedArray); //nothing needs to be done anymore - this call connects the 3D array from the GL texture to a surface reference in the kernel

Kernel code:

surface<void, cudaSurfaceType3D> outputSurface;

__global__ void Sample() {
    ...
    surf3Dwrite(output, outputSurface, pixelX, pixelY, 0);
}
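
For anyone wanting a fuller picture of the kernel side, here is a minimal sketch of what the write could look like. The width and height parameters, the placeholder gradient colour, and the extern "C" linkage are my own assumptions for illustration, not part of the code above. One thing worth knowing: surface functions take the x coordinate as a byte offset, so the pixel index is scaled by the element size here. As far as I know, surface references like this were removed in CUDA 12, so this assumes an older toolkit.

```cuda
// Surface reference; the host binds the mapped GL texture array to it by name.
surface<void, cudaSurfaceType3D> outputSurface;

// extern "C" keeps the kernel name unmangled so it can be looked up as "Sample".
// width and height are hypothetical parameters added for the bounds check.
extern "C" __global__ void Sample(int width, int height)
{
    int pixelX = blockIdx.x * blockDim.x + threadIdx.x;
    int pixelY = blockIdx.y * blockDim.y + threadIdx.y;
    if (pixelX >= width || pixelY >= height)
        return; // threads outside the image do nothing

    // Placeholder colour (a simple gradient) standing in for the real result.
    uchar4 output = make_uchar4((unsigned char)(255 * pixelX / width),
                                (unsigned char)(255 * pixelY / height),
                                0, 255);

    // The x coordinate of a surface write is a byte offset, hence the sizeof.
    surf3Dwrite(output, outputSurface, pixelX * (int)sizeof(uchar4), pixelY, 0);
}
```

The uchar4 element type is chosen to match the 8-bit RGBA texture allocated above; a different internal format would need a different element type and byte stride.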