[SOLVED] Need help adding OpenCL(GPU Usage)

I've decided I would rather use the GPU than the CPU, especially since I'm working on a game and I expect the FPS to increase. The thing is, I'm not sure where to start. I can easily add JOCL or JCUDA to the project, but after that I wouldn't know which parts of the code to move from the CPU to the GPU. Help is appreciated :)

Solution

What kind of computations are you after? If they are compute-intensive, such as N-body gravity experiments, you can simply copy the variables to the GPU, compute, and then copy the results back to main memory.

If your objects have a lot of data but little computation per element, as in fluid dynamics or collision detection, you should add interoperability between your graphics API and your compute API. Then you can run the computations without copying any data. (The speed-up is roughly your GPU RAM bandwidth divided by your PCIe bandwidth. For an HD 7870 it is about 25x, provided compute power is not already saturated.)
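The bandwidth ratio mentioned above can be worked out as a back-of-the-envelope calculation. This is a rough sketch: the class name is made up for illustration, and both figures are assumptions rather than measurements (153.6 GB/s is the commonly quoted peak memory bandwidth of an HD 7870, and 6 GB/s is an assumed effective PCIe transfer rate).

```java
/** Back-of-the-envelope estimate; the class name and both input figures are assumptions. */
final class InteropSpeedupEstimate {

    /** Ratio of on-card memory bandwidth to host/device transfer bandwidth. */
    static double speedup(double gpuMemoryGBps, double pcieGBps) {
        if (gpuMemoryGBps <= 0 || pcieGBps <= 0) {
            throw new IllegalArgumentException("bandwidths must be positive");
        }
        return gpuMemoryGBps / pcieGBps;
    }

    public static void main(String[] args) {
        double gpuMemoryGBps = 153.6; // assumed: quoted peak for an HD 7870
        double pcieGBps = 6.0;        // assumed: effective PCIe throughput
        System.out.println("Upper bound on speed-up: " + speedup(gpuMemoryGBps, pcieGBps));
    }
}
```

The result is only an upper bound for workloads that are limited by data movement; a kernel that is already compute-bound will gain less.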

I used JOCL and LWJGL with GL/CL interoperability in Java, and they worked very well.

As an example, a neural network was trained on the CPU (Encog) but run on the GPU (JOCL) to generate a map that is drawn with LWJGL. (The neuron weights are changed a little to add some randomness.)

The very important part is:

1. Start a GL context.
2. Use the GL context's handle variables to start an interoperable CL context.
3. Create the GL buffers.
4. Create the CL buffers with the interoperable CL context.
5. Don't forget to call clFinish() when OpenCL is done and GL is ready to start.
6. Don't forget to call glFinish() when OpenGL is done and CL is ready to start.

A kernel builder/table class and a buffer scheduler class help when you have tens of different kernels and many different buffers shared between GL and CL, and you need them to run in a particular order.
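The per-frame hand-off between GL and CL described in the steps above can be sketched as an ordering model. Everything here is hypothetical: `InteropFrameOrder` and its methods only record their names and call no real bindings. In a real program they would correspond to `glFinish`, `clEnqueueAcquireGLObjects`, the kernel launches, `clEnqueueReleaseGLObjects` and `clFinish`. The acquire/release pair is not listed in the steps above, but it is how OpenCL takes and returns ownership of shared GL objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Ordering model only: each method records its name instead of calling GL or CL. */
final class InteropFrameOrder {
    final List<String> log = new ArrayList<>();

    void glFinish()     { log.add("glFinish"); }            // GL is done with the shared buffers
    void clAcquire()    { log.add("clAcquireGLObjects"); }  // CL takes ownership of them
    void clRunKernels() { log.add("clRunKernels"); }        // physics, fluids, etc.
    void clRelease()    { log.add("clReleaseGLObjects"); }  // CL hands them back
    void clFinish()     { log.add("clFinish"); }            // CL is done, GL may start
    void glDraw()       { log.add("glDraw"); }              // render from the same buffers

    /** One frame: compute on the shared buffers, then draw them, with no host copy. */
    void frame() {
        glFinish();
        clAcquire();
        clRunKernels();
        clRelease();
        clFinish();
        glDraw();
    }

    public static void main(String[] args) {
        InteropFrameOrder order = new InteropFrameOrder();
        order.frame();
        System.out.println(order.log);
    }
}
```

The two finish calls are the synchronization points: neither API touches a shared buffer until the other has completed its work on it.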

Example:

 // clh is a fictional class that binds OpenCL to OpenGL through interoperability
 // registering needed kernels to this object
 clh.addKernel(
               kernelFactory.fluidDiffuse(1024,1024),  // enumeration is fluid1
               kernelFactory.fluidAdvect(1024,1024),   // enumeration is fluid2
               kernelFactory.rigidBodySphereSphereInteracitons(2048,32,32), 
               kernelFactory.fluidRigidBodyInteractions(false), // fluidRigid
               kernelFactory.rayTracingShadowForFluid(true),
               kernelFactory.rayTracingBulletTargetting(true),
               kernelFactory.gravity(G),
               kernelFactory.gravitySphereSphere(), // enumeration is fall
               kernelFactory.NNBotTargetting(3,10,10,2,numBots) // Encog
               );

 clh.addBuffers(
         // enumeration is buf1 and is used as fluid1, fluid2 kernels' arguments
               bufferFactory.fluidSurfaceVerticesPosition(1024,1024, fluid1, fluid2),
        // enumeration is buf2, used by fluid1 and fluid2
               bufferFactory.fluidSurfaceColors(1024,1024,fluid1, fluid2),
        // enumeration is buf3, used by network
               bufferFactory.NNBotTargetting(numBots*25, Encog)
               );

 Running kernels:

 // shortcut of a sequence of kernels
 int[] fluidCalculations = new int[]{fluid1, fluid2, fluidRigid, fluid1};

 clh.run(fluidCalculations); // runs the registered kernels
 // diffuses, advects, sphere-fluid interaction, diffuse again

 //When any update of GPU-buffer from main-memory is needed:

 clh.sendData(cpuBuffer, buf1); // updates fluid surface position from main-memory.
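Since `clh` is fictional, here is a minimal sketch of the kernel-table part of such a helper in plain Java. The name `KernelTable` and its methods are assumptions made for illustration. Each kernel is represented by a `Runnable`; in a real JOCL program the `Runnable` would wrap the kernel launch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical kernel table: registers kernels and runs them in a requested order. */
final class KernelTable {
    private final List<Runnable> kernels = new ArrayList<>();

    /** Registers a kernel and returns its id (its position in registration order). */
    int addKernel(Runnable kernel) {
        kernels.add(kernel);
        return kernels.size() - 1;
    }

    /** Runs the kernels with the given ids, in exactly the given order. */
    void run(int... sequence) {
        for (int id : sequence) {
            if (id < 0 || id >= kernels.size()) {
                throw new IllegalArgumentException("unknown kernel id " + id);
            }
            kernels.get(id).run();
        }
    }

    public static void main(String[] args) {
        KernelTable table = new KernelTable();
        int fluid1 = table.addKernel(() -> System.out.println("diffuse"));
        int fluid2 = table.addKernel(() -> System.out.println("advect"));
        int fluidRigid = table.addKernel(() -> System.out.println("sphere-fluid interaction"));
        // Same sequence as fluidCalculations: diffuse, advect, interaction, diffuse again.
        table.run(fluid1, fluid2, fluidRigid, fluid1);
    }
}
```

Keeping the sequence as a plain array of ids makes it cheap to define several named pipelines over the same set of registered kernels.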

Converting CPU code to OpenCL code can be done automatically by Aparapi, but I'm not sure whether it supports interoperability.

If you need to do it yourself, then it is as easy as:

 From Java:

 for(int i=0;i<numParticles;i++)
 {
     for(int j=0;j<numParticles;j++)
       {

           particle.get(i).calculateAndAddForce(particle.get(j));
       }
 }
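A useful intermediate step is to rewrite the loop over flat `float` arrays first, because OpenCL buffers are flat arrays rather than lists of objects. The sketch below is a CPU version with the same layout (x, y, z per particle) and the same arithmetic as the `nBodyGravity` kernel in this answer. The class and method names are assumptions, and it can also serve as a reference for checking the GPU results.

```java
/** CPU reference with the same data layout and arithmetic as the nBodyGravity kernel. */
final class NBodyReference {

    /** positions and forces hold x, y, z per particle; results are added into forces. */
    static void accumulate(float[] positions, float[] forces) {
        int n = positions.length / 3;
        for (int a = 0; a < n; a++) {       // one iteration per GPU work-item
            float x0 = positions[3 * a];
            float y0 = positions[3 * a + 1];
            float z0 = positions[3 * a + 2];
            float fx = 0.0f, fy = 0.0f, fz = 0.0f;
            for (int i = 0; i < n; i++) {   // inner loop, as inside the kernel
                float dx = x0 - positions[3 * i];
                float dy = y0 - positions[3 * i + 1];
                float dz = z0 - positions[3 * i + 2];
                // 0.01f softens the distance so the self term (i == a) adds zero.
                float r = (float) Math.sqrt(dx * dx + dy * dy + dz * dz + 0.01f);
                float tr = 0.1f / r;
                float tr2 = tr * tr * tr;
                fx += tr2 * dx * 0.0001f;
                fy += tr2 * dy * 0.0001f;
                fz += tr2 * dz * 0.0001f;
            }
            forces[3 * a] += fx;
            forces[3 * a + 1] += fy;
            forces[3 * a + 2] += fz;
        }
    }
}
```

The outer loop is what disappears on the GPU: each work-item handles one value of `a`, obtained from `get_global_id(0)`.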


 To a JOCL kernel string (actually very similar to the body of calculateAndAddForce):

   "__kernel void nBodyGravity(__global float * positions,__global float *forces)" +
                "{" +
                "    int indis=get_global_id(0);" +
                "    int totalN=" + n + "; "+            
                "    float x0=positions[0+3*(indis)];"+
                "    float y0=positions[1+3*(indis)];"+
                "    float z0=positions[2+3*(indis)];"+
                "    float fx=0.0f;" +
                "    float fy=0.0f;" +
                "    float fz=0.0f;" +
                "    for(int i=0;i<totalN;i++)" +
                "    { "+
                "       float x1=positions[0+3*(i)];" +
                "       float y1=positions[1+3*(i)];" +
                "       float z1=positions[2+3*(i)];" +

                "       float dx = x0-x1;" +
                "       float dy = y0-y1;" +
                "       float dz = z0-z1;" +
                "       float r=sqrt(dx*dx+dy*dy+dz*dz+0.01f);" +
                "       float tr=0.1f/r;" +
                "       float tr2=tr*tr*tr;" +
                "       fx+=tr2*dx*0.0001f;" +
                "       fy+=tr2*dy*0.0001f;" +
                "       fz+=tr2*dz*0.0001f;" +

                "    } "+


                "    forces[0+3*(indis)]+=fx; " +
                "    forces[1+3*(indis)]+=fy; " +
                "    forces[2+3*(indis)]+=fz; " +

               "}"