[SOLVED] Equivalent of cudaSetDevice in OpenCL?

I have a function that I wrote for one GPU. It runs for 10 seconds with one set of args, and I have a very long list of args to go through. I would like to use both of my AMD GPUs, so I have some wrapper code that launches 2 threads and runs my function on thread 0 with the argument gpu_idx 0 and on thread 1 with the argument gpu_idx 1.

I have a CUDA version for another machine, and there I just call checkCudaErrors(cudaSetDevice((unsigned int)device_id)); to get my desired behavior.

With OpenCL I have tried the following:

void createDevice(int device_idx)
{
    cl_device_id *devices;
    ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
    HANDLE_CLERROR_G(ret);
    ret = clGetDeviceIDs( platform_id, CL_DEVICE_TYPE_ALL, 0, NULL, &ret_num_devices);
    HANDLE_CLERROR_G(ret);
    devices = (cl_device_id*)malloc(ret_num_devices*sizeof(cl_device_id));
    ret = clGetDeviceIDs( platform_id, CL_DEVICE_TYPE_ALL, ret_num_devices, devices, &ret_num_devices);
    HANDLE_CLERROR_G(ret);
    if (device_idx >= ret_num_devices)
    {
        fprintf(stderr, "Found %i devices but asked for device at index %i\n", ret_num_devices, device_idx);
        exit(1);
    }
    
    device_id = devices[device_idx];
    // usleep(((unsigned int)(500000*(1-device_idx)))); // without this line multithreaded 2 gpu execution does not work.

    context = clCreateContext( NULL, 1, &device_id, NULL, NULL, &ret);
    HANDLE_CLERROR_G(ret);
}

context is a static variable in my .c file that I use again later when I create the kernel.

This code works when I run with only device_idx 0 or only device_idx 1. It even works if I manually run the executable "simultaneously" in two terminal windows, one with device_idx 0 and the other with device_idx 1.

BUT, there is something about the threads being "too" concurrent that prevents this code from working. In fact, depending on the amount of sleep (commented out above), I get different behavior: sometimes both threads do work on gpu 0, sometimes both threads do work on gpu 1, and sometimes the threads are balanced across both gpus. If I sleep for too little time I get CL_INVALID_CONTEXT, and if I don't sleep at all I get CL_INVALID_KERNEL_NAME.

Like I said, I don't get any errors when running on gpu 0 or gpu 1 alone. The errors appear only when I spawn multiple threads that call this code simultaneously (it is built as a .so with an extern C function and called from Go), with device_idx 0 in thread 0 and device_idx 1 in thread 1.

How can I solve my problem? I am attached to the idea of having an executable that works on one GPU, where I specify which GPU to use and that specification is respected.

What is the proper way to pick the device when both devices need to be used, each completely separate from the other?

Solution

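Whoops! Instead of saving device_id into a static variable, I now return it from the function above and use it as a local variable. A static variable is shared by every thread in the process, so the two threads were overwriting each other's value. With a local copy per thread, everything works as expected and is now thread safe.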