Tags: c, gpu, opencl, opencl-c

Safe GPU Programming


I recently learned how to program my AMD GPU using OpenCL in C. However, if I give the GPU a task that is too demanding, my entire system stops working properly and I have to reboot. I am using Linux (more specifically, Manjaro). How do I make sure my program leaves just enough GPU capacity for other applications to keep working? The code only has to run on my machine. This is my current code:

#define CL_TARGET_OPENCL_VERSION 300

#include <CL/cl.h> // Include OpenCL headers
#include <stdio.h>
#include <limits.h>

int main() {
    cl_device_id device;
    cl_context context;
    cl_command_queue queue;
    cl_program program;
    cl_kernel kernel;
    cl_mem buffer;
    // create data
    const int DATA_SIZE = 1000000;
    float data[DATA_SIZE];
    int count;
    for(count = 0; count < DATA_SIZE; count++) data[count] = count;


    // Setup OpenCL
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    queue = clCreateCommandQueueWithProperties(context, device, NULL, NULL);

    // Define our kernel. It just calculates the sin of the input data.
    char *source = {
        "kernel void calcSin(global float *data) {\n"
        "   int id = get_global_id(0);\n"
        "   for (int i = 0; i < 400000; i++) {\n"
        "       data[id] = sin(data[id]);\n"
        "   }\n"
        "}\n"
    };

    // Compile the kernel
    program = clCreateProgramWithSource(context, 1, (const char**)&source, NULL, NULL);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "calcSin", NULL);

    // Create the memory object
    if (context == NULL) {
        printf("context is null\n");
        return 0;
    } else {
        printf("context is not null\n");
    }

    buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(cl_float) * DATA_SIZE, NULL, NULL);

    // Write data to the buffer
    clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, sizeof(float) * DATA_SIZE, data, 0, NULL, NULL);

    // Execute the kernel
    const size_t LENGTH = DATA_SIZE;
    clSetKernelArg(kernel, 0, sizeof(buffer), &buffer);
    size_t global_dimensions[] = {LENGTH,0,0};
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_dimensions, NULL, 0, NULL, NULL);

    // Read back the results
    clEnqueueReadBuffer(queue, buffer, CL_FALSE, 0, sizeof(cl_float)*LENGTH, data, 0, NULL, NULL);

    // Wait for everything to finish
    clFinish(queue);

    // Print the result
    // printf("Array of integers:\n");
    // for (int i = 0; i < DATA_SIZE; i++) {
    //     printf("%.2f ", data[i]);
    // }
    // printf("\n");

    // Clean up
    clReleaseMemObject(buffer);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    return 0;
}


When I increase the number of sin() operations each work-item performs from 4*10^5 to 10^6, my computer needs a reboot. After that reboot, running journalctl -r -b -1 -p 3 displays two errors, both starting with apr 19 18:26:22 [my username] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*.

I tried running the program with a higher niceness value through nice -n 19 ./path/to/file. However, that did not solve the issue.


Solution

  • The problem is that your kernel contains a looooong loop of 400000 iterations that runs serially in every single one of the 1M GPU threads. Each thread computes x = sin(sin(sin(sin(...sin(x)...)))), nested 400000 deep, which drives every value toward 0 anyway (see the short CPU sketch below), and in between each sin() you're writing the value back to VRAM, which adds extra slowness. And all of that for 1M elements in parallel.

    The runtime of such a long serial loop full of slow trigonometric functions is in the minutes. Your GPU runs maybe 5000 of the 1M threads at a time, so the total kernel runtime is minutes × 200 = several hours.

    Of course this will lock up your system. ;)
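
    To see why the nested sin() calls are pointless, here is a small, purely illustrative CPU sketch (plain C, not part of your OpenCL program): iterating sin() a few hundred thousand times squashes any starting value down to nearly 0.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double x = 123.0;                  // arbitrary starting value
        for (int i = 0; i < 400000; i++) {
            x = sin(x);                    // the same nesting your kernel performs per element
        }
        printf("%g\n", x);                 // prints a value very close to 0
        return 0;
    }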

    GPU parallelization = split the problem up into as many individual threads as possible. What on the CPU looks like a loop

    for(uint id=0; id<1000000; id++) {
        x[id] = sin(x[id]);
    }
    

    on the GPU becomes this kernel

    kernel void sin_kernel(global float* x) {
        const uint id = get_global_id(0);
        x[id] = sin(x[id]);
    }
    

    where all iterations are computed concurrently and in no guaranteed order. That means: every GPU thread computes only a single iteration of the loop, namely x[id] = sin(x[id]). Remove the loop from your kernel. The kernel is already parallelized over the 1M elements; that range is passed as the global range to the kernel on the host side.
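
    On the host side this is exactly what your program already does: the 1M-element range goes to clEnqueueNDRangeKernel as the global work size, so each element gets its own work-item. A minimal sketch of that call, reusing the variable names from your code:

    // One work-item per element: the global work size is the element count.
    size_t global_size = DATA_SIZE;   // 1000000
    clSetKernelArg(kernel, 0, sizeof(buffer), &buffer);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    clFinish(queue);                  // wait for the (now fast) kernel to complete

    With the loop removed, the kernel should finish in a fraction of a second, far below the driver's job timeout, so the [drm:amdgpu_job_timedout] resets in your journal should no longer occur.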