I recently learned how to program my AMD GPU using OpenCL in C. However, if I give the GPU a task that is too demanding, my entire system stops responding and I have to reboot. I am using Linux (more specifically, Manjaro). How do I make sure my program still leaves just enough GPU capacity for other applications? The code only has to run on my machine. This is my current code:
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h> // Include OpenCL headers
#include <stdio.h>
#include <limits.h>
int main() {
    cl_device_id device;
    cl_context context;
    cl_command_queue queue;
    cl_program program;
    cl_kernel kernel;
    cl_mem buffer;

    // create data
    const int DATA_SIZE = 1000000;
    float data[DATA_SIZE];
    int count;
    for (count = 0; count < DATA_SIZE; count++) data[count] = count;

    // Setup OpenCL
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    queue = clCreateCommandQueueWithProperties(context, device, NULL, NULL);

    // Define our kernel. It just calculates the sin of the input data.
    char *source = {
        "kernel void calcSin(global float *data) {\n"
        "    int id = get_global_id(0);\n"
        "    for (int i = 0; i < 400000; i++) {\n"
        "        data[id] = sin(data[id]);\n"
        "    }\n"
        "}\n"
    };

    // Compile the kernel
    program = clCreateProgramWithSource(context, 1, (const char**)&source, NULL, NULL);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "calcSin", NULL);

    // Create the memory object
    if (context == NULL) {
        printf("context is null\n");
        return 0;
    } else {
        printf("context is not null\n");
    }
    buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(cl_float) * DATA_SIZE, NULL, NULL);

    // Write data to the buffer
    clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, sizeof(float) * DATA_SIZE, data, 0, NULL, NULL);

    // Execute the kernel
    const size_t LENGTH = DATA_SIZE;
    clSetKernelArg(kernel, 0, sizeof(buffer), &buffer);
    size_t global_dimensions[] = {LENGTH, 0, 0};
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_dimensions, NULL, 0, NULL, NULL);

    // Read back the results
    clEnqueueReadBuffer(queue, buffer, CL_FALSE, 0, sizeof(cl_float) * LENGTH, data, 0, NULL, NULL);

    // Wait for everything to finish
    clFinish(queue);

    // Print the result
    // printf("Array of integers:\n");
    // for (int i = 0; i < DATA_SIZE; i++) {
    //     printf("%.2f ", data[i]);
    // }
    // printf("\n");

    // Clean up
    clReleaseMemObject(buffer);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    return 0;
}
By increasing the number of sin() operations each core performs from 4*10^5 to 10^6, I make my computer hang until I reboot it. After that reboot, running

journalctl -r -b -1 -p 3

displays two errors, both starting with

apr 19 18:26:22 [my username] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*

I tried running the program with a higher niceness value through

nice -n 19 ./path/to/file

However, that did not solve the issue.
The problem is that your kernel contains a looooong loop of 400000 iterations that every work-item on your GPU executes serially, and redundantly so for all 1M elements in parallel. You are computing x = sin(sin(sin(sin(...sin(x)...)))), 400000 times, which always yields a result of (almost exactly) 0, and between each sin() you overwrite the value in VRAM, which adds extra slowness.

The runtime of such a long serial loop full of slow trigonometric functions is in the minutes. Your GPU runs maybe 5000 of the 1M threads at a time, so the total runtime of the kernel is minutes × 200 = several hours. Of course this will lock up your system. ;)
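To make the redundancy concrete, here is a minimal plain-C sketch of what every single one of your 1M work-items computes (the helper name iterate_sin is made up for illustration):

#include <math.h>

// What one work-item does in the kernel above: 400000 chained sin() calls
// on a single value. Iterated sin decays toward 0, so almost all of this
// serial work produces (nearly) the same tiny number.
float iterate_sin(float x, int n) {
    for (int i = 0; i < n; i++) {
        x = sinf(x); // single-precision, like float sin() in OpenCL C
    }
    return x;
}

Calling iterate_sin(x, 400000) once per element is exactly the workload your kernel assigns to each of the 1M threads.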
GPU parallelization means splitting the problem up into as many individual threads as possible. What on the CPU looks like this loop
for (uint id = 0; id < 1000000; id++) {
    x[id] = sin(x[id]);
}
on the GPU becomes this kernel
kernel void sin_kernel(global float* x) {
    const uint id = get_global_id(0);
    x[id] = sin(x[id]);
}
where all iterations are computed concurrently and in no guaranteed order. That means every GPU thread computes only a single iteration of the loop, namely x[id] = sin(x[id]). So remove the loop from your kernel entirely. The kernel is already parallelized over the 1M elements; this range is passed as the global range to the kernel on the host side.
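For completeness, a minimal host-side sketch of that dispatch (reusing the variable names from your original code; error handling omitted, and building the program is assumed to have happened as before):

// Build "sin_kernel" instead of "calcSin", then launch one work-item per element.
kernel = clCreateKernel(program, "sin_kernel", NULL);
clSetKernelArg(kernel, 0, sizeof(buffer), &buffer);
size_t global_size[] = {1000000}; // the global range replaces the CPU loop
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_size, NULL, 0, NULL, NULL);
clFinish(queue); // blocks until all 1M work-items have finished

Each work-item then runs for only a handful of instructions, so the kernel finishes in milliseconds instead of hours and the driver's job timeout never triggers.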