Tags: opencl, opencl-c

OpenCL Kernel slow and doesn't utilise CPU fully


I tried to do an old advent of code problem in OpenCL, but it's very slow.

/*
 * OpenCL C source for the part B kernel, embedded as a host-side C string.
 *
 * Review fixes:
 *  - Pointers that alias __global buffers (layer_ptr, map_ptr) must carry
 *    the __global address-space qualifier: OpenCL C 1.x has no generic
 *    address space, so assigning a __global pointer to an (implicitly
 *    __private) pointer is ill-formed on strict OpenCL 1.2 compilers.
 *  - Removed the unused local `count`.
 *  - Hoisted the loop-invariant range end (seed_ranges[2i] + seed_ranges[2i+1])
 *    out of the seed-loop condition so it is not re-read every iteration.
 */
const char *KernelSource_part_b = "\n" \
"/* OpenCL guarantees unsigned long is 64-bit; alias the host-side name. */\n" \
"typedef unsigned long uint64_t;                                           \n" \
"                                                                          \n" \
"__kernel void kernel_part_b(                                              \n" \
"__global uint64_t* seed_ranges,         /* pairs of (start, length)     */\n" \
"const uint64_t num_seed_ranges,                                           \n" \
"__global uint64_t* seed_map_layer_sizes,/* #maps per layer              */\n" \
"__global uint64_t* flat_seed_map_layers,/* triples (src, dst, size)     */\n" \
"const uint64_t num_seed_map_layers,                                       \n" \
"__global uint64_t* results              /* per-range minimum, by index  */\n" \
")                                                                         \n" \
"{                                                                         \n" \
"    uint64_t index = get_global_id(0);                                    \n" \
"                                                                          \n" \
"    if (index >= num_seed_ranges) return;                                 \n" \
"                                                                          \n" \
"    uint64_t min_value = ULONG_MAX;                                       \n" \
"    uint64_t seed_start = seed_ranges[index * 2];                         \n" \
"    uint64_t seed_end = seed_start + seed_ranges[(index * 2) + 1];        \n" \
"    for (uint64_t seed = seed_start; seed < seed_end; seed++)             \n" \
"    {                                                                     \n" \
"        uint64_t seed_val = seed;                                         \n" \
"        /* Walk each layer; pointers into __global data must be __global. */\n" \
"        __global uint64_t *layer_ptr = flat_seed_map_layers;              \n" \
"        for (uint64_t sml_index = 0;                                      \n" \
"             sml_index < num_seed_map_layers;                             \n" \
"             sml_index++)                                                 \n" \
"        {                                                                 \n" \
"            __global uint64_t *map_ptr = layer_ptr;                       \n" \
"            uint64_t num_maps = seed_map_layer_sizes[sml_index];          \n" \
"            for (uint64_t ml_index = 0; ml_index < num_maps; ml_index++)  \n" \
"            {                                                             \n" \
"                uint64_t source = map_ptr[0];                             \n" \
"                uint64_t target = map_ptr[1];                             \n" \
"                uint64_t size = map_ptr[2];                               \n" \
"                /* First matching map wins for this layer. */             \n" \
"                if((seed_val >= source) && (seed_val < (source + size)))  \n" \
"                {                                                         \n" \
"                    seed_val = seed_val - source + target;                \n" \
"                    break;                                                \n" \
"                }                                                         \n" \
"                map_ptr += 3;                                             \n" \
"            }                                                             \n" \
"            layer_ptr += (num_maps * 3);                                  \n" \
"        }                                                                 \n" \
"        if (seed_val < min_value) min_value = seed_val;                   \n" \
"    }                                                                     \n" \
"                                                                          \n" \
"    results[index] = min_value;                                           \n" \
"}                                                                         \n" \
"\n";

The code and a devcontainer (which should just run) is available here.

I've written this same algorithm using OpenMP, Rust and Python (utilising some of the multiprocessing libraries), and they all peg the CPU at 100% utilisation on all cores, but this OpenCL example doesn't, and I don't know how to start debugging why. On my 13850HX it takes 1.5 minutes for OpenCL to compute the answer, but the OpenMP and Rust versions take seconds. From top it looks like it's only using 1 CPU core.

The enqueue is called with the number of CPU cores the system has available.

#define NUM_CORES 28

...
Ranges are split into NUM_CORES entries to allocate one range to each core
...

/* One work-item per seed range. local_work_size is passed as NULL, letting
 * the OpenCL runtime choose the work-group partitioning itself.
 * NOTE(review): the observed single-core utilisation suggests this CPU
 * driver serialises the work when the local size is left implicit —
 * confirm against the driver's documentation. */
uint64_t global = num_seed_ranges;
        error = clEnqueueNDRangeKernel(commands, ko_part_b, 1, NULL, &global, NULL, 0, NULL, NULL);

Performance numbers

vscode ➜ .../2023/day05/c/build-x86 (main) $ time ./day5 -i ../../full_data.txt -r part_b
2023 - Day 5
Running Part B
Result is: 79004094

real    4m5.865s
user    4m5.999s
sys     0m0.001s

vscode ➜ .../2023/day05/c/build-x86 (main) $ time ./day5 -i ../../full_data.txt -r part_b_openmp
2023 - Day 5
Running Part B OpenMP
Result is: 79004094

real    0m23.448s
user    7m55.145s
sys     0m0.006s

vscode ➜ .../2023/day05/c/build-x86 (main) $ time ./day5 -i ../../full_data.txt -r part_b_opencl
2023 - Day 5
Running Part B OpenCL
Num platforms 1
Result is: 79004094

real    1m41.102s
user    1m40.903s
sys     0m0.073s

vscode ➜ .../adventofcode/2023/day05/rust (main) $ time cargo run ../full_data.txt part_b_parallel_forward
    Finished `dev` profile [optimized + debuginfo] target(s) in 0.04s
     Running `target/debug/day5 ../full_data.txt part_b_parallel_forward`
path: "../full_data.txt", run: "part_b_parallel_forward"
Part B forward parallel: 79004094

real    0m9.667s
user    4m27.507s
sys     0m0.142s

I was trying to get it running with the GPU as well but I'm having problems getting the docker container to find the GPU (separate problem)

The CUDA kernel version is similar and runs on 3 groups with 1024 cores each in about 24 seconds.

https://github.com/richClubb/adventofcode/blob/main/2023/day05/cuda/part_b/part_b.cu


Solution

  • Figured it out, it was related to the work sizes.

    With the local_work_size set to NULL, I think the runtime was iterating through the seed_ranges serially in a single thread; if you set the global_work_size to 28 (the number of cores) and the local_work_size to 1, then it will fully utilise the CPU.

    I didn't change the work_dim though.

    uint64_t global = num_seed_ranges; // one work-item per seed range (28 in my case)
    uint64_t local = 1;                // one work-item per work-group
    error = clEnqueueNDRangeKernel(
        commands, // command queue
        ko_part_b, // kernel
        1, NULL, // work_dim = 1, global_work_offset = NULL (unchanged)
        &global, // global work size: total number of work-items (28 here)
        &local, // local work size of 1 => 28 work-groups the CPU driver can spread across cores
        0, NULL, NULL // no wait-list, no completion event
    );
    

    Final Results: