graphicsparallel-processingcpuoctree

Parallel octree construction on the CPU


I did a simple implementation of an octree. Now I'm trying to make the construction of the tree parallel on the CPU.

First I tried to make the step of adding points to the children of the tree parallel (being the most costly step in the construction), but due to having to lock the vector/list each time I add a point it didn't gain any performance benefits.

Now I'm trying to make the construction of each node in the tree parallel. The idea is simple and should be straight forward as there is no intersection between the nodes. I simply need to assign each thread a node to work on. The issue is that it is a nested top-down implementation so I'm not sure what is the best way to implement this.

I'm using C++ and OpenMP. I tried writing this inside the build function:

 omp_set_nested(1);
#pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < child_count; i++) {
        _child[i]->build(threshold, maximumDepth, currentDepth + 1);
    }

But the performance became way worse than the sequential one.

Then I tried to parallelize just the top 8 nodes (of the root node). This gave me a performance gain of X2-X3. However it depends heavily on the scene. if my scene is way too unbalanced the parallelism will have very few benefits as 7 of the top 8 nodes could be almost empty and one node has all the other points.

Any thoughts on how to do this correctly?


Solution

  • Use tasks instead of nested parallelism. Try this code:

    #pragma omp parallel
    #pragma omp single nowait
    {       
                \\ first call to build function       
    }
    

    Inside build function use the following code. Note that currentDepth> XXX is a condition to stop creating more tasks.

    for (int i = 0; i < child_count; i++) 
    {
       #pragma omp task final( currentDepth> XXX ) mergeable \
       default(none) firstprivate(threshold, maximumDepth, currentDepth)
         _child[i]->build(threshold, maximumDepth, currentDepth + 1);
    }
    

    If child_count > 1 the following code may be faster:

    for (int i = 0; i < child_count-1; i++) 
    {
       #pragma omp task final( currentDepth> XXX ) mergeable \
       default(none) firstprivate(threshold, maximumDepth, currentDepth)
         _child[i]->build(threshold, maximumDepth, currentDepth + 1);
    }
    _child[child_count-1]->build(threshold, maximumDepth, currentDepth + 1);