Tags: android, c++, arm, profiling, threadpool

C++ Threadpool: Why doesn't it scale perfectly on Android & Arm Tablet


I am trying to write a threadpool that divides linearly separable work (loops with independent iterations) among threads. On my x86 laptop the threadpool scales linearly, down to the microsecond. On an Arm tablet (big.LITTLE architecture) running Android, the scaling is not perfect.

The numbers are:

-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                                             Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------------------------
BM<dispatch_single>/8/real_time                                                    1.17 ms         1.17 ms          598
BM<dispatch_single>/16/real_time                                                   2.33 ms         2.33 ms          300
BM_threadpool<dispatch_threadpool, ThreadPool, num_threads>/8/real_time           0.756 ms        0.672 ms          702
BM_threadpool<dispatch_threadpool, ThreadPool, num_threads>/16/real_time           1.48 ms         1.30 ms          430
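
These appear to be Google Benchmark results: the /8 and /16 suffixes are the benchmark argument and /real_time means wall-clock time is reported. A registration of the single-threaded baseline might look roughly like the sketch below (the work function and the scaling factor are made up; only the Arg/UseRealTime mechanics match the output above):

#include <benchmark/benchmark.h>

#include <cmath>
#include <cstddef>

// Hypothetical per-index work standing in for the real kernel.
static void work_item(void* /*ctx*/, size_t idx) {
    benchmark::DoNotOptimize(std::sin(static_cast<double>(idx)));
}

// Single-threaded baseline: run the whole range on the calling thread.
static void BM_dispatch_single(benchmark::State& state) {
    const size_t range = static_cast<size_t>(state.range(0)) * 50000;
    for (auto _ : state) {
        for (size_t i = 0; i < range; ++i) work_item(nullptr, i);
    }
}
// Produces rows named BM_dispatch_single/8/real_time and .../16/real_time.
BENCHMARK(BM_dispatch_single)->Arg(8)->Arg(16)->UseRealTime();

BENCHMARK_MAIN();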

Perfect scaling would mean that the wall-clock times of the last two rows in the table come out at roughly 0.6 ms and 1.2 ms. The threadpool header is as follows:

#include <stddef.h>

#include <atomic>
#include <thread>
#include <vector>

class ThreadPool {
    struct ThreadStorage {
        size_t start{0};  // first index of this thread's range
        size_t end{0};    // one past the last index of this thread's range
        size_t id{0};     // thread number, assigned in the constructor
        ThreadPool* pool{nullptr};
        std::atomic_flag flag{false};  // set when work (or termination) is pending
        void main();
    };

   public:
    using function = void(void* context, size_t idx);
    explicit ThreadPool(size_t num_threads);
    ~ThreadPool();
    void QueueTask(function*, void* context, size_t range);

   private:
    std::atomic<bool> terminate{false};  // Stop looking for jobs
    std::atomic_flag finish_flag{false};  // Set by the last worker to finish
    std::vector<std::jthread> threads;
    std::vector<ThreadStorage> storage;
    std::atomic<size_t> completed_tasks{0};  // Workers done with the current job
    function* task{nullptr};  // Current job and its context
    void* ctx{nullptr};
};

and the implementation is

#include "threadpool.h"

ThreadPool::ThreadPool(size_t num_threads)
    : threads(num_threads - 1), storage(num_threads), completed_tasks(0) {
    for (size_t ii = 0; ii < num_threads; ++ii) {
        storage[ii].pool = this;
        storage[ii].id = ii + 1;
        if (ii > 0) {
            threads[ii - 1] =
                std::jthread(&ThreadPool::ThreadStorage::main, &storage[ii]);
        }
    }
}

void ThreadPool::ThreadStorage::main() {
    while (true) {
        // Sleep until QueueTask (or the destructor) sets this thread's flag.
        flag.wait(false);
        if (pool->terminate) {
            return;
        }
        // Run this thread's slice of the current job.
        for (size_t i = start; i < end; ++i) {
            (*(pool->task))(pool->ctx, i);
        }
        flag.clear();
        // The last worker to finish wakes the caller blocked in QueueTask.
        if ((pool->completed_tasks.fetch_add(1) + 1) == pool->threads.size()) {
            pool->finish_flag.test_and_set();
            pool->finish_flag.notify_one();
        }
    }
}

void ThreadPool::QueueTask(ThreadPool::function* function, void* context,
                           size_t range) {
    {
        size_t start{0};
        const size_t stepsize{range / storage.size()};
        task = function;
        ctx = context;
        completed_tasks = 0;
        for (size_t i = 0; i < storage.size(); ++i) {
            ThreadStorage& thread = storage[i];
            thread.start = start;
            // The last chunk absorbs the remainder when range is not evenly divisible.
            thread.end = (i + 1 == storage.size()) ? range : start + stepsize;
            start = thread.end;
            // Publish the range, then wake the worker assigned to it.
            thread.flag.test_and_set();
            thread.flag.notify_one();
        }
    }
    // The calling thread executes the first chunk itself.
    for (size_t i = storage[0].start; i < storage[0].end; ++i) {
        (*task)(ctx, i);
    }
    {
        // Block until the last worker thread has finished its chunk.
        finish_flag.wait(false);
        completed_tasks = 0;
        finish_flag.clear();
    }
}

ThreadPool::~ThreadPool() {
    terminate = true;
    // Wake every worker so it can observe `terminate`; the jthreads then join
    // automatically in their destructors.
    for (ThreadStorage& thread : storage) {
        thread.flag.test_and_set();
        thread.flag.notify_one();
    }
}
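
For reference, the pool is driven roughly like this (a minimal usage sketch, not from the original post; the kernel, the element count and the thread count are made up):

#include <cstddef>
#include <vector>

#include "threadpool.h"

// Hypothetical kernel: one independent loop iteration per index.
static void scale_element(void* context, size_t idx) {
    auto& data = *static_cast<std::vector<float>*>(context);
    data[idx] *= 2.0f;
}

int main() {
    std::vector<float> data(1u << 20, 1.0f);
    ThreadPool pool(4);  // e.g. four threads on a quad-core device
    // Blocks until every index in [0, data.size()) has been processed.
    pool.QueueTask(&scale_element, &data, data.size());
}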

My question is: why does the threadpool not scale (near-)perfectly on the Android/Arm tablet when it does on x86?


Solution

  • The reason the runtime didn't scale well with the number of threads is the aggressive frequency scaling on the target device. First of all, the average clock frequency drops as more cores become busy. On top of that, the clock speeds of the individual cores differ, which also makes dividing the work into equal ranges problematic. This happened much less on my development machine, and did not happen at all on the Arm server on which I tested the threadpool.
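
One way to confirm this is to sample /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq while the benchmark runs. As a mitigation for the uneven per-core speeds, one option (not part of the original answer, just a sketch) is to hand out the index range in small chunks from a shared atomic counter instead of fixed per-thread slices, so that faster cores naturally end up doing more of the work:

#include <algorithm>
#include <atomic>
#include <cstddef>

// Sketch of dynamic chunking: every participating thread repeatedly grabs the
// next chunk from a shared counter until the range is exhausted. Cores running
// at a higher clock simply claim more chunks than the slower ones.
// `chunk` trades scheduling overhead against load balance.
inline void run_dynamic(std::atomic<size_t>& next, size_t range, size_t chunk,
                        void (*task)(void* ctx, size_t idx), void* ctx) {
    for (;;) {
        const size_t begin = next.fetch_add(chunk, std::memory_order_relaxed);
        if (begin >= range) {
            return;  // nothing left to claim
        }
        const size_t end = std::min(begin + chunk, range);
        for (size_t i = begin; i < end; ++i) {
            task(ctx, i);
        }
    }
}

Each worker thread (and the calling thread) would run this loop against the same counter; in the pool above, that would replace the fixed start/end members of ThreadStorage.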