I have this self-contained example of a TBB application that I run on a 2-NUMA-node CPU that performs a simple vector addition repeatedly on dynamic arrays. It recreates an issue that I am having with a bit more complicated example. I am trying to divide the computations cleanly between the available NUMA nodes by initializing the data in parallel with 2 task_arenas
that are linked to separate NUMA nodes through TBB's NUMA API. The subsequent parallel execution should then be conducted so that that memory accesses are performed on data that is local to the cpu that computes its task. A control example uses a simple parallel_for
with a static_partitioner
to perform the computation while my intended example invokes per task_arena
a task which invokes a parallel_for
to compute the vector addition of the designated region, i.e. the half of the dynamic arena that was initialized before in the corresponding NUMA node. This example always takes twice as much time to perform the vector addition compared to the control example. It cannot be the overhead of creating the tasks for the task_arena
s that will invoke the parallel_for
algorithms, because the performance degradation only occurs when the tbb::task_arena::constraints
are applied. Could anyone explain to me what happens and why this performance penalty is so harsh. A direction to resources would also be helpful as I am doing this for a university project.
#include <iostream>
#include <iomanip>
#include <tbb/tbb.h>
#include <vector>
int main(){
std::vector<int> numa_indexes = tbb::info::numa_nodes();
std::vector<tbb::task_arena> arenas(numa_indexes.size());
std::size_t numa_nodes = numa_indexes.size();
for(unsigned j = 0; j < numa_indexes.size(); j++){
arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
}
std::size_t size = 10000000;
std::size_t part_size = std::ceil((float)size/numa_nodes);
double * A = (double *) malloc(sizeof(double)*size);
double * B = (double *) malloc(sizeof(double)*size);
double * C = (double *) malloc(sizeof(double)*size);
double * D = (double *) malloc(sizeof(double)*size);
//DATA INITIALIZATION
for(unsigned k = 0; k < numa_indexes.size(); k++)
arenas[k].execute(
[&](){
std::size_t local_start = k*part_size;
std::size_t local_end = std::min(local_start + part_size, size);
tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
[&](std::size_t i)
{
C[i] = D[i] = 0;
A[i] = B[i] = 1;
}, tbb::static_partitioner());
});
//PARALLEL ALGORITHM
tbb::tick_count t0 = tbb::tick_count::now();
for(int i = 0; i<100; i++)
tbb::parallel_for(static_cast<std::size_t>(0), size,
[&](std::size_t i)
{
C[i] += A[i] + B[i];
}, tbb::static_partitioner());
tbb::tick_count t1 = tbb::tick_count::now();
std::cout << "Time 1: " << (t1-t0).seconds() << std::endl;
//TASK ARENA & PARALLEL ALGORITHM
t0 = tbb::tick_count::now();
for(int i = 0; i<100; i++){
for(unsigned k = 0; k < numa_indexes.size(); k++){
arenas[k].execute(
[&](){
for(unsigned i=0; i<numa_indexes.size(); i++)
task_groups[i].wait();
task_groups[k].run([&](){
std::size_t local_start = k*part_size;
std::size_t local_end = std::min(local_start + part_size, size);
tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
[&](std::size_t i)
{
D[i] += A[i] + B[i];
});
});
});
}
t1 = tbb::tick_count::now();
std::cout << "Time 2: " << (t1-t0).seconds() << std::endl;
double sum1 = 0;
double sum2 = 0;
for(int i = 0; i<size; i++){
sum1 += C[i];
sum2 += D[i];
}
std::cout << sum1 << std::endl;
std::cout << sum2 << std::endl;
return 0;
}
Performance with:
for(unsigned j = 0; j < numa_indexes.size(); j++){
arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
}
$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.896496
Time 2: 1.60392
2e+07
2e+07
Performance without constraints:
$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.652501
Time 2: 0.638362
2e+07
2e+07
EDIT: I implemented the use of task_group
as found in @AlekseiFedotov's suggested resources, but the issue still remains.
Part of the provided example where the work with arenas happens is not one-to-one match to the example from the docs, "Setting the preferred NUMA node" section.
Looking further into the specification of the task_arena::execute()
method, we can find out that the task_arena::execute()
is a blocking API, i.e. it does not return until the passed lambda completes.
On the other hand, the specification of the task_group::run()
method reveals that its method is asynchronous, i.e. returns immediately, not waiting for the passed functor to complete.
That is where the problem lies, I guess. The code executes two parallel loops within arenas one by one, in a serial manner so to say. Consider following the example from the docs carefully.
BTW, the oneTBB project, which is the revamped version of the TBB, can be found here.
EDIT answer for the EDITED question:
arena[i].execute( [i, &] { task_group[i].run( [i, &] { /*...*/ } ); } )
, then, in another loop, wait for each task_group
within corresponding task_arena
.Please note how I capture the NUMA loop iteration by copy. Otherwise, the code might be referring the wrong data inside the lambda body.