Below I sketch code that is parallelised using OpenMP tasks.
In the main function a parallel region is started, and immediately afterwards the code is wrapped into a #pragma omp master
section. After computing the expected workload, and depending on whether that workload is below a given threshold, the work is either passed to a serial function or to a function that recursively splits the workload and spawns separate tasks. The results of the individual #pragma omp task
blocks are then aggregated after a #pragma omp taskwait
directive.
int main() {
    #pragma omp parallel
    {
        #pragma omp master
        {
            // do some serial stuff
            // estimate if parallelisation is worth it
            const int workload = estimateWorkload();
            if (workload < someBound) {
                serialFunction();
            }
            else {
                parallelFunction(workload);
            }
        }
    }
}
int parallelFunction(int workload) {
    if (workload < someBound) {
        return serialFunction();
    }

    int result1, result2;

    #pragma omp task shared(result1)
    {
        result1 = parallelFunction(workload/2);
    }
    #pragma omp task shared(result2)
    {
        result2 = parallelFunction(workload/2);
    }
    #pragma omp taskwait

    return result1 < result2;
}
How do I measure the actual computing time of each thread in such a setting?
If I measure CPU time with k
active threads, then I get k * wallTime
, which makes sense, because the threads are created by the leading #pragma omp parallel
directive and remain active the whole time. However, this does not tell me how much time the threads spend actually working, which makes the code hard to analyse.
Q : How do I measure the actual computing time of each thread in such a setting?
A trivial MOCK-UP CODE for simple, semi-manual profiling of code-execution times:
Needless to say, on a "noisy" execution platform the choice of CLOCK_MONOTONIC
saves you from drifting time updates, yet it does not "save" you from off-CPU-core wait states, caused by whatever ( the more so if heavy ) "background" ( disturbing ) processes the O/S happens to schedule.
Still, for a prototyping phase, this is way easier than mounting all the "omp-native" callbacks' { ompt_callback_task_create_t, ompt_callback_task_schedule_t, ompt_callback_task_dependence_t, ompt_callback_dispatch_t, ompt_callback_sync_region_t, ..., ompt_callback_thread_begin_t, ompt_callback_thread_end_t, ... }
handling.
The trivial code permits, once the recorded, nested, per-call durations get reported and post-processed, to "frame" the hidden costs of the associated call signatures and of the recursion-nesting-related overheads.
The revised, overhead-strict Amdahl's Law then stops lying to you and starts to show, more precisely, when this code begins to lose on those overhead-related ( plus potential atomicity-of-work-unit ) principally-[SERIAL]
-add-on costs to any expected true-[PARALLEL]
-section speedup ( expected from harnessing more ( those, and only those, otherwise free ) resources ).
That is always the hardest part of the War ( still to be fought ahead ... ).
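A numeric sketch of that overhead-strict reading ( the helper and its overhead parameters are my own illustration, not a canonical formula ):

```c
/* Overhead-strict speedup estimate:
 *   S(N) = 1 / ( s + oSetup + oTask * nTasks + ( 1 - s ) / N )
 * where s is the serial fraction and the o-terms are add-on overheads,
 * expressed as fractions of the single-core run time. */
double overhead_strict_speedup( double s,        /* serial fraction         */
                                double oSetup,   /* one-off spawn overhead  */
                                double oTask,    /* per-task overhead       */
                                int    nTasks,   /* tasks actually spawned  */
                                int    N )       /* cores harnessed         */
{
    return 1.0 / ( s
                 + oSetup
                 + oTask * (double) nTasks
                 + ( 1.0 - s ) / (double) N );
}
```

With s = 0.05 and zero overheads, 8 cores promise about a 5.9x speedup; charge a 1 % setup cost plus 0.1 % per task for 30 spawned tasks, and the same 8 cores deliver only about 4.8x. The overhead terms, not the core count, set the ceiling.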
EFFICIENCY of SCHEDULING & OCCUPIED RESOURCES of a CALL to 2-ary task-SCHEDULED fun() with hidden 1-ary RECURSION:
CALL
42----*--------------------------------------------------------------------------------------*
: | |
: | 21----*---------------------------------------*
: | : | |
: | : | 10----*----------------*
: | : | : | |
: | : | : | 5----*----*
: | : | : | : | |
: | : | : | : | 2<
: | : | : | : 2< /
: | : | : 5----*----* 5___/___/................ #taskwait 2
: | : | : : | | /
: | : | : : | 2< /
: | : | : : 2< / /
: | : | : 5___/___/ /
: | : | 10___/____________/............................. #taskwait 5
: | : 10----*----------------* /
: | : : | | /
: | : : | 5----*----* /
: | : : | : | | /
: | : : | : | 2< /
: | : : | : 2< / /
: | : : 5----*----* 5___/___/ /
: | : : : | | / /
: | : : : | 2< / /
: | : : : 2< / / /
: | : : 5___/___/ / /
: | : 10___/____________/__________/.......................................................... #taskwait 10
: | 21___/
: 21----*---------------------------------------* /
: : | | /
: : | 10----*----------------* /
: : | : | | /
: : | : | 5----*----* /
: : | : | : | | /
: : | : | : | 2< /
: : | : | : 2< / /
: : | : 5----*----* 5___/___/ /
: : | : : | | / /
: : | : : | 2< / /
: : | : : 2< / / /
: : | : 5___/___/ / /
: : | 10___/____________/ /
: : 10----*----------------* / /
: : : | | / /
: : : | 5----*----* / /
: : : | : | | / /
: : : | : | 2< / /
: : : | : 2< / / /
: : : 5----*----* 5___/___/ / /
: : : : | | / / /
: : : : | 2< / / /
: : : : 2< / / / /
: : : 5___/___/ / / /
: : 10___/____________/__________/ /
: 21___/_______________________________________________________/...................................................................................................................... #taskwait 21
42___/
RET_/
The EFFICIENCY of SCHEDULING & OCCUPIED RESOURCES of a CALL to the 2-ary task
-SCHEDULED fun()
with hidden 1-ary RECURSION matters more and more for any growing scale of the workload
: the recursion bottoms out only once workload < someBound * 2 ^ W
, i.e. at the cost of an awfully high W
, which in turn means W
-many times re-{ -acquired, -allocated, -released }, i.e. Wasted, sets of the #pragma omp task shared(...)
-handling-related resources, requested at every level, throughout the progression of the whole pure-[SERIAL]
-by-definition recursion dive-&-resurfacing back.
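That growth can be counted concretely ( a tiny mock-up of mine, mirroring the 2-ary recursion in the question ):

```c
/* Every call with workload >= someBound spawns two child tasks, so the
 * number of #pragma omp task constructs created grows as ~2^(W+1) - 2
 * for W halving levels -- all of it pure-[SERIAL] task-management cost. */
long count_spawned_tasks( int workload, int someBound ) {
    if ( workload < someBound )
        return 0;                                 /* serial leaf: no tasks  */
    return 2                                      /* the two tasks spawned  */
         + 2 * count_spawned_tasks( workload / 2, someBound );
}
```

For the mock-up's workload = 42 and someBound = 3, the halving chain 42 -> 21 -> 10 -> 5 -> 2 spawns 2 * ( 1 + 2 + 4 + 8 ) = 30 tasks before every branch goes serial.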
It is easy to see how many resources will hang waiting ( due to even a 1-ary RECURSION formulation ), until each and every dive to the deepest level of the recursion bubbles back up to the top-level #pragma omp taskwait
.
The costs of re-allocating new and new resources for each recursion-dive level will most often kill you, performance-wise, on the overhead-strict Amdahl's Law, if you do not first run into thrashing or a system-configuration-related overflow from exhausting the real system's physical resources, for any reasonably large recursion depth.
These are costs you need not have paid, had you avoided a "typomatically cheap", yet expensive-in-( idle / wasted )-resources, recursive problem formulation, even in the most lightweight 1-ary cases.
See how many :
-denoted "waiting lines" run in parallel in the figure, besides how few |
-denoted "computing lines" there are in either phase of the topology. All of these waste, block, and must leave idle the task-related resources ( memory and stack space are just the most visible ones ), which are, performance-wise, very expensive to acquire ( only to spend most of the processing time idle, waiting ) or prone to crash from overflow if over-subscribed beyond the real system's configured capacities.
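For completeness, the usual OpenMP remedy for exactly this resource waste ( not present in the original code; the 4 * someBound cutoff is my arbitrary choice ) is to stop task creation below a threshold via the task construct's if() and final() clauses:

```c
/* Sketch of the task-cutoff remedy: the if() clause makes sub-threshold
 * tasks execute immediately ( undeferred ) in the creating thread, and
 * final() additionally suppresses task creation in the whole subtree, so
 * the deepest recursion levels pay no task-management costs.
 * Compiled without -fopenmp the pragmas are ignored and the same ( here
 * just a mock sum ) result is produced, only fully serially. */
int parallel_with_cutoff( int workload, int someBound ) {
    if ( workload < someBound )
        return workload;                               /* mock serial leaf  */
    int result1, result2;
    #pragma omp task shared( result1 ) if( workload > 4 * someBound ) final( workload <= 4 * someBound )
    result1 = parallel_with_cutoff( workload / 2, someBound );
    #pragma omp task shared( result2 ) if( workload > 4 * someBound ) final( workload <= 4 * someBound )
    result2 = parallel_with_cutoff( workload / 2, someBound );
    #pragma omp taskwait
    return result1 + result2;
}
```

With the cutoff in place, only the top levels of the tree pay the task-spawning price; the deep, cheap levels run inline.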
The War is yours! Keep Walking ...
The Site-Compliance Disclaimer :
------------------------------------------------------------------------------
As per StackOverflow policy, the full mock-up code is posted here, in case the Godbolt.org platform might ever turn inaccessible; otherwise, feel free to prefer and/or use the Compiler Explorer tools there, which go way, way beyond the sequence of characters put into the mock-up source code.
The choice & the joy of executing it are always yours :o)
#include <stdio.h> // ____________________________________________________ fprintf() duration reporting
#include <time.h>  // ____________________________________________________ clock_gettime()
int estimateWorkload() {
    return 42; // ________________________________________________________ mock-up "workload"
}

int serial_a_bit_less_naive_factorial( int n ) { // int overflow for n > 12 is harmless here, this is mock CPU work only
    return ( n < 3 ) ? n : n * serial_a_bit_less_naive_factorial( n - 1 );
}

int serialFunction() {
    return serial_a_bit_less_naive_factorial( 76 );
}
int parallelFunction( int workload, const int someBound ) { // ___________ pass both control parameters
    struct timespec T0, T1;
    int retFlag,
        retValue,
        result1,
        result2;

    retFlag = clock_gettime( CLOCK_MONOTONIC, &T0 ); // \/\/\/\/\/\/\/\/\/ SECTION.begin
    if ( workload < someBound ) {
        retValue = serialFunction();
    }
    else { // -- [SEQ]----------------------------------------------------
        #pragma omp task shared( result1 ) // -- [PAR]|||||||||||||||||||| with (1-ary recursions)
        {
            result1 = parallelFunction( workload / 2, someBound );
        }
        #pragma omp task shared( result2 ) // -- [PAR]|||||||||||||||||||| with (1-ary recursions)
        {
            result2 = parallelFunction( workload / 2, someBound );
        }
        #pragma omp taskwait
        retValue = result1 < result2;
    }
    retFlag = clock_gettime( CLOCK_MONOTONIC, &T1 ); // \/\/\/\/\/\/\/\/\/ SECTION.end
    (void) retFlag; // ___________________________________________________ retCode checks omitted in the mock-up
    fprintf( stderr, "workload %3d took %12lld ns\n", // _________________ the semi-manual duration report
             workload,
             ( (long long) T1.tv_sec - T0.tv_sec ) * 1000000000LL
             + ( T1.tv_nsec - T0.tv_nsec ) );
    // _________________________________________________________________ MAY ADD ACCUMULATION (1-ary recursions)
    return retValue;
}
int main() {
    const int someBound = 3; // ______________________________________ a control parameter A

    #pragma omp parallel
    {
        #pragma omp master
        {
            // -- [SEQ]---------------------------------------- do some serial stuff
            // ------------------------------estimate if parallelisation is worth it
            const int workload = estimateWorkload();
            if ( workload < someBound ) {
                serialFunction();
            }
            else {
                parallelFunction( workload, someBound ); // -- [PAR]||||||| with (1-ary recursions)
            }
        }
    }
}