Tags: c, multithreading, parallel-processing, cilk, cilk-plus

Why does CILK_NWORKERS affect program with only one cilk_spawn?


I am trying to parallelize a matrix-processing program. After using OpenMP, I decided to also check out CilkPlus, and I noticed the following:

In my C code, I apply parallelism in only one place, i.e.:

//(test_function declarations)

cilk_spawn highPrep(d, x, half);

d = temp_0;
r = malloc(sizeof(int)*(half));
temp_1 = r;
x = x_alloc + F_EXTPAD;
lowPrep(r, d, x, half);

cilk_sync;

//test_function return

According to the documentation I have read so far, cilk_spawn may (CilkPlus does not enforce parallelism) execute the highPrep() function on a different hardware thread, should one be available, while execution continues with the rest of the code, including the call to lowPrep(). The threads should then synchronize at cilk_sync before execution proceeds with the rest of the code.

I am running this on an 8-core/16-thread Xeon E5-2680 that runs nothing else apart from my experiments. What I notice is that when I change the environment variable CILK_NWORKERS to values such as 2, 4, 8, or 16, the time test_function takes to execute varies considerably. In particular, the higher CILK_NWORKERS is set (above 2), the slower the function becomes. This seems counterintuitive to me, since I would expect the number of available threads not to change the way cilk_spawn operates. I would expect that if 2 threads are available, highPrep is executed on another thread, and anything beyond 2 threads would remain idle.
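For reference, this is a minimal sketch of how the worker count is controlled and checked, assuming the Cilk Plus runtime header <cilk/cilk_api.h> and its __cilkrts_set_param()/__cilkrts_get_nworkers() calls (this is separate from my program above):

#include <stdio.h>
#include <cilk/cilk_api.h>

int main(void)
{
    /* CILK_NWORKERS is read when the runtime starts, e.g.
     *     CILK_NWORKERS=4 ./a.out
     * Alternatively, the count can be set programmatically before the
     * runtime is first used: */
    __cilkrts_set_param("nworkers", "4");

    printf("Cilk runtime is using %d workers\n", __cilkrts_get_nworkers());
    return 0;
}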

The highPrep and lowPrep functions are:

void lowPrep(int *dest, int *src1, int *src2, int size)
{
   double temp;
   int i;
   for (i = 0; i < size; i++)
   {
      temp = -.25 * (src1[i] + src1[i + 1]) + .5;
      /* round temp down (floor) using integer casts */
      if (temp > 0)
         temp = (int)temp;
      else
      {
         if (temp != (int)temp)
            temp = (int)(temp - 1);
      }
      dest[i] = src2[2*i] - (int)temp;
   }
}

void highPrep(int *dest, int *src, int size)
{
   double temp;
   int i;
   for (i = 0; i < size + 1; i++)
   {
      temp = -1.0/16 * (src[-4 + 2*i] + src[2 + 2*i]) + 9.0/16 * (src[-2 + 2*i] + src[0 + 2*i]) + 0.5;
      /* round temp down (floor) using integer casts */
      if (temp > 0)
         temp = (int)temp;
      else
      {
         if (temp != (int)temp)
            temp = (int)(temp - 1);
      }
      dest[i] = src[-1 + 2*i] - (int)temp;
   }
}

There must be a reasonable explanation for this. Is it reasonable to expect such different execution times for a program like this?


Solution

  • Clarification: Cilk does "continuation stealing", not "child stealing", so highPrep always runs on the same hardware thread as its caller. It's the "rest of the code" that might end up running on a different thread (the first sketch at the end of this answer makes the distinction visible). See this primer for a fuller explanation.

    As to the slowdown, it's probably an artifact of the implementation being biased towards high degrees of parallelism that can consume all threads. The extra threads are looking for work and, in the process of doing so, eat up some memory bandwidth and, on hyperthreaded processors, some core cycles. The Linux "Completely Fair Scheduler" has given us some grief in this area because sleep(0) no longer gives up the timeslice. It's also possible that the extra threads cause the OS to map software threads onto the machine less efficiently.

    The root of the problem is a tricky tradeoff: running thieves aggressively enables them to pick up work faster when it appears, but also causes them to consume resources unnecessarily when no work is available. Putting thieves to sleep when there is no work saves those resources, but adds significant overhead to spawn, since a spawning thread now has to check whether there are sleeping threads to be woken up. TBB pays this overhead, but it's not much for TBB because TBB's spawn overhead is much higher anyway. The current Cilk implementation does pay this tax: it puts workers to sleep only during sequential execution, so during parallel phases the extra workers keep spinning, which matches what you are observing.

    The best (but imperfect) advice I can give is to find more parallelism, so that no worker threads loiter for long and cause trouble; the second sketch below shows one way to do that for the prep loops.
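
    To make the clarification above concrete, here is a minimal sketch (separate from the code in the question), assuming the Cilk Plus runtime headers <cilk/cilk.h> and <cilk/cilk_api.h> and their __cilkrts_get_worker_number() call. With CILK_NWORKERS greater than 1, the spawned call reports the same worker number as the code before the spawn, while the continuation may report a different one:

    #include <stdio.h>
    #include <cilk/cilk.h>
    #include <cilk/cilk_api.h>

    /* The spawned call stays on the spawning worker. */
    static void spawned_call(void)
    {
        printf("spawned call on worker %d\n", __cilkrts_get_worker_number());
    }

    int main(void)
    {
        printf("before spawn on worker %d\n", __cilkrts_get_worker_number());
        cilk_spawn spawned_call();
        /* Under continuation stealing, this continuation is what a thief
         * may pick up, not spawned_call() itself. */
        printf("continuation on worker %d\n", __cilkrts_get_worker_number());
        cilk_sync;
        return 0;
    }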
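
    One possible way to follow that advice, as a sketch rather than a drop-in replacement, is to parallelize the prep loops themselves with cilk_for, so that every worker has real work instead of spinning. The body below mirrors lowPrep from the question (lowPrep_parallel is just an illustrative name; the iterations are independent, since each one writes only dest[i]):

    #include <cilk/cilk.h>

    void lowPrep_parallel(int *dest, int *src1, int *src2, int size)
    {
        /* cilk_for splits the iteration space across the available workers. */
        cilk_for (int i = 0; i < size; i++)
        {
            double temp = -.25 * (src1[i] + src1[i + 1]) + .5;
            /* round temp down (floor) using integer casts, as in the original */
            if (temp > 0)
                temp = (int)temp;
            else if (temp != (int)temp)
                temp = (int)(temp - 1);
            dest[i] = src2[2*i] - (int)temp;
        }
    }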