[SOLVED] How is cilk reduce done (thread vs SIMD)

I have something like this:

  for (b=from; b<to; b++) 
  {
    for (a=from2; a<to2; a++) 
    {
      dest->ac[b] += srcvec->ac[a] * srcmatrix->weight[a+(b+from)*matrix_width];
    }
  }

that I'd like to parallelize using Cilk. I have written the following code:

for ( b=from; b<to; b++) 
{
  dest->ac[b] += __sec_reduce_add(srcvec->ac[from2:to2-from2] * (srcmatrix->weight+(b*matrix_width))[from2:to2-from2]);
}

The thing is, I could use a cilk_for on the outer loop, but if the reduce operation is already spawning threads, won't the cilk_for add to the threading overhead and slow the whole thing down? And should I add restrict to the dest and src arguments to further help the compiler, or is it implicit in this case?

(PS: I can't try the code right now because of

internal compiler error: in find_rank, at c-family/array-notation-common.c:244

neu1b->ac[0:layer1_size]=neu1->ac[0:layer1_size];

which I am also trying to solve.)

Solution

restrict is not implicit; you have to add it yourself. Furthermore, Cilk is implemented using work stealing, so it does not necessarily spawn extra threads for extra work. Instead, tasks are pushed onto a work stack, and idle workers take tasks from it. As far as I know, the array-notation reduction (__sec_reduce_add) is a data-parallel construct aimed at SIMD vectorization rather than at spawning tasks, so putting a cilk_for on the outer loop should not multiply threads. More information about the internals can be found in the Cilk FAQ.

The Intel compiler might handle things differently than GCC with Cilk. Intel VTune and the Intel vectorization report can help you measure performance differences and show whether the code was compiled to SIMD instructions. With the Intel compiler you can also request SIMD operations as follows:

#pragma simd placed above your loop

array notation, e.g. a[:] = b[:] + c[:], to write vectorized array operations.
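
To make this concrete, here is a minimal sketch of how the loop from the question could be laid out, assuming flat float arrays instead of the poster's structs. The names matvec_accumulate and width are hypothetical, and the row offset follows the second snippet in the question (b * matrix_width), which is an assumption about the intended indexing. The cilk_for fallback macro is only there so the same file also builds serially without Cilk Plus.

```c
#include <stddef.h>

/* Serial fallback: without Cilk Plus, cilk_for degrades to a plain for loop.
   __cilk is the macro that Cilk Plus compilers define; the function and
   parameter names below are hypothetical stand-ins for the poster's code. */
#ifdef __cilk
#include <cilk/cilk.h>
#else
#define cilk_for for
#endif

/* For each b in [from, to):
     dest[b] += sum over a in [from2, to2) of vec[a] * weight[a + b * width]
   restrict promises the compiler that dest, vec and weight never overlap,
   which it is not allowed to assume on its own. */
void matvec_accumulate(float *restrict dest,
                       const float *restrict vec,
                       const float *restrict weight,
                       int from, int to, int from2, int to2,
                       int width)
{
    /* Outer loop: iterations write to different dest[b], so they are
       independent and can be handed to the work-stealing scheduler. */
    cilk_for (int b = from; b < to; b++) {
        const float *row = weight + (size_t)b * (size_t)width;
        float sum = 0.0f;
        /* Inner loop: a plain reduction, a candidate for SIMD rather than
           for tasks. Under Cilk Plus it could instead be written as
           __sec_reduce_add(vec[from2:to2-from2] * row[from2:to2-from2]). */
        for (int a = from2; a < to2; a++) {
            sum += vec[a] * row[a];
        }
        dest[b] += sum;
    }
}
```

Whether the inner loop is actually vectorized depends on the compiler and its flags, so it is worth checking the vectorization report rather than assuming it.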