arraysparallel-processingcilkcilk-plus

The strange copy behavior of Intel Cilk Plus array notation


I'm using Intel Cilk Plus array notation to practice vector programming. However, I met strange copy behaviors of array assignments.

The problem to solve is parallel prefix. D is input, and P is output.

//Wrong result code. Tested on Intel icc 14.0.2.

void vec_prefix(int n, int D[n], int P[n]) {
  P[0:n] = D[0:n]; //initial copy
  int bound=1;
  int * Buf1 = _mm_malloc(n*sizeof(int), 16);
  int * Buf2 = _mm_malloc(n*sizeof(int), 16);

  while(bound<n){
    Buf1[0:n-bound] = P[0:n-bound];
    Buf2[0:n-bound] = P[bound:n-bound];
    //printf("WHY??\n");  //Add this fence, the result will be correct.
    P[bound:n-bound] = Buf1[0:n-bound] + Buf2[0:n-bound];
    bound<<=1;
  }
  _mm_free(Buf1); _mm_free(Buf2);
}

If I remove the comment of the line "printf", the result is correct. Otherwise, wrong. But if all copies follow the descriptions in the Intel's document, the code should be correct.

It seems if there is no such as memory fence, the first two lines' Buf1/Buf2 copy are not finished, and the add operation later uses some unstable value. Or, the compiler optimizer just use copy-propagation to remove the copy, and create "P[bound:n-bound] = P[0:n-bound] + P[bound:n-bound]". This is undefined in Intel's document.

//Correct result code

void vec_prefix(int n, int D[n], int P[n]) {
  P[0:n] = D[0:n]; //initial copy
  int bound=1;
  int * Buf1 = _mm_malloc(n*sizeof(int), 16);

  while(bound<n){
    //direct copy part
    Buf1[0:bound] = P[0:bound];
    //add part
    Buf1[bound:n-bound] = P[bound:n-bound] + P[0:n-bound];
    //copy back
    P[0:n] = Buf1[0:n];
    bound<<=1;
  }
  _mm_free(Buf1);
}

Ref: Intel's document


Solution

  • Your code looks correct to me. No "fence" should be required. I filed an internal bug report for it with the Intel compiler group with an example of all ones as input. Thanks for the example.

    If the compiler were working correctly, you could even shorten the while loop to:

    while(bound<n){
        Buf1[0:n-bound] = P[0:n-bound];
        P[bound:n-bound] = Buf1[0:n-bound] + P[bound:n-bound];
        bound<<=1;
    }
    

    The second array section assignment is okay since the overlap for P is exactly. Alas icc exhibits its bug for this example too.