parallel-processing, hpc, low-latency, chapel, parallelism-amdahl

Efficient collection and transfer of scattered sub-arrays in Chapel


Recently, I came across Chapel. I liked the examples given in the tutorials, but many of them struck me as embarrassingly parallel. I'm working on scattering problems in many-body quantum physics, and a common problem there can be reduced to the following:

  1. A tensor A of shape M x N x N is filled with the solutions of a matrix equation for M different parameters 1..M.
  2. A subset of the tensor A is needed to compute a correction term for each of the parameters 1..M.

The first part of the problem is embarrassingly parallel.

My question is thus whether, and how, it is possible to transfer only the needed subset of the tensor A to each locale of a cluster while minimizing the necessary communication.


Solution

  • When Chapel is doing its job right, transfers of array slices between distributed and local arrays (say) should be performed efficiently. This means that you should be able to write such tensor-subset transfers using Chapel's array slicing notation.

    For example, here's one way to write such a pattern:

    // define a domain describing a 5 x 7 x 3 index set anchored at index (x,y,z)
    const Slice = {x..#5, y..#7, z..#3};
    
    // create a new array variable that stores the elements from distributed array 
    // `myDistArray` locally
    var myLocalArray = myDistArray[Slice];
    

    The new variable myLocalArray will be an array whose elements are copies of the ones in myDistArray described by the indices in Slice. The domain of myLocalArray will be the slicing domain Slice, and since Slice is a non-distributed domain, myLocalArray will also be a local / non-distributed array, and therefore won't incur any of the overheads of using Chapel's distributed arrays when it's operated on from the current locale.
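
    For instance, a minimal check of this (assuming the declarations above) would be:

    // the copy's indices are exactly those of the non-distributed slicing domain
    writeln(myLocalArray.domain == Slice);

    // and the copy itself is a plain, non-distributed (DefaultRectangular) array
    writeln(myLocalArray.isDefaultRectangular());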

    To date, we have focused principally on optimizing such transfers for Block-distributed arrays. For cases like the one above, when myDistArray is Block-distributed, I see a fixed number of communications between the locales as I vary the size of the slice (though the size of those communications obviously varies with the number of elements that need to be transferred). Other cases and patterns are known to need more optimization work, so if you find a case that isn't performing or scaling as you'd expect, please file a Chapel GitHub issue against it to alert us to your need and/or help you find a workaround.
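
    If you'd like to see where those communications happen, here's a minimal sketch of how the per-locale counts could be reported (it assumes the diagnostics were started and stopped around the transfer, as in the full program at the end of this answer):

    use CommDiagnostics;

    // ...start diagnostics, copy the slice, then stop diagnostics, as in the
    // full program at the end of this answer...

    // report the GET/PUT counts recorded on each locale
    for (loc, counts) in zip(Locales, getCommDiagnostics()) do
      writeln(loc, ": ", counts.get, " GETs, ", counts.put, " PUTs");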

    So, sketching out the pattern you describe, I might imagine doing something like:

    // create a local and distributed version of the complete tensor space
    const LocTensorSpace = {1..M, 1..N, 1..N},
          TensorSpace = LocTensorSpace dmapped Block(LocTensorSpace);
    
    // declare array A to store the result of step 1
    var A: [TensorSpace] real;
    
    // ...compute A here...
    
    // declare a 1D distributed form of the parameter space to drive step 2    
    const ParameterSpace = {1..M} dmapped Block({1..M});
    
    // loop over the distributed parameter space; each locale will use all its cores
    // to compute on its subset of {1..M} in parallel
    forall m in ParameterSpace {
      // create a local domain to describe the indices you want from A
      const TensorSlice = { /* ...whatever indices you need here... */ };
    
      // copy those elements into a local array
      var locTensor = A[TensorSlice];
    
      // ...compute on locTensor here...
    }
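
    For concreteness, here's one purely hypothetical way the loop body could be filled in, assuming the correction for parameter m only needs that parameter's full N x N matrix and using a simple sum as a stand-in for the real correction computation:

    forall m in ParameterSpace {
      // hypothetical: parameter m only needs its own N x N matrix from A
      const TensorSlice = {m..m, 1..N, 1..N};

      // copy those elements into a local, non-distributed array
      var locTensor = A[TensorSlice];

      // stand-in for the real correction term: a sum reduction over the local copy
      const correction = + reduce locTensor;
      writeln("parameter ", m, ": correction = ", correction);
    }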
    

    There are some other things that seem related to me, but I don't want to bog this answer down with them, so feel free to ask follow-up questions if they're of interest.

    Finally, for the sake of posterity, here's the program I wrote while putting this response together to make sure I'd get the behavior I expected in terms of the number of communications and getting a local array back (this was with chpl version 1.23.0 pre-release (ad097333b1), though I'd expect the same behavior for recent releases of Chapel):

    use BlockDist, CommDiagnostics;
    
    config const M = 10, N=20;
    
    const LocTensorSpace = {1..M, 1..N, 1..N},
          TensorSpace = LocTensorSpace dmapped Block(LocTensorSpace);
    
    var A: [TensorSpace] real;
    
    // fill A with values that encode each element's (i,j,k) indices
    forall (i,j,k) in TensorSpace do
      A[i,j,k] = i + j / 100.0 + k / 100000.0;
    
    
    config const xs = 5, ys = 7, zs = 3,            // size of slice                
                 x = M/2-xs/2, y = N/2-ys/2, z = N/2-zs/2;  // origin of slice      
    
    
    const Slice = {x..#xs, y..#ys, z..#zs};
    
    writeln("Copying a ", (xs,ys,zs), " slice of A from ", (x,y,z));
    
    // measure the communications incurred by copying the slice into a local array
    resetCommDiagnostics();
    startCommDiagnostics();
    
    var myLocArr = A[Slice];
    
    stopCommDiagnostics();
    writeln(getCommDiagnostics());
    
    // print the local copy and check that it is a plain (DefaultRectangular) local array
    writeln(myLocArr);
    writeln(myLocArr.isDefaultRectangular());