parallel-processing, hpc, low-latency, chapel, parallelism-amdahl

Efficient collection and transfer of scattered sub-arrays in Chapel


Recently, I came across Chapel. I liked the examples given in the tutorials, but many of them struck me as embarrassingly parallel. I'm working on scattering problems in many-body quantum physics, and a common problem there can be reduced to the following:

  1. A tensor A of shape M x N x N is filled with the solutions of a matrix equation for M different parameters 1..M.
  2. A subset of the tensor A is needed to compute a correction term for each of the parameters 1..M.

The first part of the problem is embarrassingly parallel.

My question is thus whether, and how, it is possible to transfer only the needed subset of the tensor A to each locale of a cluster while minimizing the necessary communication.


Solution

  • When Chapel is doing its job right, transfers of array slices between distributed and local arrays (say) should be performed efficiently. This means that you should be able to write such tensor-subset transfers using Chapel's array slicing notation.

    For example, here's one way to write such a pattern:

    // define a domain describing a 5 x 7 x 3 index set anchored at index (x,y,z)
    const Slice = {x..#5, y..#7, z..#3};
    
    // create a new array variable that stores the elements from distributed array 
    // `myDistArray` locally
    var myLocalArray = myDistArray[Slice];
    

    The new variable myLocalArray will be an array whose elements are copies of the ones in myDistArray described by the indices in Slice. The domain of myLocalArray will be the slicing domain Slice, and since Slice is a non-distributed domain, myLocalArray will also be a local / non-distributed array, and therefore won't incur any of the overheads of using Chapel's distributed arrays when it's operated on from the current locale.
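
    For instance, a minimal check of this (assuming the declarations above) would be:

    // the copy's indices are exactly those of the non-distributed slicing domain
    writeln(myLocalArray.domain == Slice);

    // and the copy itself is a plain, non-distributed (DefaultRectangular) array
    writeln(myLocalArray.isDefaultRectangular());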

    To date, we have focused principally on optimizing such transfers for Block-distributed arrays. For cases like the one above, when myDistArray is Block-distributed, I see a fixed number of communications between the locales as I vary the size of the slice (though the size of those communications obviously varies with the number of elements that need to be transferred). Other cases and patterns are known to need more optimization work, so if you find a case that isn't performing or scaling as you'd expect, please file a Chapel GitHub issue against it to alert us to your need and/or help you find a workaround.
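
    If you'd like to see where those communications happen, here's a minimal sketch of how the per-locale counts could be reported (it assumes the diagnostics were started and stopped around the transfer, as in the full program at the end of this answer):

    use CommDiagnostics;

    // ...start diagnostics, copy the slice, then stop diagnostics, as in the
    // full program at the end of this answer...

    // report the GET/PUT counts recorded on each locale
    for (loc, counts) in zip(Locales, getCommDiagnostics()) do
      writeln(loc, ": ", counts.get, " GETs, ", counts.put, " PUTs");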

    So, sketching out the pattern you describe, I might imagine doing something like:

    // create a local and distributed version of the complete tensor space
    const LocTensorSpace = {1..M, 1..N, 1..N},
          TensorSpace = LocTensorSpace dmapped Block(LocTensorSpace);
    
    // declare array A to store the result of step 1
    var A: [TensorSpace] real;
    
    // ...compute A here...
    
    // declare a 1D distributed form of the parameter space to drive step 2    
    const ParameterSpace = {1..M} dmapped Block({1..M});
    
    // loop over the distributed parameter space; each locale will use all its cores
    // to compute on its subset of {1..M} in parallel
    forall m in ParameterSpace {
      // create a local domain to describe the indices you want from A
      const TensorSlice = { /* ...whatever indices you need here... */ };
    
      // copy those elements into a local array
      var locTensor = A[TensorSlice];
    
      // ...compute on locTensor here...
    }
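
    For concreteness, here's one purely hypothetical way the loop body could be filled in, assuming the correction for parameter m only needs that parameter's full N x N matrix and using a simple sum as a stand-in for the real correction computation:

    forall m in ParameterSpace {
      // hypothetical: parameter m only needs its own N x N matrix from A
      const TensorSlice = {m..m, 1..N, 1..N};

      // copy those elements into a local, non-distributed array
      var locTensor = A[TensorSlice];

      // stand-in for the real correction term: a sum reduction over the local copy
      const correction = + reduce locTensor;
      writeln("parameter ", m, ": correction = ", correction);
    }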
    

    There are some other things that seem related to me, but I don't want to bog this answer down with them, so feel free to ask follow-up questions if they're of interest.

    Finally, for the sake of posterity, here's the program I wrote while putting this response together to make sure I'd get the behavior I expected in terms of the number of communications and getting a local array back (this was with chpl version 1.23.0 pre-release (ad097333b1), though I'd expect the same behavior for recent releases of Chapel):

    use BlockDist, CommDiagnostics;
    
    config const M = 10, N=20;
    
    const LocTensorSpace = {1..M, 1..N, 1..N},
          TensorSpace = LocTensorSpace dmapped Block(LocTensorSpace);
    
    var A: [TensorSpace] real;
    
    // fill A with values that encode each element's (i,j,k) indices
    forall (i,j,k) in TensorSpace do
      A[i,j,k] = i + j / 100.0 + k / 100000.0;
    
    
    config const xs = 5, ys = 7, zs = 3,            // size of slice                
                 x = M/2-xs/2, y = N/2-ys/2, z = N/2-zs/2;  // origin of slice      
    
    
    const Slice = {x..#xs, y..#ys, z..#zs};
    
    writeln("Copying a ", (xs,ys,zs), " slice of A from ", (x,y,z));
    
    // measure the communications incurred by copying the slice into a local array
    resetCommDiagnostics();
    startCommDiagnostics();
    
    var myLocArr = A[Slice];
    
    stopCommDiagnostics();
    writeln(getCommDiagnostics());
    
    // print the local copy and check that it is a plain (DefaultRectangular) local array
    writeln(myLocArr);
    writeln(myLocArr.isDefaultRectangular());