
Idiomatic Chapel Way to Create Uneven Distribution


I'm working on porting a distributed memory Samplesort from MPI+C to Chapel, and I've been unable to find an idiomatic/clean way to return the sorted data in a single distributed array.

The C Samplesort is called from within each MPI process, and is passed a portion of the global data to be sorted. This was pretty simple to do in Chapel - I created a Block-distributed array, and the owner of each subdomain reads the appropriate data from disk into its portion of the array.
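
Roughly, that setup looks like the following sketch (simplified: the per-locale file read is replaced by a placeholder fill, and the Block spelling may differ slightly between Chapel versions, with newer releases using blockDist):

use BlockDist;

config const n = 1_000_000;

// Block-distribute the global index space; each locale owns a contiguous chunk
const Space = {1..n};
const D = Space dmapped Block(boundingBox=Space);
var Data: [D] real;

coforall loc in Locales do on loc {
  // each locale fills in only the indices it owns
  // (stand-in for reading this locale's portion of the data from disk)
  forall i in Data.localSubdomain() do
    Data[i] = i : real;
}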

The actual Samplesort can then run, with the data evenly distributed across the locales. The difficult part comes at the end of the sort: since the data are randomized, the Samplesort assigns an uneven (but roughly similar) range of elements to each locale. That is, if I'm sorting N records on P nodes, each node starts with N/P records, but after the shuffle some end up with slightly more than N/P records and others with slightly fewer.

In MPI+C, this isn't that hard to handle - I know how many elements each process will receive from all the other processes, so I can dynamically allocate enough memory on each node to receive all the records from the other processes. This is also easily doable in Chapel - each locale (with a bit of bookkeeping) can figure out how many records it's going to need to fetch from the other locales, so it can create a local array of the correct size. However, to make the API convenient and self-contained, I'd like to pass the Samplesort procedure the Block-distributed array and get back a reference to another Block-distributed array containing the sorted data.
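
The bookkeeping I have in mind is roughly the following sketch (hypothetical names; counts would be filled in during the partitioning phase of the sort, and is shown here as a single non-distributed array for simplicity):

// counts[i, j] = number of records locale i will send to locale j
var counts: [0..<numLocales, 0..<numLocales] int;

// ... partitioning phase: each locale fills in its row of counts ...

coforall loc in Locales do on loc {
  // total number of records this locale will receive from all the others
  const myRecvCount = + reduce counts[.., loc.id];

  // a correctly sized, locale-local buffer for the incoming records
  var myRecords: [0..<myRecvCount] real;

  // ... fetch the records into myRecords ...
}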

To that end, I've been unable to find a good way to create a distributed array that assigns space to each locale to store the uneven number of elements that they end up with at the end of the sort. Is there a built-in idiomatic way to do this in Chapel?

The best alternative I've come up with is to populate a distributed array of references, one per locale, each pointing to a locale-specific array of unequal size. However, this seems like a bit of a hack: you pass the Samplesort a single array and you get back an array of references to other arrays.
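
Concretely, that workaround looks something like the following sketch (hypothetical names, with a placeholder count standing in for the real post-sort bookkeeping):

// wrapper class so an array with one entry per locale can refer to
// that locale's unevenly sized chunk of the sorted data
class LocalChunk {
  var dom: domain(1);
  var data: [dom] real;

  proc init(size: int) {
    this.dom = {0..<size};
  }
}

// one reference per locale
var chunks: [LocaleSpace] owned LocalChunk?;

coforall loc in Locales do on loc {
  const myCount = 10 + loc.id;               // stand-in for the real count
  chunks[loc.id] = new LocalChunk(myCount);  // allocated on this locale
}

The sorted data then has to be accessed through chunks[i]!.data rather than as a single distributed array, which is what makes it feel clunky.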


Solution

  • [Sorry for the delayed response]

    We don't currently provide a distribution in Chapel that is Block-like, yet with an arbitrary number of elements per locale (rather than the n/p +/-1 approach that Block takes). We've had a user implement such a distribution in Chapel in the past (users can write their own distributions and layouts for arrays in Chapel); but unfortunately that code was not contributed back to the project. If you'd like to advocate for us creating and providing such a distribution, please feel encouraged to add it as a feature request on Chapel's GitHub issues page.

    If maintaining the imbalanced load after the sort is complete isn't important to your computation, one approach you might consider would be to simply return the result as a block-distributed array, leaning on Chapel's global address space to take care of the non-local accesses. E.g., if my locale is logically storing global elements [lo..#myLocElems] using a buffer of size [0..<myLocElems], I could use an assignment from the buffer to the appropriate slice of the global array to put them in the right place, regardless of whether they're all local or some (or all) are remote.

    Sketching out what that might look like:

    proc CountingSort(InputArr: [?D] ?t) {
      // an array for the result
      //   (note that InputArr could potentially be re-used instead,
      //   depending on the interface you want)
      var ResultArr: [D] t;
    
      coforall loc in Locales do on loc {
        var myLocElems: int;
    
        // ... start counting sort, determining `myLocElems`...
    
        var myBuff: [0..<myLocElems] t;  // a bucket for elements I receive
        const lo: int;                   // the global index of myBuff[0]
    
        // ... continue sort, filling myBuff and computing lo...
    
        // copy my elements to the appropriate place in the result array
        ResultArr[lo..#myLocElems] = myBuff; 
      }
      return ResultArr;
    }
    

    While this will induce some communication relative to using an "imbalanced block" distribution that could store all of myBuff into a distributed array locally, that overhead seems likely to be in the noise compared to the expense of doing the sort itself. And the benefits of using Block are that (a) subsequent operations on the array will be load-balanced and (b) the location of any element can be found using O(1) math.

    That's not to say that an "imbalanced block" distribution wouldn't be of interest in some cases. Note that it would likely require O(numLocales) memory per locale to store the partitioning of the global index set and O(lg(numLocales)) time to determine where any given element lives. But neither of those is likely to be a big concern on today's systems, so the main downside is simply that it's a distribution that hasn't been written (yet).
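
    To make that concrete, here's a rough sketch of the bookkeeping such an imbalanced distribution would need (hypothetical names; in a real implementation, starts would be replicated to every locale):

    // starts[i] = global index of the first element owned by locale i,
    // with a sentinel one past the end: O(numLocales) ints per locale
    var starts: [0..numLocales] int;

    // which locale owns global index idx?  an O(lg(numLocales)) binary
    // search for the largest i such that starts[i] <= idx
    proc ownerOf(idx: int): int {
      var lo = 0, hi = numLocales - 1;
      while lo < hi {
        const mid = (lo + hi + 1) / 2;
        if starts[mid] <= idx then
          lo = mid;
        else
          hi = mid - 1;
      }
      return lo;
    }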