chapel

Increasing Transaction Size of Communication Operations


I'm (still) working on writing a reference distributed sort based on an old C+MPI Samplesort. The largest source of overhead is now the all-to-all communication step that happens in the middle of the sort, when each locale needs to fetch the data for which it is responsible from the other locales. The relevant line of code for this operation is:

local_block.records[local_start_idx..local_end_idx] = records[remote_start_idx..remote_end_idx];

where local_block.records is a private array declared within the on clause in each locale to store its data, and records is a block-distributed array of all of the input records. In the case of the test I'm performing now, each locale starts with 25,000 records of length 100B. The record type contains a 10B key (defined as 10 uint(8)s) and 90B of data (defined as 90 uint(8)s). Locale 0's comm diagnostics are:

(get = 338915, put = 32, execute_on = 120, execute_on_fast = 20, execute_on_nb = 24)

where the comm diagnostics are reset and started immediately before the above line of code, and stopped and printed immediately after. This seems like an extraordinarily high number of get operations for what amounts to transferring approximately 1.8MB of data between nodes. The MPI runtime has more flexibility here: it can buffer larger segments of data and needs significantly fewer than 338,915 operations to retrieve all of the data from the other nodes. A back-of-the-napkin estimate suggests that each get operation Chapel performs transmits only ~6B of data on average.
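
For concreteness, the record type looks roughly like this (a simplified sketch; the field names here are illustrative rather than the exact ones from my code):

record Record {
  var key:  [0..<10] uint(8);  // 10-byte key
  var data: [0..<90] uint(8);  // 90 bytes of payload
}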

Is there a way in Chapel to get a more sensible amount of data to be passed back and forth? I had hoped that sticking to higher-level slices like this would allow the compiler to optimize the communication of contiguous array accesses, but this doesn't seem to be the case.


Solution

  • Offline, we have conjectured that the primary reason so many communications were generated for this case is the structure of the arrays: specifically, the use of arrays of records that themselves contain array fields. To illustrate this, consider the following examples.

    In this first example, we declare an array of real floating-point values, migrate the original task onto a distinct locale, and then copy the array into a new, local array. We wrap this copy in CommDiagnostics calls to measure the number of communications required. Because the array's elements are simple values stored in a flat, contiguous buffer, our expectation is that they can be communicated in bulk.

    use CommDiagnostics;
    
    config const n = 1000;
    
    var A: [1..n] real;
    
    on Locales[numLocales-1] {
      startCommDiagnostics();
      var B: [1..n] real;
    
      B = A;
      stopCommDiagnostics();
      printCommDiagnosticsTable();
    }
    

    Running this example on two locales and at a few problem sizes, we see that the number of communications required is independent of the array size:

    $ chpl --fast testit.chpl
    $ ./testit -nl 2 --n=1000
    | locale | get | get_nb | cache_get_hits | cache_get_misses |
    | -----: | --: | -----: | -------------: | ---------------: |
    |      0 |   0 |      0 |              0 |                0 |
    |      1 |   1 |      3 |              3 |                3 |
    
    $ ./testit -nl 2 --n=1000000
    | locale | get | get_nb | cache_get_hits | cache_get_misses |
    | -----: | --: | -----: | -------------: | ---------------: |
    |      0 |   0 |      0 |              0 |                0 |
    |      1 |   1 |      3 |              3 |                3 |
    

    However, if we change the arrays' elements from a simple type to a more complex one, like a record with array fields, then more communication is required. The reason for this is that arrays in Chapel are implemented using a heap-allocated class for metadata pointing to a heap-allocated buffer to store the array elements. As a result, the data for arrays A and B are no longer stored contiguously in memory, and can't be trivially transferred in-place using a single communication as in the previous case.

    use CommDiagnostics;
    
    config const n = 1000;
    
    record R {
      var A: [1..3] real;
    }
    
    var A: [1..n] R;
    
    on Locales[numLocales-1] {
      startCommDiagnostics();
      var B: [1..n] R;
    
      B = A;
      stopCommDiagnostics();
      printCommDiagnosticsTable();
    }
    

    As a result, the amount of communication grows proportionally to the array sizes:

    $ ./testit1a -nl 2 --n=1000
    | locale | get | get_nb | cache_get_hits | cache_get_misses |
    | -----: | --: | -----: | -------------: | ---------------: |
    |      0 |   0 |      0 |              0 |                0 |
    |      1 |   1 |   4003 |           1001 |             4003 |
    
    $ ./testit1a -nl 2 --n=1000000
    | locale | get |  get_nb | cache_get_hits | cache_get_misses |
    | -----: | --: | ------: | -------------: | ---------------: |
    |      0 |   0 |       0 |              0 |                0 |
    |      1 |   1 | 4000003 |        1000001 |          4000003 |
    

    However, if we make sure to use fields whose data types are stored in-place rather than on the heap, such as tuples, then the remote transfer becomes independent of the array size again:

    use CommDiagnostics;
    
    config const n = 1000;
    
    record R {
      var A: 3*real;
    }
    
    var A: [1..n] R;
    
    on Locales[numLocales-1] {
      startCommDiagnostics();
      var B: [1..n] R;
    
      B = A;
      stopCommDiagnostics();
      printCommDiagnosticsTable();
    }
    
    $ ./testit1b -nl 2 --n=1000
    | locale | get | get_nb | cache_get_hits | cache_get_misses |
    | -----: | --: | -----: | -------------: | ---------------: |
    |      0 |   0 |      0 |              0 |                0 |
    |      1 |   1 |      3 |              1 |                3 |
    
    $ ./testit1b -nl 2 --n=1000000
    | locale | get | get_nb | cache_get_hits | cache_get_misses |
    | -----: | --: | -----: | -------------: | ---------------: |
    |      0 |   0 |      0 |              0 |                0 |
    |      1 |   1 |      3 |              1 |                3 |
    

    Another example of an in-place data structure in Chapel is the c_array type, which declares a statically-sized, C-style array stored in place.
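
    For instance, a variant of the previous example that uses a c_array field would look like the following (a sketch along the same lines, not measured here):

    use CommDiagnostics;
    use CTypes;

    config const n = 1000;

    record R {
      var A: c_array(real, 3);  // statically-sized, in-place C-style array
    }

    var A: [1..n] R;

    on Locales[numLocales-1] {
      startCommDiagnostics();
      var B: [1..n] R;

      B = A;
      stopCommDiagnostics();
      printCommDiagnosticsTable();
    }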

    These results are arguably just an indication of Chapel's implementation status today, as of version 1.27.0. With more work, the implementation could create flat buffers for arrays with heap-allocated elements, as in the second example, and use fewer communications to transfer them between locales. Alternatively, arrays whose sizes are statically known, as in the second example's fields, could be allocated in-place rather than on the heap, enabling the bulk transfers seen in the first and third examples.

    For users who want to use Chapel array fields today, some options for exerting more control over the amount of communication required are: copy the contents of the heap-based data structures into a flat array, as in the first and third examples, and use an array copy to do the communication; or use the MPI or Communication modules to express explicit communication between array buffers and locales, as in a traditional C+MPI programming model.
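
    As a rough sketch of the first option, applied to a record like the one described in the question (untested, with illustrative field names, and using a non-distributed array to keep the example small), the packing approach might look something like this:

    use CommDiagnostics;

    config const n = 1000;
    param recSize = 100;           // 10-byte key + 90 bytes of data, as in the question

    record Rec {
      var key:  [0..<10] uint(8);  // heap-backed array fields, like the original record type
      var data: [0..<90] uint(8);
    }

    var records: [0..<n] Rec;

    // pack the records into a flat byte buffer on the locale that owns them
    var buf: [0..<n*recSize] uint(8);
    forall i in records.domain {
      buf[i*recSize..#10] = records[i].key;
      buf[i*recSize+10..#90] = records[i].data;
    }

    on Locales[numLocales-1] {
      startCommDiagnostics();

      // the flat buffer has simple, contiguously-stored elements, so this copy
      // can be communicated in bulk, as in the first example
      var localBuf: [0..<n*recSize] uint(8);
      localBuf = buf;

      stopCommDiagnostics();
      printCommDiagnosticsTable();

      // unpack the buffer into local records
      var localRecords: [0..<n] Rec;
      forall i in localRecords.domain {
        localRecords[i].key = localBuf[i*recSize..#10];
        localRecords[i].data = localBuf[i*recSize+10..#90];
      }
    }

    The pack and unpack loops add some local work on each side, but the inter-locale transfer itself becomes a single copy of a flat array rather than many fine-grained gets of individual record fields.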