chapel

Distributed Array Access Communication Cost


I'm finishing up an implementation of a sort of "Terasort Lite" program in Chapel, based on a distributed bucket sort, and I'm noticing what seem to be significant performance bottlenecks around access to a block-distributed array. A rough benchmark shows Chapel takes ~7 seconds to do the whole sort of 3.5 MB with 5 locales, while the original MPI+C program does it in around 8.2 ms with 5 processes. My local machine has 16 cores, so I don't need to oversubscribe MPI to get 5 processes running.

The data to be sorted are loaded into a block-distributed array across the locales so that each locale has an even (and contiguous) share of the unsorted records. In an MPI+C bucket sort, each process would have its records in memory and sort those local records. To mirror that, I've written a locale-aware implementation of qsort (based on the C stdlib implementation), and this is where I see the extreme performance bottlenecks. The overall bucket sort procedure takes a reference to a block-distributed array, and qsort is called on the local slice, qsort(records[records.localSubdomain()]), from within a coforall loop's on loc do block.
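
For concreteness, here is a stripped-down sketch of that structure (simplified, not my actual code: the data loading is elided, blockDist.createDomain assumes a recent Chapel release, and the qsort body below is just a stand-in for my locale-aware quicksort):

    use BlockDist, Sort;

    config const n = 1000000;
    const D = blockDist.createDomain({0..<n});  // even, contiguous block per locale
    var records: [D] int;                       // unsorted records (loading elided)

    coforall loc in Locales do on loc {
      // each locale sorts only the slice of the array that it owns
      qsort(records[records.localSubdomain()]);
    }

    // stand-in for the locale-aware qsort: copy the slice into a local buffer,
    // sort it with the standard library, and copy it back
    proc qsort(ref A: [] int) {
      var tmp: [0..<A.size] int;
      for (t, a) in zip(tmp, A) do t = a;
      sort(tmp);
      for (a, t) in zip(A, tmp) do a = t;
    }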

My main question is how Chapel maintains coherence on distributed arrays, and whether any type of coherence actions across locales are what's obliterating my performance. I've checked, and each locale's qsort call is only ever accessing array indices within its local subdomain; I would expect that this means that no communication is required, since each locale accesses only the portion of the domain that it owns. Is this a reasonable expectation, or is this simultaneous access to private portions of a distributed array causing communication overhead?

For context, I am running locally on one physical machine using the GASNet UDP communication substrate, and the documentation notes that this is not expected to give good performance. Nonetheless, I do plan to move the code to a cluster with InfiniBand, so I'd still like to know whether I should approach the problem a different way to get better performance. Please let me know if there's any other information that would help answer this question. Thank you!


Solution

  • Thanks for your question. I can answer some of your questions here.

    First, I'd like to point out that there are some other distributed sort implementations in Chapel that would be worth looking at for comparison.

    Generally I would expect a radix sort to outperform a quicksort for the local sorting problems unless they are very small.
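
    As one easy data point, the sort routine in Chapel's standard Sort module makes a convenient baseline for the local sorts (for integer keys it uses a radix-based algorithm in recent releases, though it's worth checking the Sort docs for your version). A minimal example on a plain local array:

      use Sort;

      var a = [3, 1, 4, 1, 5, 9, 2, 6];
      sort(a);       // in-place sort from the standard Sort module
      writeln(a);    // 1 1 2 3 4 5 6 9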

    Now to your questions:

    "My main question is how Chapel maintains coherence on distributed arrays, and whether any type of coherence actions across locales are what's obliterating my performance."

    There is a cache for remote data that is on by default. It can be disabled with --no-cache-remote when compiling, but I suspect it is not the problem here. In particular, it only performs coherence activities at certain memory fences (which include on statements, task ends, and uses of sync/atomic variables). Still, you can turn it off and see if that changes things.

    Distributed arrays and domains currently use an eager privatization strategy. That means that once they are created, some elements of the data structure are replicated across all locales. Since this involves all locales, it can cause performance problems when running multilocale.

    You can check for communication within your kernel with the CommDiagnostics module or with a local block. The CommDiagnostics module lets you count or trace communication events, while a local block will halt your program if communication is attempted within it.
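
    For example, here is a rough sketch of both approaches together; the array and the loop body are simplified stand-ins for your records array and qsort call:

      use BlockDist, CommDiagnostics;

      const D = blockDist.createDomain({0..<1000});
      var records: [D] int;

      startCommDiagnostics();                   // start counting comm events
      coforall loc in Locales do on loc {
        const myInds = records.localSubdomain();
        local {                                 // halts if communication is attempted
          for i in myInds do records[i] += 1;   // stand-in for the local sort
        }
      }
      stopCommDiagnostics();
      printCommDiagnosticsTable();              // per-locale GET/PUT/on counts

    If the table shows nonzero GET/PUT counts around the kernel, that tells you roughly how much communication is happening; if the local block halts, the error indicates that some access inside the block needed communication.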

    Another possibility is that the compiler is not generating communication, but the code runs slower because the compiler has trouble optimizing when the data might be remote. The indicator that this is the problem would be that the performance you get when compiling with CHPL_COMM=none is significantly better than when running with 1 locale over GASNet with UDP. (You could alternatively compare using the --local and --no-local flags.) Some ways to potentially help with that:

    Slicing has more overhead than we would like; see, e.g., https://github.com/chapel-lang/chapel/issues/13317. As I said, there might also be privatization costs (I don't remember the current situation of slicing and privatization off-hand). In my experience, for local sorting code, you are better off passing start and end arguments as ints, or maybe a range; but slicing to get the local part of the array is certainly more important in the distributed setting.
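
    For example, here is a sketch of what that signature change could look like; localSort is a placeholder name, and its body is just an insertion sort standing in for your quicksort:

      use BlockDist;

      // placeholder for the locale-aware quicksort: sorts A over the index
      // range r in place, without ever creating a slice view
      proc localSort(ref A: [] int, r: range) {
        for i in r.low+1..r.high {
          const x = A[i];
          var j = i - 1;
          while j >= r.low && A[j] > x {
            A[j+1] = A[j];
            j -= 1;
          }
          A[j+1] = x;
        }
      }

      const D = blockDist.createDomain({0..<32});
      var records: [D] int;
      forall i in D do records[i] = 32 - i;     // fill with some unsorted data

      coforall loc in Locales do on loc do
        localSort(records, records.localSubdomain().dim(0));

      writeln(records);   // each locale's block is now sorted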

    Lastly, you mentioned that you're running multiple locales on a single machine, which is an oversubscribed configuration from Chapel's point of view. If you haven't already seen it, check out the documentation about oversubscription, which recommends a setting for that case.