linux amazon-web-services chapel

AWS multilocale installation seems incorrect


I set up a 12-instance EC2 "cluster" on AWS running CentOS 7. The nodes share a common NFS file system and have separate boot volumes, where their home directories reside. I installed Chapel on the NFS file system for multilocale use, and I can share the installation steps if helpful. gmake seems to run without errors, but gmake check does not give error-free output. Also, I can't run multilocale examples if the Chapel program is on a local file system. Is that correct?

02/03/2021: My bad. I just noticed this in the documentation: "and copy the compiled binary onto all of the EC2 instances, under the same path." But I am still trying to figure out why gmake check fails.

Once installed, if I set GASNET_SPAWNFN=L, gmake check gives the following output:

.
.
.
Hello, world! (from locale 0 of 4)
Hello, world! (from locale 3 of 4)
Hello, world! (from locale 2 of 4)
Hello, world! (from locale 1 of 4)
GASNET: Exiting after AMUDP_SPMDExit(0)...
gmake: *** [Makefile:209: check] Error 20

If I set GASNET_SPAWNFN=S, gmake check never returns. gmake check --debug=v gives the following output:

.
.
.
updating makefiles....
Updating goal targets....
Considering target file 'all'.
 File 'all' does not exist.
  Considering target file 'FORCE'.
   File 'FORCE' does not exist.
   Finished prerequisites of target file 'FORCE'.
  Must remake target 'FORCE'.
  Successfully remade target file 'FORCE'.
 Finished prerequisites of target file 'all'.
Must remake target 'all'.
Successfully remade target file 'all'.
    Successfully remade target file '/tmp/chpl-centos-11886.deleteme/hello6-taskpar-dist.tmp'.
   Finished prerequisites of target file 'all'.
  Must remake target 'all'.
  Successfully remade target file 'all'.
 Finished prerequisites of target file 'default'.
Must remake target 'default'.
Successfully remade target file 'default'.
gmake: *** [Makefile:209: check] Error 1

If I compile examples/hello6-taskpar-dist.chpl on the shared NFS file system (chpl -o hello6 $CHPL_HOME/examples/hello6-taskpar-dist.chpl) and run ./hello6 -nl 12, it returns:

.
.
.
Hello, world! (from locale 0 of 12 named ip-10-xxx-yy-311.evoforge.org)
Hello, world! (from locale 7 of 12 named ip-10-xxx-yy-348.evoforge.org)
Hello, world! (from locale 10 of 12 named ip-10-xxx-yy-362.evoforge.org)
Hello, world! (from locale 9 of 12 named ip-10-xxx-yy-322.evoforge.org)
Hello, world! (from locale 8 of 12 named ip-10-xxx-yy-316.evoforge.org)
Hello, world! (from locale 6 of 12 named ip-10-xxx-yy-331.evoforge.org)
Hello, world! (from locale 1 of 12 named ip-10-xxx-yy-335.evoforge.org)
Hello, world! (from locale 3 of 12 named ip-10-xxx-yy-353.evoforge.org)
Hello, world! (from locale 5 of 12 named ip-10-xxx-yy-317.evoforge.org)
Hello, world! (from locale 11 of 12 named ip-10-xxx-yy-358.evoforge.org)
Hello, world! (from locale 4 of 12 named ip-10-xxx-yy-364.evoforge.org)
Hello, world! (from locale 2 of 12 named ip-10-xxx-yy-344.evoforge.org)
GASNET: Exiting after AMUDP_SPMDExit(0)...

(ip numbers obfuscated by me)

This seems correct.

But if I compile it in a local home directory and run ./hello6 -nl 12 -v there, it hangs here:

.
.
.
bash: line 0: cd: /home/centos/chapel: No such file or directory
env: /home/centos/chapel/hello6_real: No such file or directory
env: /home/centos/chapel/hello6_real: No such file or directory
GASNET: slave connecting to 10.xxx.yy.311:43233
ENV parameter: GASNET_LINEBUFFERSZ = 1024                       (default)
GASNET: slave using IP 10.xxx.yy.311

Here is the output from printchplenv:

$ util/printchplenv
machine info: Linux ip-10-xxx-xx-311.evoforge.org 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020 x86_64
CHPL_HOME: /nfs/software/chapel-1.23.0 *
script location: /nfs/software/chapel-1.23.0/util/chplenv
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: unknown
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet *
  CHPL_COMM_SUBSTRATE: udp
  CHPL_GASNET_SEGMENT: everything
CHPL_TASKS: qthreads
CHPL_LAUNCHER: amudprun
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: gmp
CHPL_HWLOC: hwloc
CHPL_REGEXP: re2
CHPL_LLVM: none
CHPL_AUX_FILESYS: none

Please let me know if I can provide any additional information. Thanks for the help.

02/03/2021 Here is some additional information. This is the final build procedure I followed.

Installed the listed prerequisites for CentOS 7; the versions are listed below. Did not install llvm, as the documentation did not list it as a prerequisite.

chapel - 1.23.0
gcc - 8.3.1 20190311
m4 - (GNU M4) 1.4.16
perl - v5.16.3
python - 2.7.5
bash - GNU bash, version 4.2.46(2)
gmake - GNU Make 4.2.1
gawk - GNU Awk 4.0.2

Added the following lines to the .bashrc on each node (IP addresses are obfuscated):

export CHPL_HOME=/nfs/software/chapel-1.23.0
source /nfs/software/chapel-1.23.0/util/setchplenv.bash
export CHPL_COMM=gasnet
export GASNET_SPAWNFN=S
export GASNET_SSH_SERVERS="10.xxx.yy.311 10.xxx.yy.316 10.xxx.yy.317 10.xxx.yy.322 10.xxx.yy.331 10.xxx.yy.335 10.xxx.yy.344 10.xxx.yy.348 10.xxx.yy.353 10.xxx.yy.358 10.xxx.yy.362 10.xxx.yy.364"

Enabled password-less ssh between nodes by generating an RSA key pair on each node (ssh-keygen -t rsa), then copied the public keys generated on all nodes into the .ssh/authorized_keys file on every node.
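
The key exchange can be sketched roughly as follows. This is not the exact procedure used here (the keys were copied by hand). It assumes each host in GASNET_SSH_SERVERS can already be logged in to by some other means, which ssh-copy-id needs in order to install the key, and that a key without a passphrase is acceptable. The helper name setup_ssh_keys is hypothetical.

```shell
# Hypothetical helper; run once on each node.
setup_ssh_keys() {
  # Generate an RSA key pair with an empty passphrase unless one exists.
  [ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"
  # Append this node's public key to ~/.ssh/authorized_keys on every node.
  for host in $GASNET_SSH_SERVERS; do
    ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "$host" || return 1
  done
}
```

Once this has run on all 12 nodes, every node's authorized_keys file holds every node's public key, which matches the manual setup described above.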

Then built it. The first time it complained that I should use gmake instead of make.

cd chapel-1.23.0
gmake
gmake check

Again, thanks in advance for your help.


Solution

  • By default, make check assumes $HOME is shared across nodes, which is the case on most HPC systems. During the test, it creates a temporary directory as a destination for the compiled test program: $HOME/.chpl. Because your home directory is not NFS-mounted, the other nodes cannot find the compiled test program there, and make check fails.

    You can override the temporary directory used for make check by setting CHPL_CHECK_INSTALL_DIR. If you point that environment variable to an NFS-mounted path, make check should work.
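
    A minimal sketch of that override, assuming a shell where CHPL_HOME is already set as in the question. The helper name run_check_on_shared_dir and the example directory are hypothetical, and creating the directory up front is a precaution, not a documented requirement.

```shell
# Hypothetical helper: run the install check with its temporary directory
# on a path that every node can see.
#   $1 = directory on the shared (NFS-mounted) file system
run_check_on_shared_dir() {
  check_dir="$1"
  mkdir -p "$check_dir" || return 1
  # Tell make check where to place the compiled test program.
  export CHPL_CHECK_INSTALL_DIR="$check_dir"
  # Run in a subshell so the caller's working directory is unchanged.
  ( cd "$CHPL_HOME" && gmake check )
}
```

    For the setup in the question this might be invoked as run_check_on_shared_dir /nfs/software/chpl-check-tmp, where the subdirectory name is only an example.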