Tags: python, ssh, mpi, mpi4py, archlinux-arm

Host key verification failed using mpi4py


I am building an MPI application using mpi4py (1.3.1) and openmpi (1.8.6-1) on Arch Linux ARM (on a Raspberry Pi cluster, to be more specific). I've run my program successfully on 3 nodes (4 processes), and when trying to add a new node, here's what happens:

Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).

The funny thing is, the SSH keys are fine, since I'm using the same nodes: I can remove any other entry from the host file, add the new node, and it will work. So I'm pretty sure the problem is not a misconfigured SSH setup; it only happens when I use 5 processes.

Could this be some sort of bug in the library?

Here's my host file:

192.168.1.26 slots=2
192.168.1.188 slots=1
#192.168.1.202 slots=1   # if uncommented and run with -np 5, the error above is raised
192.168.1.100 slots=1
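
For reference, the launch command is essentially the following (the hostfile name and script name are just placeholders for my actual files):

    mpirun --hostfile hostfile -np 5 python my_program.py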

Thanks in advance!


Solution

  • I was having the same problem on a Linux x86_64 mini cluster running Fedora 22 and OpenMPI 1.8. I could SSH into any of my 5 machines from my launch machine, but when I tried to launch MPI with 3 or more nodes, it would give me an authentication error. Like you, it seemed that 3 was a magic number, and it turns out that it is: OpenMPI uses a tree-based launch, so when you have more than two nodes, one or more of the intermediate nodes execute an ssh of their own. In my case, I was not using a password-less setup. I had an SSH identity on the launch machine that I had added to my key chain, so the first two nodes could be launched because that authenticated identity was in my key chain. Then each of those nodes tried to launch more, and those nodes did not have that key authenticated (I would have needed to add it on each of them).
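
    A quick way to check whether this chained login works is to hop through one node to another from the launch machine (the node names below are just examples). If the inner ssh prompts for a passphrase or fails, the tree-based launch will fail in the same way:

        ssh node0 ssh node1 hostname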

    So the solution appears to be moving to a password-less SSH identity setup, but you obviously have to be careful about how you do that. I created a dedicated identity (key pair) on my launch machine and added its key to the authorized keys on the nodes I want to use (which is easy since they all share NFS, but you could manually distribute the key once if you need to); rough commands for this are sketched after the config below. Then I modified my SSH config to use that password-less identity when connecting to my node machines. My ~/.ssh/config looks like:

    Host node0
        HostName node0
        IdentityFile ~/.ssh/passwordless_rsa
    Host node1
        HostName node1
        IdentityFile ~/.ssh/passwordless_rsa
    ...
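
    Generating such a key and distributing it by hand might look roughly like this (the user name and node address are placeholders; ssh-copy-id simply appends the public key to authorized_keys on the target node):

        ssh-keygen -t rsa -N "" -f ~/.ssh/passwordless_rsa
        ssh-copy-id -i ~/.ssh/passwordless_rsa.pub user@node0   # repeat for each node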
    

    I'm sure there is some way to scale this to N nodes with wildcards, or you could consider changing the default identity file at the system level in the system-wide ssh config file (I'd bet a similar option is available there).
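
    In fact, ssh_config accepts glob patterns in Host entries, so assuming the nodes share a naming prefix, a single stanza like this should cover them all:

        Host node*
            IdentityFile ~/.ssh/passwordless_rsa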

    And that did the trick. Now I can spin up all 5 nodes without any authentication issues. The flaw in my thinking was assuming the launch node would launch all the other nodes itself; the tree-based launch means you need to chain logins, which you cannot do with a passphrase-authenticated identity, since you never get the chance to authenticate it on the intermediate nodes.

    Having a password-less key still freaks me out, so to keep things extra safe on these nodes connected to an open network, I changed the sshd config (at the system level) to restrict logins so that no one except me, coming from my launch node, can log in.
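
    One way to express that restriction is an AllowUsers line in /etc/ssh/sshd_config on each node (the user name and host below are placeholders; given the tree-based launch, you may also need to allow connections coming from the other nodes):

        AllowUsers myuser@launchnode   # placeholder user and launch-machine host/IP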