ipythonzeromqpyzmqipython-parallel

ipyparallel displaying "registration: purging stalled registration"


I am trying to use the ipyparallel library to run an ipcontroller and ipengine on different machines.

My setup is as follows:

Remote machine: Windows Server 2012 R2 x64, running an ipcontroller, listening on port 5900 and ip=0.0.0.0.

Local machine: Windows 10 x64, running an ipengine, listening on the remote machine's ip and port 5900.

Controller start command: ipcontroller --ip=0.0.0.0 --port=5900 --reuse --log-to-file=True

Engine start command: ipengine --file=/c/Users/User/ipcontroller-engine.json --timeout=10 --log-to-file=True

I've changed the interface field in ipcontroller-engine.json from "tcp://127.0.0.1" to "tcp://" for ipengine.

On startup, here is a snapshot of the ipcontroller log:

2016-10-10 01:14:00.651 [IPControllerApp] Hub listening on tcp://0.0.0.0:5900 for registration. 2016-10-10 01:14:00.677 [IPControllerApp] Hub using DB backend: 'DictDB' 2016-10-10 01:14:00.956 [IPControllerApp] hub::created hub 2016-10-10 01:14:00.957 [IPControllerApp] task::using Python leastload Task scheduler 2016-10-10 01:14:00.959 [IPControllerApp] Heartmonitor started 2016-10-10 01:14:00.967 [IPControllerApp] Creating pid file: C:\Users\Administrator\.ipython\profile_default\pid\ipcontroller.pid 2016-10-10 01:14:02.102 [IPControllerApp] client::client b'\x00\x80\x00\x00)' requested 'connection_request' 2016-10-10 01:14:02.102 [IPControllerApp] client::client [b'\x00\x80\x00\x00)'] connected 2016-10-10 01:14:47.895 [IPControllerApp] client::client b'82f5efed-52eb-46f2-8c92-e713aee8a363' requested 'registration_request' 2016-10-10 01:15:05.437 [IPControllerApp] client::client b'efe6919d-98ac-4544-a6b8-9d748f28697d' requested 'registration_request' 2016-10-10 01:15:17.899 [IPControllerApp] registration::purging stalled registration: 1

And the ipengine log:

2016-10-10 13:44:21.037 [IPEngineApp] Registering with controller at tcp://172.17.3.14:5900 2016-10-10 13:44:21.508 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms. 2016-10-10 13:44:21.522 [IPEngineApp] Completed registration with id 1 2016-10-10 13:44:27.529 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row). 2016-10-10 13:44:30.539 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (2 time(s) in a row). ... 2016-10-10 13:46:52.009 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (49 time(s) in a row). 2016-10-10 13:46:55.028 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (50 time(s) in a row). 2016-10-10 13:46:55.028 [IPEngineApp] CRITICAL | Maximum number of heartbeats misses reached (50 times 3010 ms), shutting down.

(There is a 12.5 hour time difference between the local machine and the remote VM)

Any idea why this may happen?


Solution

  • If you are using --reuse, make sure to remove the files if you change settings. It's possible that it doesn't behave well when --reuse is given and you change things like --ip, as the connection file may be overriding your command-line arguments.

    When setting --ip=0.0.0.0, it may be useful to also set --location=a.b.c.d where a.b.c.d is an ip address of the controller that you know is accessible to the engines. Changing the

    If registration works and subsequent connections don't, this may be due to a firewall only opening one port, e.g. 5900. The machine running the controller needs to have all the ports listed in the connection file open. You can specify these to be a port-range by manually entering port numbers in the connection files.