I have created a 48-node cluster consisting of host0 to host47 (all nodes are g2.2xlarge Amazon EC2 instances with no NFS). Following https://ipyparallel.readthedocs.io/en/latest/process.html, I have set up a controller on host0 and 47 engines on host1 to host47. I have replicated most of the configuration for the ssh ipyparallel cluster from the StarCluster project (but, as said, without NFS).
The cluster works and seems to produce correct results, but loading modules sometimes takes a very long time. For instance, the following takes more than 30 minutes to finish:
import ipyparallel as ipp

client = ipp.Client('/path/to/ipcontroller-client.json', sshkey='mykey')
view = client[:]
view.block = True
with view.sync_imports():
    import time
    import numpy
    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    from keras.regularizers import l1
    from keras.optimizers import SGD
    from subprocess import check_output
This does not change if I switch to block=False and view.wait(). Using view.execute("import time; import numpy; import keras.models ...") does not help either. I know that importing the keras modules is somewhat slow, but on my local machine it usually finishes in less than a minute. I have tried both the pickle and json (un)packers.
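For completeness, this is roughly what the non-blocking variant mentioned above looks like; it is only a sketch, with the same placeholder path and key as in the snippet above, and the timing around wait() is just my own illustration of how I measure the delay:

# Sketch of the non-blocking variant (block=False plus wait()); the path and
# key name are placeholders, as in the question above.
import time
import ipyparallel as ipp

client = ipp.Client('/path/to/ipcontroller-client.json', sshkey='mykey')
view = client[:]

# execute() returns an AsyncResult when block=False; the imports run on the
# engines while the client is free to do other things.
ar = view.execute(
    "import time; import numpy; "
    "from keras.models import Sequential; "
    "from keras.layers import Dense, Dropout",
    block=False,
)

t0 = time.time()
ar.wait()  # block here until every engine has finished the imports
print("imports finished on all engines after %.1f s" % (time.time() - t0))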
I should mention that loading the modules works fine when I reuse the same cluster for another calculation, so I guess the loaded modules are cached somewhere. But when I terminate the instances, create new ones, and configure a new ipyparallel cluster, I run into the same module-loading issues.
Looking into the ipcontroller log, I can see that most of the requests corresponding to sync_imports, e.g.

2016-08-25 12:12:02.310 [IPControllerApp] queue::client '\x00"_\x0b\x0b' submitted request '46244cf0-ad0a-4748-a84c-8d3d69d8252c' to 0

finish within a few minutes. However, a few of them take about 30 minutes. See the following histogram of complete_time - submit_time, derived from the ipcontroller log.
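As a side note, similar complete-minus-submit latencies can also be collected on the client side instead of parsing the controller log. The sketch below is only illustrative: it assumes the 'submitted' and 'completed' datetime fields that ipyparallel 5.x records in the result metadata, reuses the placeholder connection file from above, and needs matplotlib for the plot:

# Illustrative sketch: derive per-request (complete - submit) latencies from
# the client-side result metadata rather than the ipcontroller log. Assumes
# the 'submitted'/'completed' metadata keys of ipyparallel 5.x (datetimes).
import ipyparallel as ipp
import matplotlib.pyplot as plt

client = ipp.Client('/path/to/ipcontroller-client.json', sshkey='mykey')
view = client[:]

ar = view.execute("import numpy; from keras.models import Sequential",
                  block=False)
ar.wait()

# One metadata dict per engine/message in the AsyncResult.
latencies = [(md['completed'] - md['submitted']).total_seconds()
             for md in ar.metadata]

plt.hist(latencies, bins=50)
plt.xlabel('complete_time - submit_time [s]')
plt.ylabel('number of requests')
plt.show()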
I have only recently started using Python and have no idea what the problem could be here. It seems that the maximum difference between complete and submit time increases with the cluster size. Any pointers to possible issues are highly welcome.
BTW: I am using Python 2.7.6 and ipyparallel 5.1.1.
My best guess for now is that the problem was caused by the initialization of EBS volumes, which can sometimes be slow. The cluster instances were always started from images and terminated right after the computations were done, and EBS volumes created from snapshots have to fetch their data from S3 on first access. See the AWS EBS documentation:
New EBS volumes receive their maximum performance the moment that they are available and do not require initialization (formerly known as pre-warming). However, storage blocks on volumes that were restored from snapshots must be initialized (pulled down from Amazon S3 and written to the volume) before you can access the block. This preliminary action takes time and can cause a significant increase in the latency of an I/O operation the first time each block is accessed. For most applications, amortizing this cost over the lifetime of the volume is acceptable. Performance is restored after the data is accessed once.
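If that is indeed the cause, one workaround is to initialize (pre-warm) the snapshot-backed volume on each engine before importing anything, by reading every block once. The sketch below is only a rough illustration: the device name /dev/xvda is an assumption and will differ depending on the instance and volume setup, it assumes passwordless sudo on the engines, and it simply reuses check_output and the view from the snippets above:

# Rough illustration only: read every block of the snapshot-backed volume once
# so the data is pulled down from S3 before the real imports start.
def prewarm_volume(device='/dev/xvda'):
    # The device name is an assumption; adjust it to your instance setup.
    # Assumes passwordless sudo on the engines.
    from subprocess import check_output
    # dd reads the whole device sequentially and discards the output.
    return check_output(['sudo', 'dd', 'if=' + device, 'of=/dev/null', 'bs=1M'])

# Run it once on all engines in parallel; this takes a while, but only once
# per volume.
view.apply_sync(prewarm_volume)

The AWS documentation also mentions fio as an alternative to dd for initializing volumes, which can read blocks in parallel and may finish faster.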