python, amazon-ec2, theano, keras, ipython-parallel

Ipyparallel module load on a cluster extremely slow


I have created a 48-node cluster consisting of host0 to host47 (all nodes are g2.2xlarge Amazon EC2 instances, with no NFS). Following https://ipyparallel.readthedocs.io/en/latest/process.html, I have set up a controller on host0 and 47 engines on host1 to host47. I have replicated most of the configuration for the SSH ipyparallel cluster from the StarCluster project (but, as mentioned, without NFS). The cluster works and seems to produce correct results, but loading modules sometimes takes a very long time. For instance,

import ipyparallel as ipp
client = ipp.Client('/path/to/ipcontroller-client.json', sshkey='mykey')
view = client[:]
view.block = True

with view.sync_imports():
    import time
    import numpy
    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    from keras.regularizers import l1
    from keras.optimizers import SGD
    from subprocess import check_output

takes more than 30 minutes to finish. Switching to block=False followed by view.wait() does not change this, and neither does view.execute("import time; import numpy; import keras.models ..."). I know that importing the keras modules is somewhat slow, but on my local machine it usually finishes in less than a minute. I have tried both pickle and JSON (un)packing. I should mention that loading the modules works fine when I reuse the same cluster for another calculation, so I guess the loaded modules are cached somewhere. But when I terminate the instances, create new ones, and configure a new ipyparallel cluster, I see the same module-loading issues.
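To narrow down whether all engines are slow or only a few of them, the import time can be measured per engine, for example along these lines (using the client and view objects from above; the function and variable names are only illustrative, not part of the ipyparallel API):

def timed_keras_import():
    # mirror the imports from sync_imports above and report how long they take
    import time
    start = time.time()
    import numpy
    from keras.models import Sequential
    return time.time() - start

# apply_async runs the function once on every engine in the view
async_result = view.apply_async(timed_keras_import)
durations = async_result.get()

# results come back in engine-id order for a view over all engines
for engine_id, seconds in zip(client.ids, durations):
    print("engine %d: %.1f s" % (engine_id, seconds))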

Looking into the ipcontroller log, I can see that most of the requests corresponding to sync_imports

2016-08-25 12:12:02.310 [IPControllerApp] queue::client '\x00"_\x0b\x0b'
submitted request '46244cf0-ad0a-4748-a84c-8d3d69d8252c' to 0

get finished within a few minutes. However, a few of them take about 30 minutes. See the following histogram of complete_time - submit_time, as derived from the ipcontroller log.

[Histogram of complete_time - submit_time per request, derived from the ipcontroller log]
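Roughly, such a histogram can be derived from the log along these lines; the log file name, the assumption that each entry fits on one line, and the exact timestamp format are assumptions to adapt to your setup. The sketch approximates complete_time - submit_time by the span between the first and the last entry that mentions a given request id.

import re
from datetime import datetime

import matplotlib.pyplot as plt

TS_FORMAT = '%Y-%m-%d %H:%M:%S.%f'
# timestamp at the start of the entry plus a quoted request UUID somewhere in it
ENTRY_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*'([0-9a-f-]{36})'")

first_seen, last_seen = {}, {}
with open('ipcontroller.log') as log:  # path to your controller log
    for line in log:
        match = ENTRY_RE.match(line)
        if not match:
            continue
        timestamp = datetime.strptime(match.group(1), TS_FORMAT)
        request_id = match.group(2)
        first_seen.setdefault(request_id, timestamp)
        last_seen[request_id] = timestamp

durations = [(last_seen[rid] - first_seen[rid]).total_seconds()
             for rid in first_seen]

plt.hist(durations, bins=50)
plt.xlabel('complete_time - submit_time [s]')
plt.ylabel('number of requests')
plt.show()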

I have just recently started to use Python and I have no idea what the problem could be here. It seems that the maximum time difference between complete and submit time increases with the cluster size. Any pointers to possible issues are highly welcome.

BTW: I am using Python 2.7.6 and ipyparallel 5.1.1.


Solution

  • My best guess for now is that the problem was caused by the initialization of EBS volumes, which can sometimes be quite slow. The cluster instances were always started from images and terminated right after the computations were done, and EBS volumes created from snapshots have to fetch their data from S3. From the AWS EBS documentation:

    New EBS volumes receive their maximum performance the moment that they are available and do not require initialization (formerly known as pre-warming). However, storage blocks on volumes that were restored from snapshots must be initialized (pulled down from Amazon S3 and written to the volume) before you can access the block. This preliminary action takes time and can cause a significant increase in the latency of an I/O operation the first time each block is accessed. For most applications, amortizing this cost over the lifetime of the volume is acceptable. Performance is restored after the data is accessed once.
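
    If the slow first access is indeed the cause, the documented fix is to initialize the volume, i.e. read every block once before running the actual computation (AWS suggests dd or fio for this). A minimal pure-Python equivalent, assuming the restored volume is attached as /dev/xvdf and the script is allowed to read the raw device:

    CHUNK_SIZE = 1024 * 1024  # read the device 1 MiB at a time

    # Sequentially read every block once so that all blocks are pulled down
    # from S3 before the real workload starts.
    with open('/dev/xvdf', 'rb') as device:
        while True:
            if not device.read(CHUNK_SIZE):
                break

    On a cluster like the one above, this would have to be run on every instance (or the instances would need to be created from fully initialized volumes) before starting the engines.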