amazon-ec2parallel-processingcluster-computinghpcray

Workers not being launched on EC2 by Ray


I am using the Ray module to launch an Ubuntu (16.04) cluster on AWS EC2. In the configuration I specified min_workers, max_workers and initial_workers as 2, because I do not need any auto-sizing. I also want a t2.micro master-node and c4.8xlarge workers. The cluster launches, but only the master (the following terminal output is from the ray installation onwards, .... minus details):-

2019-04-18 14:52:48,462 INFO updater.py:268 -- NodeUpdater: Running pip3 install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl on 54.226.178.23...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Collecting ray==0.7.0.dev2 from https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
Downloading https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl (56.2MB)
.....
.....
Successfully built pyyaml
Installing collected packages: click, colorama, six, redis, typing, filelock, flatbuffers, numpy, pyyaml, more-itertools, setuptools, attrs, atomicwrites, pluggy, py, pathlib2, pytest, funcsigs, ray
Successfully installed atomicwrites attrs click colorama filelock flatbuffers funcsigs more-itertools numpy pathlib2 pluggy py pytest pyyaml-3.11 ray redis setuptools-20.7.0 six-1.10.0 typing
You are using pip version 8.1.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
2019-04-18 14:53:32,656 INFO updater.py:268 -- NodeUpdater: Running pip3    install boto3==1.4.8 on 54.226.178.23...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Collecting boto3==1.4.8
Downloading https://files.pythonhosted.org/packages/7d/09/66fef826fb13a2cee74a1df56c269d2794a90ece49c3b77113b733e4b91d/boto3-1.4.8-
....
....
Installing collected packages: docutils, jmespath, six, python-dateutil, botocore, s3transfer, boto3
Successfully installed boto3-1.4.8 botocore-1.8.50 docutils-0.14 jmespath-0.9.4 python-dateutil-2.8.0 s3transfer-0.1.13 six-1.12.0
You are using pip version 8.1.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
2019-04-18 14:53:37,805 INFO updater.py:268 -- NodeUpdater: Running ray stop on 54.226.178.23...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
WARNING: Not monitoring node memory since `psutil` is not installed.  Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
2019-04-18 14:53:39,775 INFO updater.py:268 -- NodeUpdater: Running ulimit -n 65536; ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml on 54.226.178.23...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-04-18 18:53:40,167 INFO scripts.py:288 -- Using IP address 172.31.7.117 for this node.
2019-04-18 18:53:40,167 INFO node.py:469 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-18_18-53-40_7981/logs.
2019-04-18 18:53:40,271 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:6379 to respond...
2019-04-18 18:53:40,389 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:60491 to respond...
2019-04-18 18:53:40,390 INFO services.py:804 -- Starting Redis shard with 0.21 GB max memory.
2019-04-18 18:53:40,400 INFO node.py:483 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-18_18-53-40_7981/logs.
2019-04-18 18:53:40,410 INFO services.py:1439 -- Starting the Plasma object store with 0.31 GB memory using /dev/shm.
2019-04-18 18:53:40,421 WARNING services.py:907 -- Failed to start the reporter. The reporter requires 'pip install psutil'.
WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
2019-04-18 18:53:40,425 INFO scripts.py:319 -- 
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --redis-address 172.31.7.117:6379

from the node you wish to add. You can connect a driver to the cluster from Python by running

import ray
ray.init(redis_address="172.31.7.117:6379")

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

ray stop
2019-04-18 14:53:40,593 INFO log_timer.py:21 -- NodeUpdater: i-064f62badf69f8cee: Setup commands completed [LogTimer=115941ms]
2019-04-18 14:53:40,593 INFO log_timer.py:21 -- NodeUpdater: i-064f62badf69f8cee: Applied config 248f16e493ac5bcd753a673eb7202fa2b49e0f9f  [LogTimer=173814ms]
2019-04-18 14:53:40,973 INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-node-status=up-to-date on ['i-064f62badf69f8cee'] [LogTimer=374ms]
2019-04-18 14:53:41,069 INFO commands.py:264 -- get_or_create_head_node:  Head node up-to-date, IP address is: 54.226.178.23
To monitor auto-scaling activity, you can run:

  ray exec ray_config.yaml  'tail -n 100 -f /tmp/ray/session_*/logs/monitor*'

To open a console on the cluster:

  ray attach ray_config.yaml

To ssh manually to the cluster, run:

  ssh -i /home/haines/.ssh/ray-autoscaler_us-east-1.pem ubuntu@54.226.178.23

2019-04-18 14:53:41,181 INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-runtime-config=248f16e493ac5bcd753a673eb7202fa2b49e0f9f on ['i-064f62badf69f8cee'] 

I used the standard configuration (example-full.yaml) with the following changes:-

min_workers: 2

initial_workers: 2

    type: aws
    region: us-east-1
    availability_zone: us-east1a,us-east-1b


head_node:
    InstanceType: t2.micro
    ImageId: ami-0565af6e282977273 # ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20190212

worker_nodes:
    InstanceType: c4.8xlarge
    ImageId: ami-0f9cf087c1f27d9b1 # ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20181114  

        #MarketType: spot

setup_commands:

- echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >>     ~/.bashrc
    - sudo apt-get update
    - sudo apt-get install python3-pip
    - pip3 install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl

    - pip3 install boto3==1.4.8  # 1.4.8 adds InstanceMarketOptions

Latest failed setup:-

setup_commands:
- sudo apt-get update
- wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh || true 1>/dev/null
- bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p $HOME/anaconda3 || true 1>/dev/null
- echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
- sudo pkill -9 apt-get || true
- sudo pkill -9 dpkg || true
- sudo dpkg --configure -a
- sudo apt-get install python3-pip || true
- pip3 install --upgrade pip
- pip3 install --user psutil
- pip3 install --user proctitle
- pip3 install --user ray
- pip3 install --user boto3==1.4.8
- pip3 install --user https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl

Solution

  • I ran a slightly modified version of the config you posted, and this works for me

    cluster_name: test
    
    min_workers: 2
    
    initial_workers: 2
    
    provider:
        type: aws
        region: us-east-1
        availability_zone: us-east1a,us-east-1b
    
    head_node:
        InstanceType: t2.micro
        ImageId: ami-0565af6e282977273 # ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20190212
    
    worker_nodes:
        InstanceType: c4.8xlarge
        ImageId: ami-0f9cf087c1f27d9b1 # ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20181114
            #MarketType: spot
    
    setup_commands:
        - sudo apt-get update
        # Install Anaconda.
        - wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh || true
        - bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p $HOME/anaconda3 || true
        - echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
        # Install Ray.
        - pip install ray
        - pip install boto3==1.4.8  # 1.4.8 adds InstanceMarketOptions
    

    The only real difference I think is installing Anaconda Python and putting in on the PATH so that pip finds it properly. I suspect the issue was related to not finding the right version of Python.