Deploy Ray Cluster on HAL (Beta)

Note

This is a beta version of the deployment process. If you run into any problems, please ask for help in the HAL Slack channel.

Warning

Always remember to tear down your Ray cluster after you are done using it!

These are the instructions for setting up a Ray cluster with autoscaling capability on HAL. With autoscaling, the Ray cluster tries to add “nodes” by submitting new Slurm jobs when more resources are needed, and to disconnect “nodes” by cancelling their Slurm jobs when they sit idle.

Deployment Instructions

The overall deployment process is:

  1. Get the Ray library.

  2. Modify the Ray library to support the autoscaler on HAL.

  3. Configure launch-specific parameters.

The following sections provide step-by-step instructions for each piece of the process.

Get the Ray library into your Private Environment

Because Ray is not directly available on HAL and cannot be installed from pip, you need to clone the Ray library provided by Open-CE into your own environment and modify that copy.

  1. Load opence/1.6.1, using:

    module load opence/1.6.1
    
  2. Clone this module to your own environment:

    conda create --name <env_name> --clone opence/1.6.1
    

    You can use any environment name you like (make sure you remember it). This step can take about 30 minutes.

    The path to ray after you’ve cloned the environment looks something like this (you can verify it with the quick check after step 3):

    /home/<username>/.conda/envs/<env_name>/lib/python3.9/site-packages/ray
    
  3. Activate the environment:

    conda activate <env_name>
    

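If you want to confirm that the clone succeeded and that the activated environment is the one the autoscaler will use, a quick check like the following (run inside the activated environment) prints the installed Ray path and version. This is only a sanity-check sketch; the exact path and version depend on your username, environment name, and the Open-CE release.

    # Quick sanity check; run inside the activated clone (conda activate <env_name>).
    # The printed path should point into
    # /home/<username>/.conda/envs/<env_name>/lib/python3.9/site-packages/ray
    import ray

    print(ray.__file__)     # location of the cloned Ray library
    print(ray.__version__)  # Ray version shipped with opence/1.6.1
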
Configure bash to Load Modules Automatically

The Ray autoscaler requires conda to be accessible whenever a shell is opened. To make that happen, modify the ~/.bashrc file, which is executed automatically every time a bash shell starts.

  1. Run the following command:

    vi ~/.bashrc
    

    This will open the bashrc file using vim.

  2. Press i to edit the file and add this line under # User specific environment:

    module load opence/1.6.1
    
  3. Press escape to get out of edit mode.

  4. Enter :wq to save and quit.
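
To confirm that ~/.bashrc now loads the module automatically, open a brand-new shell (without running module load or conda activate by hand), start python3, and try importing Ray. This is only a quick sketch of the check; the version printed depends on the Open-CE release.

    # Run inside python3 in a fresh shell after editing ~/.bashrc.
    # If the opence/1.6.1 module loaded automatically, its Python is on the
    # path and Ray imports without any manual setup.
    import ray

    print(ray.__version__)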

Deploy the Ray-Slurm Autoscaler Module into the Ray Library

  1. Download the autoscaler code and deployment script from the Ray-SLURM-autoscaler repository by running:

    git clone https://github.com/TingkaiLiu/Ray-SLURM-autoscaler.git
    
  2. Several changes to the deployment script are needed specifically for the HAL cluster. Open deploy.py and make the following changes:

    • Set RAY_PATH to the path of your Ray library: /home/<username>/.conda/envs/<env_name>/lib/python3.9/site-packages/ray

    • Set the SLURM_IP_LOOKUP table to:

      SLURM_IP_LOOKUP = """{
          "hal01" : "192.168.20.1",
          "hal02" : "192.168.20.2",
          "hal03" : "192.168.20.3",
          "hal04" : "192.168.20.4",
          "hal05" : "192.168.20.5",
          "hal06" : "192.168.20.6",
          "hal07" : "192.168.20.7",
          "hal08" : "192.168.20.8",
          "hal09" : "192.168.20.9",
          "hal10" : "192.168.20.10",
          "hal11" : "192.168.20.11",
          "hal12" : "192.168.20.12",
          "hal13" : "192.168.20.13",
          "hal14" : "192.168.20.14",
          "hal15" : "192.168.20.15",
          "hal16" : "192.168.20.16",
      }"""
      
    • Change the HEAD and WORKER CPUS/GPUS values according to the partition you want to use (see Original Slurm Style - Available Queues). If you want to run the Ray head node outside Slurm, set the head node's CPUS/GPUS to 0. (A hedged example of these deploy.py settings appears after step 5.)

  3. Change slurm/worker.slurm:

    • Change line 4 from:

      #SBATCH --gpus-per-task=[_DEPLOY_WORKER_GPUS_]

      to:

      #SBATCH --gres=gpu:[_DEPLOY_WORKER_GPUS_]
      
    • Under the line set -x, add:

      SLURM_GPUS_PER_TASK="[_DEPLOY_WORKER_GPUS_]"
      
  4. Change slurm/head.slurm:

    • Change line 4 from:

      #SBATCH --gpus-per-task=[_DEPLOY_HEAD_GPUS_]

      to:

      #SBATCH --gres=gpu:[_DEPLOY_HEAD_GPUS_]
      
    • Under the line set -x, add:

      SLURM_GPUS_PER_TASK="[_DEPLOY_HEAD_GPUS_]"
      
  5. Run deploy.py:

    python3 deploy.py
    

    This should generate the ray-slurm.yaml file for cluster launching.

    At this point, the Ray autoscaler should be enabled.
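
For reference, the edited portion of deploy.py might end up looking roughly like the sketch below. RAY_PATH and SLURM_IP_LOOKUP are the names used in the steps above; the head/worker resource variable names and the numbers shown are illustrative placeholders, so match them to what the actual script defines and to the partition you plan to use.

    # Hypothetical excerpt of an edited deploy.py. Only RAY_PATH and
    # SLURM_IP_LOOKUP are named in the instructions above; the resource
    # variable names and values below are illustrative placeholders.

    # Path to the Ray library inside the cloned conda environment.
    RAY_PATH = "/home/<username>/.conda/envs/<env_name>/lib/python3.9/site-packages/ray"

    # Node-name-to-IP table for the HAL compute nodes (hal03 through hal15
    # omitted here for brevity; list all sixteen nodes as shown earlier).
    SLURM_IP_LOOKUP = """{
        "hal01" : "192.168.20.1",
        "hal02" : "192.168.20.2",
        "hal16" : "192.168.20.16",
    }"""

    # Per-node resources. Set the head resources to 0 if the head node runs
    # outside Slurm; size the worker resources to match your partition.
    HEAD_CPUS = 0
    HEAD_GPUS = 0
    WORKER_CPUS = 16
    WORKER_GPUS = 4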

Configuration for Specific Cluster Launch

After the module is deployed, you may want a different configuration each time you launch a Ray cluster; for example, you may want to change the maximum number of nodes for different workloads. Such launch-specific changes only require editing the cluster config YAML file.

  1. In the config yaml file:

    1. Change init_command on lines 43 and 60 to activate your own environment.

    2. Set under_slurm: on line 37 to 0 or 1 depending on your needs. (See the GitHub documentation for an explanation.)

    3. If you don’t have a reservation, comment out lines 45 and 62 (the lines that say - #SBATCH --reservation=username), making sure the commented lines stay aligned with the rest of the file.

    4. Change lines 46 and 63 according to the partition you want to use.

  2. To start the Ray cluster, run:

    ray up ray-slurm.yaml --no-config-cache
    

    If under_slurm is set to 1, at least one idle node is required; otherwise, the launching process keeps retrying until a node becomes idle.

    If you force-terminated the launching process, run ray down ray-slurm.yaml to perform garbage collection.

    If the ray up command runs successfully, the Ray cluster with autoscaling functionality should be started at this point.

  3. To connect to the Ray cluster in your Python code (a fuller connection sketch follows this list):

    • If you launch the head node outside Slurm (under_slurm = 0), use:

      ray.init(address="192.168.20.203:<gcs_port>", redis_password="<The password generated at start time>")
      
    • If you launch the head node inside Slurm (under_slurm = 1), find the head node IP in the printed output and connect to it with the Ray client:

      ray.init(address="ray://<head_ip>:<gcs_port>", redis_password="<The password generated at start time>")
      
  4. Warning

    Always remember to tear down your Ray cluster after you are done using it!

    To tear down the Ray cluster, run:

    ray down ray-slurm.yaml
    

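Once ray up finishes (and before you tear the cluster down), a short script like the following can confirm that the cluster is reachable from your Python code. This is a minimal sketch that assumes under_slurm = 1; fill in the head node IP, GCS port, and password printed at launch, and use the non-client address form shown above instead if you launched the head node outside Slurm.

    # Minimal connectivity check (sketch). Replace the placeholders with the
    # values printed by `ray up`; the address form assumes under_slurm = 1.
    import ray

    ray.init(
        address="ray://<head_ip>:<gcs_port>",
        redis_password="<The password generated at start time>",
    )

    @ray.remote
    def ping():
        return "pong"

    # If the cluster is reachable, this prints "pong".
    print(ray.get(ping.remote()))
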
Testing

To check whether the Ray cluster is autoscaling when you launch a heavy workload with Ray (such as the sketch shown after this list):

  • Check whether output messages starting with (scheduler appear, which indicates autoscaler activity.

  • Run squeue -u <username> to see if new Slurm jobs are submitted automatically under your username.

  • Check the .out file produced by Slurm.
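
As a concrete way to generate such a workload, a script along the following lines requests enough parallel work that the autoscaler should submit additional Slurm jobs. The connection placeholders, the per-task resource request, and the task count are all illustrative; size them so the total resources requested exceed what a single worker node provides.

    # Sketch of a workload heavy enough to trigger autoscaling. The connection
    # placeholders, num_cpus value, and task count are illustrative only.
    import time

    import ray

    ray.init(
        address="ray://<head_ip>:<gcs_port>",
        redis_password="<The password generated at start time>",
    )

    @ray.remote(num_cpus=1)
    def busy(i):
        time.sleep(60)  # hold the CPU long enough for the autoscaler to react
        return i

    # Requesting many more CPUs than one node provides should make the
    # autoscaler submit extra Slurm jobs; watch squeue -u <username> while
    # this runs and look for the (scheduler messages described above.
    print(len(ray.get([busy.remote(i) for i in range(256)])), "tasks finished")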

Acknowledgment: This document includes contributions by Tingkai Liu, Will Tegge, and Arnav Mehta.