# Ray Cluster Setup
This document provides detailed information about Ray cluster setup and management in Agoge.
## Overview of Ray Cluster Management
Agoge uses Ray for distributed training and evaluation. The Ray launch scripts have been designed to support different operational workflows while maintaining a consistent interface. All scripts must be run from the repository root directory.
## Launch Scripts Architecture
The Ray launch scripts consist of several components:
- `launch_utils.sh`: Core utilities for environment setup, Ray cluster management, and job submission
- `launch_ray.sh`: Unified script with a `--mode` parameter to support the different flows
- `launch_with_ray.sh`: SBATCH wrapper for the submit flow (start, run command, exit)
- `launch_ray.slurm`: SBATCH wrapper for the attached flow (start and keep running)
## Usage Workflows
Agoge supports three primary workflows for Ray cluster management:
### Flow 1: SBATCH + Execute (Submit Mode)
This workflow starts a Ray cluster, runs a command, and exits when the command completes.
```bash
# Start cluster, run command, exit when done
sbatch hpc/launch_with_ray.sh uv run src/agoge/entrypoints/eval.py paths=gcs

# Scale to multiple nodes
sbatch --nodes=2 --gpus-per-node=8 hpc/launch_with_ray.sh uv run src/agoge/entrypoints/eval.py paths=gcs

# Run tests with Ray
sbatch hpc/launch_with_ray.sh uv run pytest tests/integration/test_sft.py -v -s
```
### Flow 2: Interactive/Attached Mode
This workflow starts a Ray cluster in an interactive session and keeps it running, allowing you to submit multiple jobs to the same cluster.
```bash
# Start Ray cluster in attached mode (interactive session)
./hpc/launch_ray.sh --mode=attached

# For explicit environment setting
./hpc/launch_ray.sh --mode=attached --env=gcp_slurm
```
When the cluster starts, you'll see connection instructions in the console output, including:

- The Ray dashboard URL for job submission (`RAY_API_SERVER_ADDRESS`)
- The address for `ray.init()` in Python code (`RAY_ADDRESS`)
### Flow 3: SBATCH + Standby Cluster
This workflow submits a SLURM job that starts a Ray cluster and keeps it running in attached mode, allowing you to submit jobs to it from any terminal.
```bash
# Start a persistent Ray cluster via SLURM
sbatch hpc/launch_ray.slurm
```
After starting the cluster, check the log file to find connection information:
```bash
# View connection information
cat slurm_logs/latest.out
```
## Environment Configuration
### Key Environment Variables
The Ray launch scripts use several environment variables that can be configured:
| Variable | Default | Description |
|---|---|---|
| `RAY_TCP_PORT` | `6388` | TCP port for the Ray head (for `ray.init()`) |
| `RAY_DASHBOARD_PORT` | `8265` | HTTP port for the Ray dashboard (for `ray job submit`) |
| `RAY_API_SERVER_ADDRESS` | - | URL for Ray job submission (`http://<host>:<port>`) |
| `RAY_TCP_ADDRESS` | - | Internal TCP address variable (not used directly by `ray.init()`) |
| `RAY_ADDRESS` | - | Official Ray environment variable for `ray.init()` (`<host>:<port>`) |
| `ENABLE_GCP_ENV` | `true` | Enable NCCL configuration for the GCP environment |
| `UV_VENV_SEED` | `false`/`true` | Whether to seed the UV venv (`false` in `launch_ray.sh`, `true` in `launch_with_ray.sh`) |
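The two address variables are easy to confuse: `RAY_API_SERVER_ADDRESS` is a full HTTP URL used by `ray job submit`, while `RAY_ADDRESS` is a bare `host:port` pair used by `ray.init()`. A minimal sketch of the difference (the helper names are illustrative only and do not exist in the Agoge scripts):

```python
# Illustrative helpers contrasting the two address formats from the table.
# The default ports mirror the table; the function names are hypothetical.

def dashboard_url(host: str, port: int = 8265) -> str:
    """Full HTTP URL, the shape expected by `ray job submit --address`."""
    return f"http://{host}:{port}"

def tcp_address(host: str, port: int = 6388) -> str:
    """Bare host:port pair, the shape expected in RAY_ADDRESS for `ray.init()`."""
    return f"{host}:{port}"

print(dashboard_url("10.0.0.5"))  # http://10.0.0.5:8265
print(tcp_address("10.0.0.5"))    # 10.0.0.5:6388
```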
## Connection Methods
There are two primary ways to connect to a Ray cluster:
1. For `ray job submit` (CLI):

   ```bash
   # Set the environment variable
   export RAY_API_SERVER_ADDRESS='http://<head-node-ip>:8265'
   ray job submit --working-dir . -- uv run src/script.py

   # OR specify the address directly
   ray job submit --address 'http://<head-node-ip>:8265' --working-dir . -- uv run src/script.py
   ```
2. For `ray.init()` in Python code:

   ```python
   # Method 1: Automatic connection (recommended)
   import ray
   ray.init(address="auto")

   # Method 2: Manual connection via environment variable
   # First set the environment variable:
   # export RAY_ADDRESS='<head-node-ip>:6388'  # the official Ray environment variable
   import ray
   ray.init()  # automatically uses the RAY_ADDRESS environment variable
   ```
## Environment Types
Currently, the scripts support the `gcp_slurm` environment type, which sets up appropriate cache directories:

- `SCRATCH_HOME`: Base scratch directory
- `TRITON_CACHE_DIR`: Triton kernel cache
- `HF_HOME`: Hugging Face cache
- `UV_CACHE_DIR`: UV package cache
The environment type is specified with the `--env` parameter to `launch_ray.sh`, or as the first parameter to the `main()` function in `launch_utils.sh`.
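These cache variables are exported by the environment setup before the Ray cluster starts. A hedged sketch of what that setup might look like (the exact paths are assumptions for illustration; `launch_utils.sh` is the source of truth):

```shell
# Assumed layout only; the real values live in launch_utils.sh.
export SCRATCH_HOME="/scratch/$USER"             # base scratch directory
export TRITON_CACHE_DIR="$SCRATCH_HOME/.triton"  # Triton kernel cache
export HF_HOME="$SCRATCH_HOME/.hf"               # Hugging Face cache
export UV_CACHE_DIR="$SCRATCH_HOME/.uv"          # UV package cache
```

Pointing all four caches under one scratch root keeps large downloads and compiled kernels off the (often small) home filesystem on HPC nodes.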