Ray Cluster Setup#

This document provides detailed information about Ray cluster setup and management in Agoge.

Overview of Ray Cluster Management#

Agoge uses Ray for distributed training and evaluation. The Ray launch scripts have been designed to support different operational workflows while maintaining a consistent interface. All scripts must be run from the repository root directory.

Launch Scripts Architecture#

The Ray launch scripts consist of several components:

  • launch_utils.sh: Core utilities for environment setup, Ray cluster management, and job submission
  • launch_ray.sh: Unified script with --mode parameter to support different flows
  • launch_with_ray.sh: SBATCH wrapper for submit flow (start, run command, exit)
  • launch_ray.slurm: SBATCH wrapper for attached flow (start and keep running)
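To make the division of labor concrete, the sketch below shows how a `--mode` flag like the one in launch_ray.sh is typically dispatched in shell. This is an illustration only, not the actual contents of launch_ray.sh; the variable names and echoed messages are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch of --mode dispatch; not the real launch_ray.sh.

MODE="submit"  # hypothetical default

# Parse --mode=<value> from the argument list
for arg in "$@"; do
  case "$arg" in
    --mode=*) MODE="${arg#--mode=}" ;;
  esac
done

case "$MODE" in
  submit)   echo "start cluster, run command, exit when done" ;;
  attached) echo "start cluster and keep it running" ;;
  *)        echo "unknown mode: $MODE" >&2; exit 1 ;;
esac
```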

Usage Workflows#

Agoge supports three primary workflows for Ray cluster management:

Flow 1: SBATCH + Execute (Submit Mode)#

This workflow starts a Ray cluster, runs a command, and exits when the command completes.

# Start cluster, run command, exit when done
sbatch hpc/launch_with_ray.sh uv run src/agoge/entrypoints/eval.py paths=gcs

# Scale to multiple nodes
sbatch --nodes=2 --gpus-per-node=8 hpc/launch_with_ray.sh uv run src/agoge/entrypoints/eval.py paths=gcs

# Run tests with Ray
sbatch hpc/launch_with_ray.sh uv run pytest tests/integration/test_sft.py -v -s

Flow 2: Interactive/Attached Mode#

This workflow starts a Ray cluster in an interactive session and keeps it running, allowing you to submit multiple jobs to the same cluster.

# Start Ray cluster in attached mode (interactive session)
./hpc/launch_ray.sh --mode=attached

# For explicit environment setting
./hpc/launch_ray.sh --mode=attached --env=gcp_slurm

When the cluster starts, you'll see connection instructions in the console output, including:

  • The Ray dashboard URL for job submission (RAY_API_SERVER_ADDRESS)
  • The address for ray.init() in Python code (RAY_ADDRESS)
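Once the launcher prints those values, you can copy them into any shell you plan to submit jobs from. The IP address below is a placeholder; substitute the head-node address from your own console output.

```shell
# Placeholder head-node IP; replace with the address printed by the launcher
HEAD_IP="10.0.0.5"

export RAY_API_SERVER_ADDRESS="http://${HEAD_IP}:8265"  # used by `ray job submit`
export RAY_ADDRESS="${HEAD_IP}:6388"                    # used by `ray.init()`

echo "Dashboard: $RAY_API_SERVER_ADDRESS"
echo "ray.init:  $RAY_ADDRESS"
```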

Flow 3: SBATCH + Standby Cluster#

This workflow submits a SLURM job that starts a Ray cluster and keeps it running in attached mode, allowing you to submit jobs to it from any terminal.

# Start a persistent Ray cluster via SLURM
sbatch hpc/launch_ray.slurm

After starting the cluster, check the log file to find connection information:

# View connection information
cat slurm_logs/latest.out

Environment Configuration#

Key Environment Variables#

The Ray launch scripts use several environment variables that can be configured:

| Variable | Default | Description |
| --- | --- | --- |
| `RAY_TCP_PORT` | `6388` | TCP port for the Ray head node (used by `ray.init()`) |
| `RAY_DASHBOARD_PORT` | `8265` | HTTP port for the Ray dashboard (used by `ray job submit`) |
| `RAY_API_SERVER_ADDRESS` | - | URL for Ray job submission (`http://<host>:<port>`) |
| `RAY_TCP_ADDRESS` | - | Internal TCP address variable (not used directly by `ray.init()`) |
| `RAY_ADDRESS` | - | Official Ray environment variable for `ray.init()` (`host:port`) |
| `ENABLE_GCP_ENV` | `true` | Enable NCCL configuration for the GCP environment |
| `UV_VENV_SEED` | varies | Whether to seed the UV venv (`false` in `launch_ray.sh`, `true` in `launch_with_ray.sh`) |
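Because these are ordinary environment variables, each can be overridden per invocation. A minimal sketch of how such defaulting typically works in shell (the real launch_utils.sh may differ in detail):

```shell
# Apply defaults only when the caller has not already exported a value
: "${RAY_TCP_PORT:=6388}"
: "${RAY_DASHBOARD_PORT:=8265}"
: "${ENABLE_GCP_ENV:=true}"

echo "Ray head TCP port: $RAY_TCP_PORT"
echo "Dashboard port:    $RAY_DASHBOARD_PORT"
```

With this pattern, a one-off launch such as `RAY_TCP_PORT=6390 ./hpc/launch_ray.sh --mode=attached` would override the TCP port for that invocation only.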

Connection Methods#

There are two primary ways to connect to a Ray cluster:

  1. For ray job submit (CLI):
# Set environment variable
export RAY_API_SERVER_ADDRESS='http://<head-node-ip>:8265'
ray job submit --working-dir . -- uv run src/script.py

# OR specify directly
ray job submit --address 'http://<head-node-ip>:8265' --working-dir . -- uv run src/script.py
  2. For ray.init() in Python code:
# Method 1: Automatic connection (recommended)
import ray
ray.init(address="auto")

# Method 2: Manual connection via environment variable
# First set the environment variable:
# export RAY_ADDRESS='<head-node-ip>:6388'  # This is the official Ray environment variable
import ray
ray.init()  # Will automatically use RAY_ADDRESS environment variable

Environment Types#

Currently, the scripts support the gcp_slurm environment type, which sets up appropriate cache directories:

  • SCRATCH_HOME - Base scratch directory
  • TRITON_CACHE_DIR - Triton kernel cache
  • HF_HOME - Hugging Face cache
  • UV_CACHE_DIR - UV package cache
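As an illustration, a gcp_slurm-style setup might derive the cache locations from SCRATCH_HOME as below. The exact subdirectory names here are assumptions for the sketch and are not guaranteed to match what launch_utils.sh actually uses.

```shell
# Hypothetical cache layout rooted at SCRATCH_HOME
export SCRATCH_HOME="${SCRATCH_HOME:-/scratch/$USER}"
export TRITON_CACHE_DIR="$SCRATCH_HOME/.triton"   # Triton kernel cache
export HF_HOME="$SCRATCH_HOME/.hf"                # Hugging Face cache
export UV_CACHE_DIR="$SCRATCH_HOME/.uv"           # UV package cache

echo "Caches rooted at: $SCRATCH_HOME"
```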

The environment type is specified with the --env parameter to launch_ray.sh or as the first parameter to the main() function in launch_utils.sh.