Getting Started with Agoge#

Agoge is a scalable, asynchronous reinforcement learning (RL) framework built on Ray, Hydra, and PyTorch. It is designed for distributed RL experiments, with support for modular agents, modular environments, and chat-based RL schemas.

Prerequisites#

Installation#

  1. Clone the repository:

    git clone <repo-url>
    cd agoge
    
  2. Install dependencies:

    uv sync --dev --group docs
    
  3. (If you need secrets/API keys) Create a .env file, edit it, and exclude it from your local git:

    cp .env.example .env
    echo ".env" >> .git/info/exclude
    
    • Note: .env is deliberately not listed in .gitignore, because ray job submit respects .gitignore and would therefore not package .env during job submission

    • Tip: Your .env variables are loaded automatically when start.py runs. To reference them in a Hydra config, use the OmegaConf environment-variable resolver: ${oc.env:ENV_NAME}

  4. (Optional, recommended) Install Node.js via nvm for markdownlint-cli2. This lets you run markdownlint locally outside of pre-commit; otherwise, pre-commit manages its own Node environment.

    # Install nvm (Node Version Manager)
    curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.3/install.sh | bash
    
    # Install Node.js LTS (Jod - v22.x)
    nvm install --lts=jod
    nvm use --lts=jod
    
    # Verify installation
    node --version  # Should show v22.x.x
    

    Note: Node.js is required for markdownlint-cli2, which runs as part of pre-commit hooks. We use the "Jod" LTS release (v22.x) to match the version specified in .pre-commit-config.yaml.

  5. (For development) Install pre-commit hooks:

    uv run pre-commit install
    

    This installs all pre-commit hooks, including:

    • trailing-whitespace and end-of-file-fixer
    • ruff-check and ruff-format (Python linting)
    • markdownlint-cli2 (Markdown linting)
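To illustrate the .env-plus-Hydra pattern from step 3, here is a hypothetical config fragment (the key names are illustrative, not from Agoge's actual configs), assuming the standard OmegaConf environment-variable resolver:

```yaml
# Hypothetical Hydra config fragment (key names are illustrative).
# ${oc.env:...} is OmegaConf's built-in environment-variable resolver;
# an optional second argument is the default used when the variable is unset.
wandb:
  api_key: ${oc.env:WANDB_API_KEY}
  project: ${oc.env:WANDB_PROJECT,agoge-dev}
```

Because start.py loads .env before Hydra composes the config, any variable defined there is visible to the resolver at composition time.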

Running Distributed RL with Ray#

For detailed information about Ray cluster setup and management, see Ray Cluster Setup.

There are two main approaches to running distributed workloads with Ray:

Option 1: One-step Launch (Most Common)#

This approach starts a Ray cluster, runs your command, and exits when done:

# Launch Ray cluster, run your command, and exit when done (all in one step)
sbatch hpc/launch_with_ray.sh uv run src/agoge/entrypoints/eval.py paths=gcs

# Run with multiple nodes
sbatch --nodes=2 hpc/launch_with_ray.sh uv run src/agoge/entrypoints/eval.py

Option 2: Persistent Cluster#

Alternatively, you can start a persistent Ray cluster and then submit jobs to it:

  1. Start Ray on your cluster:

    # Option 1: Launch a persistent Ray cluster via SLURM (attached mode)
    sbatch hpc/launch_ray.slurm
    
    # Option 2: Start a Ray cluster in an interactive session (attached mode)
    ./hpc/launch_ray.sh --mode=attached
    
  2. Set the Ray API server address (replace with your head node's IP). This requires the Ray dashboard to be running on the head node:

    # Set the environment variable for ray job submit
    export RAY_API_SERVER_ADDRESS='http://<head-node-ip>:8265'
    
    # For Python ray.init() (if using ray directly, not ray job submit):
    export RAY_ADDRESS='<head-node-ip>:6388'  # No http://, different port
    
  3. Submit a job:

    • If you are using gcs, run

      # Using the environment variable set above
      uv run ray job submit \
          --working-dir . -- uv run src/agoge/entrypoints/eval.py paths=gcs
      
      # OR specify the address directly
      uv run ray job submit \
          --address 'http://<head-node-ip>:8265' \
          --working-dir . -- uv run src/agoge/entrypoints/eval.py paths=gcs
      
    • If you are running the job on aws, make sure the Mind2Web tasks are used by passing task_loader=mind2web:

      uv run ray job submit \
          --address 'http://<head-node-ip>:8265' \
          --working-dir . -- uv run src/agoge/entrypoints/eval.py task_loader=mind2web paths=aws
      
    • On AWS you might need to set this environment variable for training: LD_LIBRARY_PATH=/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu:/opt/amazon/efa/lib

    • You can also copy the output logs to another path (e.g., an NFS mount). This happens at the end of the script, and the path has to be accessible from Ray's head node.
    • Here is an example where $(pwd) expands to your current directory, for example agoge/ on the NFS:

      uv run ray job submit \
          --working-dir . -- uv run src/agoge/start.py paths.persist_root=$(pwd)
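To keep the two address forms from step 2 straight, here is a minimal sketch (the ports mirror the examples above; adjust them to your cluster, and note the helper itself is hypothetical, not part of Agoge):

```python
def ray_addresses(head_ip: str, dashboard_port: int = 8265, gcs_port: int = 6388) -> dict:
    """Build the two address strings Ray tooling expects.

    ray job submit talks to the dashboard over HTTP, so its address
    carries an http:// scheme; ray.init() connects to the GCS port
    directly, with no scheme and a different port.
    """
    return {
        "RAY_API_SERVER_ADDRESS": f"http://{head_ip}:{dashboard_port}",
        "RAY_ADDRESS": f"{head_ip}:{gcs_port}",
    }

# Export these before running ray job submit / ray.init()
addrs = ray_addresses("10.0.0.5")
print(addrs["RAY_API_SERVER_ADDRESS"])  # http://10.0.0.5:8265
print(addrs["RAY_ADDRESS"])             # 10.0.0.5:6388
```

Mixing the two up is a common failure mode: ray job submit rejects the scheme-less GCS form, and ray.init() rejects the http:// dashboard form.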
      

Linting, Formatting, and Testing#

Python Linting (Ruff)#

uv run ruff check --fix .
uv run ruff format .

Markdown Linting (markdownlint-cli2)#

Run markdown linting manually:

# Lint all markdown files
markdownlint-cli2 "**/*.md"

# Lint specific directories
markdownlint-cli2 "{src,tests,docs}/**/*.md"

# Fix auto-fixable issues
markdownlint-cli2 --fix "**/*.md"

Configuration: Rules are defined in .markdownlint.json at the project root.

Pre-commit Hooks#

Run all pre-commit hooks (Python + Markdown):

# Run on staged files only
uv run pre-commit run

# Run on all files
uv run pre-commit run --all-files

# Run specific hook
uv run pre-commit run markdownlint-cli2 --all-files

Tests#

uv run pytest

GitHub Integration#

All linting checks (ruff and markdownlint-cli2) run automatically on pull requests via GitHub Actions. The workflow is defined in .github/workflows/pre-commit.yaml.

Project Structure#

  • src/agoge/agent/: Agent interface and implementations
  • src/agoge/environment/: Environment interface and implementations
  • src/agoge/runner.py: Orchestrates agent-environment interaction
  • src/agoge/trainer.py: Training loop and data pipeline
  • src/agoge/inference.py: Inference manager and worker classes
  • src/agoge/schema/: Data structures for chat, messages, trajectories
  • src/agoge/start.py: Main entry point for distributed training
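As a rough mental model of how these pieces fit together, the loop below sketches the kind of agent-environment orchestration src/agoge/runner.py is responsible for. The interfaces here are hypothetical, not Agoge's actual classes:

```python
from typing import Any, Protocol


class Environment(Protocol):
    """Hypothetical environment interface (illustrative only)."""
    def reset(self) -> Any: ...
    def step(self, action: Any) -> tuple[Any, float, bool]: ...


class Agent(Protocol):
    """Hypothetical agent interface (illustrative only)."""
    def act(self, observation: Any) -> Any: ...


def run_episode(agent: Agent, env: Environment) -> float:
    """Drive one episode to completion and return the total reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        # The runner mediates: agent picks an action, environment applies it.
        obs, reward, done = env.step(agent.act(obs))
        total += reward
    return total
```

In the real framework the trainer consumes the resulting trajectories (see src/agoge/schema/) and the inference manager serves the policy, but the control flow above is the core contract between agent and environment.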