# Getting Started with Agoge
Agoge is a scalable, asynchronous reinforcement learning (RL) framework built on Ray, Hydra, and PyTorch. It is designed for distributed RL experiments and supports modular agents, modular environments, and chat-based RL schemas.
## Prerequisites
- Python 3.12+
- Ray
- Hydra
- uv
- Node.js 22+ (for markdownlint-cli2, recommended via nvm, see Installation)
- markdownlint-cli2 (for markdown linting)
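
Before installing, you can quickly confirm that the core tooling is available. The snippet below is a minimal sketch and assumes `python3`, `uv`, and `node` are already on your `PATH`:

```bash
# Quick sanity check of prerequisite tooling
python3 --version   # should report 3.12 or newer
uv --version
node --version      # should report v22.x for the markdownlint-cli2 hooks
```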
## Installation
- Clone the repository:

  ```bash
  git clone <repo-url>
  cd agoge
  ```

- Install dependencies:

  ```bash
  uv sync --dev --group docs
  ```
- (If you need secrets/API keys) Create a `.env`, edit it, and ignore it in your git:

  ```bash
  cp .env.example .env
  echo ".env" >> .git/info/exclude
  ```

  Note: `.env` is not added to `.gitignore` because `ray submit` respects `.gitignore` and would then not package `.env` during job submission.

  Tip: Your `.env` variables are loaded automatically when `start.py` runs. If you want to use them in your Hydra config, reference them with `${oc.env:ENV_NAME}` (see the sketch after this list).
- (Optional, recommended) Install Node.js via nvm for markdownlint-cli2. This lets you run markdownlint locally without pre-commit; otherwise pre-commit manages its own environment.

  ```bash
  # Install nvm (Node Version Manager)
  curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.3/install.sh | bash

  # Install Node.js LTS (Jod - v22.x)
  nvm install --lts=jod
  nvm use --lts=jod

  # Verify installation
  node --version  # Should show v22.x.x
  ```

  Note: Node.js is required for markdownlint-cli2, which runs as part of the pre-commit hooks. We use the "Jod" LTS release (v22.x) to match the version specified in `.pre-commit-config.yaml`.
- (For development) Install pre-commit hooks:

  ```bash
  uv run pre-commit install
  ```

  This installs all pre-commit hooks, including:

  - trailing-whitespace, end-of-file-fixer
  - ruff-check and ruff-format (Python linting)
  - markdownlint-cli2 (Markdown linting)
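
The `${oc.env:...}` interpolation mentioned in the `.env` tip above is the standard OmegaConf resolver for environment variables. The snippet below is a hedged sketch: `PERSIST_ROOT` and the NFS path are hypothetical, while `paths.persist_root` is the config key used later in this guide.

```bash
# Hypothetical example: PERSIST_ROOT is an assumed variable name, the path an assumed value
echo "PERSIST_ROOT=/mnt/nfs/agoge-runs" >> .env

# In a Hydra config (YAML), the value can then be referenced as:
#   persist_root: ${oc.env:PERSIST_ROOT}
```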
## Running Distributed RL with Ray
For detailed information about Ray cluster setup and management, see Ray Cluster Setup.
There are two main approaches to running distributed workloads with Ray:
### Option 1: One-step Launch (Most Common)
This approach starts a Ray cluster, runs your command, and exits when done:
```bash
# Launch Ray cluster, run your command, and exit when done (all in one step)
sbatch hpc/launch_with_ray.sh uv run src/agoge/entrypoints/eval.py paths=gcs

# Run with multiple nodes
sbatch --nodes=2 hpc/launch_with_ray.sh uv run src/agoge/entrypoints/eval.py
```
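
Both `sbatch` invocations return a job ID that you can monitor with standard SLURM commands. This is a minimal sketch; it assumes the default `slurm-<jobid>.out` output naming, which `hpc/launch_with_ray.sh` may override.

```bash
# Check that the job is pending or running
squeue -u $USER

# Follow the job output (assumes the default slurm-<jobid>.out naming)
tail -f slurm-<jobid>.out

# Cancel the job if needed
scancel <jobid>
```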
### Option 2: Persistent Cluster
Alternatively, you can start a persistent Ray cluster and then submit jobs to it:
- Start Ray on your cluster:

  ```bash
  # Option 1: Launch a persistent Ray cluster via SLURM (attached mode)
  sbatch hpc/launch_ray.slurm

  # Option 2: Start a Ray cluster in an interactive session (attached mode)
  ./hpc/launch_ray.sh --mode=attached
  ```
- Set the Ray API server address (replace with your head node's IP). This requires the Ray dashboard to be running:

  ```bash
  # Set the environment variable for ray job submit
  export RAY_API_SERVER_ADDRESS='http://<head-node-ip>:8265'

  # For Python ray.init() (if using Ray directly, not ray job submit):
  export RAY_ADDRESS='<head-node-ip>:6388'  # No http://, different port
  ```
- Submit a job:

  - If you are using `gcs`, run:

    ```bash
    # Using the environment variable set above
    uv run ray job submit \
      --working-dir . -- uv run src/agoge/entrypoints/eval.py paths=gcs

    # OR specify the address directly
    uv run ray job submit \
      --address 'http://<head-node-ip>:8265' \
      --working-dir . -- uv run src/agoge/entrypoints/eval.py paths=gcs
    ```

  - If you are using `aws` to run the job, make sure that the Mind2Web tasks are used by passing:

    ```bash
    uv run ray job submit \
      --address 'http://<head-node-ip>:8265' \
      --working-dir . -- uv run src/agoge/entrypoints/eval.py task_loader=mind2web paths=aws
    ```
- On AWS you might need this environment variable for training (see the sketch after this list for one way to pass it):

  ```bash
  LD_LIBRARY_PATH: /opt/amazon/ofi-nccl/lib/x86_64-linux-gnu:/opt/amazon/efa/lib
  ```

- You can also copy the output logs to another path (e.g., NFS). This happens at the end of the script, and the path has to be accessible from Ray's head node.

  Here is an example where `$(pwd)` expands to the path you are currently in, for example `agoge/` on the NFS:

  ```bash
  uv run ray job submit \
    --working-dir . -- uv run src/agoge/start.py paths.persist_root=$(pwd)
  ```
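
One way to set `LD_LIBRARY_PATH` for a submitted job is Ray's runtime environment mechanism. The snippet below is a sketch using `ray job submit --runtime-env-json`, with the AWS library path taken from above and the entrypoint mirroring the AWS example; the follow-up commands show how to inspect a submitted job with the Ray job CLI.

```bash
# Sketch: pass LD_LIBRARY_PATH to the job via a Ray runtime environment
uv run ray job submit \
  --address 'http://<head-node-ip>:8265' \
  --runtime-env-json '{"env_vars": {"LD_LIBRARY_PATH": "/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu:/opt/amazon/efa/lib"}}' \
  --working-dir . -- uv run src/agoge/entrypoints/eval.py task_loader=mind2web paths=aws

# Inspect submitted jobs
uv run ray job list
uv run ray job status <job-id>
uv run ray job logs <job-id> --follow
```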
## Linting, Formatting, and Testing

### Python Linting (Ruff)

```bash
uv run ruff check --fix .
uv run ruff format .
```
### Markdown Linting (markdownlint-cli2)
Run markdown linting manually:
```bash
# Lint all markdown files
markdownlint-cli2 "**/*.md"

# Lint specific directories
markdownlint-cli2 "{src,tests,docs}/**/*.md"

# Fix auto-fixable issues
markdownlint-cli2 --fix "**/*.md"
```
Configuration: Rules are defined in `.markdownlint.json` at the project root.
### Pre-commit Hooks
Run all pre-commit hooks (Python + Markdown):
```bash
# Run on staged files only
uv run pre-commit run

# Run on all files
uv run pre-commit run --all-files

# Run specific hook
uv run pre-commit run markdownlint-cli2 --all-files
```
### Tests

```bash
uv run pytest
```
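
For quicker iteration you can run a subset of the suite with standard pytest options. The snippet below is a sketch; the `tests/` path and the `runner` keyword are illustrative assumptions, not names taken from this repository.

```bash
# Run a single directory or file (paths are hypothetical examples)
uv run pytest tests/

# Select tests by keyword, stop at the first failure, and show verbose output
uv run pytest -k "runner" -x -v
```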
### GitHub Integration

All linting checks (ruff and markdownlint-cli2) run automatically on pull requests via GitHub Actions. The workflow is defined in `.github/workflows/pre-commit.yaml`.
## Project Structure

- `src/agoge/agent/`: Agent interface and implementations
- `src/agoge/environment/`: Environment interface and implementations
- `src/agoge/runner.py`: Orchestrates agent-environment interaction
- `src/agoge/trainer.py`: Training loop and data pipeline
- `src/agoge/inference.py`: Inference manager and worker classes
- `src/agoge/schema/`: Data structures for chat, messages, trajectories
- `src/agoge/start.py`: Main entry point for distributed training