Getting Started with Agoge#

Agoge is a scalable, asynchronous reinforcement learning (RL) framework built on Ray, Hydra, and PyTorch. It is designed for distributed RL experiments, with support for modular agents, modular environments, and chat-based RL schemas.

Prerequisites#

Installation#

  1. Clone the repository:

    git clone <repo-url>
    cd agoge
    
  2. Install dependencies:

    uv sync --dev --group docs
    
  3. (If you need secrets/API keys) Create a .env file, edit it, and exclude it from your local git:

    cp .env.example .env
    echo ".env" >> .git/info/exclude
    
    • Note: .env is deliberately not listed in .gitignore, because ray job submit respects .gitignore and would therefore not package .env during job submission

    • Tip: Your .env variables are loaded automatically when start.py runs. To reference them in a Hydra config, use the OmegaConf environment-variable resolver: ${oc.env:ENV_NAME}

  4. (Optional, recommended) Install Node.js via nvm for markdownlint-cli2. This lets you run markdownlint locally outside of pre-commit; otherwise, pre-commit manages its own Node environment.

    # Install nvm (Node Version Manager)
    curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.3/install.sh | bash
    
    # Install Node.js LTS (Jod - v22.x)
    nvm install --lts=jod
    nvm use --lts=jod
    
    # Verify installation
    node --version  # Should show v22.x.x
    

    Note: Node.js is required for markdownlint-cli2, which runs as part of pre-commit hooks. We use the "Jod" LTS release (v22.x) to match the version specified in .pre-commit-config.yaml.

  5. (For development) Install pre-commit hooks:

    uv run pre-commit install
    

    This installs all pre-commit hooks, including:

    • trailing-whitespace and end-of-file-fixer
    • ruff-check and ruff-format (Python linting)
    • markdownlint-cli2 (Markdown linting)
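To illustrate the .env-plus-Hydra pattern from step 3, here is a hypothetical config fragment (the key names are illustrative, not from Agoge's actual configs), assuming the standard OmegaConf environment-variable resolver:

```yaml
# Hypothetical Hydra config fragment (key names are illustrative).
# ${oc.env:...} is OmegaConf's built-in environment-variable resolver;
# an optional second argument is the default used when the variable is unset.
wandb:
  api_key: ${oc.env:WANDB_API_KEY}
  project: ${oc.env:WANDB_PROJECT,agoge-dev}
```

Because start.py loads .env before Hydra composes the config, any variable defined there is visible to the resolver at composition time.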

Running Distributed RL with Ray#

For detailed information about Ray cluster setup and management, see Ray Cluster Setup.

There are two main approaches to running distributed workloads with Ray:

Option 1: One-step Launch (Most Common)#

This approach starts a Ray cluster, runs your command, and exits when done:

# Launch Ray cluster, run your command, and exit when done (all in one step)
sbatch hpc/launch_with_ray.sh uv run src/agoge/entrypoints/eval.py paths=gcs

# Run with multiple nodes
sbatch --nodes=2 hpc/launch_with_ray.sh uv run src/agoge/entrypoints/eval.py

Option 2: Persistent Cluster#

Alternatively, you can start a persistent Ray cluster and then submit jobs to it:

  1. Start Ray on your cluster:

    # Option 1: Launch a persistent Ray cluster via SLURM (attached mode)
    sbatch hpc/launch_ray.slurm
    
    # Option 2: Start a Ray cluster in an interactive session (attached mode)
    ./hpc/launch_ray.sh --mode=attached
    
  2. Set the Ray API server address (replace with your head node's IP). This requires the Ray dashboard to be running on the head node:

    # Set the environment variable for ray job submit
    export RAY_API_SERVER_ADDRESS='http://<head-node-ip>:8265'
    
    # For Python ray.init() (if using ray directly, not ray job submit):
    export RAY_ADDRESS='<head-node-ip>:6388'  # No http://, different port
    
  3. Submit a job:

    • If you are using gcs, run

      # Using the environment variable set above
      uv run ray job submit \
          --working-dir . -- uv run src/agoge/entrypoints/eval.py paths=gcs
      
      # OR specify the address directly
      uv run ray job submit \
          --address 'http://<head-node-ip>:8265' \
          --working-dir . -- uv run src/agoge/entrypoints/eval.py paths=gcs
      
    • If you are running the job on aws, make sure the Mind2Web tasks are used by passing task_loader=mind2web:

      uv run ray job submit \
          --address 'http://<head-node-ip>:8265' \
          --working-dir . -- uv run src/agoge/entrypoints/eval.py task_loader=mind2web paths=aws
      
    • On AWS you might need to set this environment variable for training: LD_LIBRARY_PATH=/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu:/opt/amazon/efa/lib

    • You can also copy the output logs to another path (e.g., an NFS mount). This happens at the end of the script, and the path has to be accessible from Ray's head node.
    • Here is an example where $(pwd) expands to your current directory, for example agoge/ on the NFS:

      uv run ray job submit \
          --working-dir . -- uv run src/agoge/start.py paths.persist_root=$(pwd)
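To keep the two address forms from step 2 straight, here is a minimal sketch (the ports mirror the examples above; adjust them to your cluster, and note the helper itself is hypothetical, not part of Agoge):

```python
def ray_addresses(head_ip: str, dashboard_port: int = 8265, gcs_port: int = 6388) -> dict:
    """Build the two address strings Ray tooling expects.

    ray job submit talks to the dashboard over HTTP, so its address
    carries an http:// scheme; ray.init() connects to the GCS port
    directly, with no scheme and a different port.
    """
    return {
        "RAY_API_SERVER_ADDRESS": f"http://{head_ip}:{dashboard_port}",
        "RAY_ADDRESS": f"{head_ip}:{gcs_port}",
    }

# Export these before running ray job submit / ray.init()
addrs = ray_addresses("10.0.0.5")
print(addrs["RAY_API_SERVER_ADDRESS"])  # http://10.0.0.5:8265
print(addrs["RAY_ADDRESS"])             # 10.0.0.5:6388
```

Mixing the two up is a common failure mode: ray job submit rejects the scheme-less GCS form, and ray.init() rejects the http:// dashboard form.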
      

Linting, Formatting, and Testing#

Python Linting (Ruff)#

uv run ruff check --fix .
uv run ruff format .

Markdown Linting (markdownlint-cli2)#

Run markdown linting manually:

# Lint all markdown files
markdownlint-cli2 "**/*.md"

# Lint specific directories
markdownlint-cli2 "{src,tests,docs}/**/*.md"

# Fix auto-fixable issues
markdownlint-cli2 --fix "**/*.md"

Configuration: Rules are defined in .markdownlint.json at the project root.

Pre-commit Hooks#

Run all pre-commit hooks (Python + Markdown):

# Run on staged files only
uv run pre-commit run

# Run on all files
uv run pre-commit run --all-files

# Run specific hook
uv run pre-commit run markdownlint-cli2 --all-files

Tests#

uv run pytest

GitHub Integration#

All linting checks (ruff and markdownlint-cli2) run automatically on pull requests via GitHub Actions. The workflow is defined in .github/workflows/pre-commit.yaml.

Project Structure#

  • src/agoge/agent/: Agent interface and implementations
  • src/agoge/environment/: Environment interface and implementations
  • src/agoge/runner.py: Orchestrates agent-environment interaction
  • src/agoge/trainer.py: Training loop and data pipeline
  • src/agoge/inference.py: Inference manager and worker classes
  • src/agoge/schema/: Data structures for chat, messages, trajectories
  • src/agoge/start.py: Main entry point for distributed training
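As a rough mental model of how these pieces fit together, the loop below sketches the kind of agent-environment orchestration src/agoge/runner.py is responsible for. The interfaces here are hypothetical, not Agoge's actual classes:

```python
from typing import Any, Protocol


class Environment(Protocol):
    """Hypothetical environment interface (illustrative only)."""
    def reset(self) -> Any: ...
    def step(self, action: Any) -> tuple[Any, float, bool]: ...


class Agent(Protocol):
    """Hypothetical agent interface (illustrative only)."""
    def act(self, observation: Any) -> Any: ...


def run_episode(agent: Agent, env: Environment) -> float:
    """Drive one episode to completion and return the total reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        # The runner mediates: agent picks an action, environment applies it.
        obs, reward, done = env.step(agent.act(obs))
        total += reward
    return total
```

In the real framework the trainer consumes the resulting trajectories (see src/agoge/schema/) and the inference manager serves the policy, but the control flow above is the core contract between agent and environment.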