# Inference Manager
The inference manager routes every model request through a single Ray actor so that agent code only worries about `InferenceRequest` objects, not how to reach OpenAI, Anthropic, or local vLLM workers. The actor launches backends on initialization and keeps them alive until you explicitly shut it down.
> **Heads-up:** Rate limiting and metric sinks are stubs today; they will land in a follow-up pass.
## Configuration
The manager is usually created through Hydra. The snippet below mirrors the integration tests and demonstrates both vLLM-backed and API-backed models:
```yaml
inference:
  _target_: agoge.inference_manager.create_inference_manager
  models:
    Qwen/Qwen3-0.6B:
      backend: vllm
      num_workers: 1
      worker_cfg:
        _target_: agoge.vllm_inference.InferenceWorker
        engine_args:
          _target_: vllm.AsyncEngineArgs
          model: Qwen/Qwen3-0.6B
          tensor_parallel_size: 1
          distributed_executor_backend: ray
          dtype: bfloat16
          generation_config: vllm
    gpt-4o-mini:
      backend: api
      client: openai
      client_config:
        _target_: agoge.client.openai.OpenAIClient
        config:
          _target_: agoge.client.openai.OpenAIClientConfig
          model_id: gpt-4o-mini
          api_key: ${oc.env:OPENAI_API_SECRET_KEY}
    claude-sonnet-4-5-20250929:
      backend: api
      client: anthropic
      client_config:
        _target_: agoge.client.anthropic.AnthropicClient
        config:
          _target_: agoge.client.anthropic.AnthropicClientConfig
          model_id: claude-sonnet-4-5-20250929
          api_key: ${oc.env:ANTHROPIC_KEY}
  clients: {}
```
Ensure GPUs are available when scheduling vLLM workers, and set `OPENAI_API_SECRET_KEY`/`ANTHROPIC_KEY` in the environment before launching API models.
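The entrypoints hand this block to Hydra rather than calling the factory by hand. Below is a minimal sketch of that path; the YAML file path is hypothetical, and `_recursive_=False` is an assumption made so the manager receives the raw model/client configs and builds its own workers, mirroring the bootstrapping example further down.

```python
# Sketch: let Hydra call create_inference_manager via the _target_ in the `inference` block.
# The config path is hypothetical; _recursive_=False is an assumption so nested _target_
# entries (workers, clients) are left for the manager to instantiate itself.
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/inference.yaml")  # hypothetical location of the snippet above
manager_handle = instantiate(cfg.inference, _recursive_=False)
```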
## Composing Providers
In practice, you compose pre-configured model groups using Hydra overrides rather than defining everything inline:
```bash
# Single provider
uv run src/agoge/entrypoints/eval.py inference_manager=openai

# Multiple providers
uv run src/agoge/entrypoints/eval.py inference_manager=[openai,anthropic]

# vLLM-backed model during training
uv run src/agoge/entrypoints/rl.py model=qwen3-0.6B inference_manager=[intraining]
```
The `intraining` configuration automatically keeps the inference model in sync with the training model specified by `model=`, so the training loop can collect trajectories with the same weights via the vLLM workers.
See `configs/inference_manager` for the available provider configurations.
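As a rough illustration of what such a group file might contain, the OpenAI entry from the inline snippet above could live in a file like the following; the filename and layout are assumptions, so check `configs/inference_manager` for the real definitions.

```yaml
# configs/inference_manager/openai.yaml -- illustrative sketch, not the checked-in file
models:
  gpt-4o-mini:
    backend: api
    client: openai
    client_config:
      _target_: agoge.client.openai.OpenAIClient
      config:
        _target_: agoge.client.openai.OpenAIClientConfig
        model_id: gpt-4o-mini
        api_key: ${oc.env:OPENAI_API_SECRET_KEY}
```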
## Parameter Synchronization During Training
The sections below describe how parameters are synchronized when `inference_manager=[intraining]` is active.
### Synchronization Modes
Three modes are available via `param_sync_mode`:

1. **`nonzero3_low_mem`** (non-ZeRO-3, default)
    - Direct parameter access from the training model
    - Use when: not running DeepSpeed ZeRO-3
    - Memory: low overhead
    - Speed: fast
    - Implementation: `_sync_params_nonzero3_one_by_one()`
2. **`zero3_high_mem`** (ZeRO-3)
    - Gathers all parameters at once before iterating
    - Use when: memory allows and you need the fastest sync
    - Memory: high (all parameters in memory simultaneously)
    - Speed: fastest
    - Implementation: `_sync_params_zero3_all_at_once()`
3. **`zero3_low_mem`** (ZeRO-3)
    - Gathers each parameter individually
    - Use when: memory constrained and a slower sync is acceptable
    - Memory: low (only one parameter gathered at a time)
    - Speed: slower
    - Implementation: `_sync_params_zero3_one_by_one()`
See `rl.py` for the implementations; the sketch below illustrates the two ZeRO-3 strategies.
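The difference between the two ZeRO-3 modes comes down to how many sharded parameters are materialized at once. The sketch uses DeepSpeed's `GatheredParameters` context manager; the function names echo the modes above but are illustrative rather than the actual `rl.py` implementations, and `send_fn` is a hypothetical callback that ships a tensor to the inference workers.

```python
# Illustrative only: contrast of the zero3_high_mem and zero3_low_mem strategies.
import deepspeed


def sync_all_at_once(model, send_fn):
    """zero3_high_mem: materialize every sharded parameter, then iterate (fast, memory-heavy)."""
    params = list(model.parameters())
    with deepspeed.zero.GatheredParameters(params):
        for name, p in model.named_parameters():
            send_fn(name, p.detach().cpu())  # all full parameters are resident here


def sync_one_by_one(model, send_fn):
    """zero3_low_mem: gather a single parameter at a time (slower, memory-light)."""
    for name, p in model.named_parameters():
        with deepspeed.zero.GatheredParameters([p]):
            send_fn(name, p.detach().cpu())  # only this parameter is fully materialized
```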
### Synchronization Configuration
```yaml
# configs/rl.yaml
param_sync_mode: nonzero3_low_mem  # or zero3_low_mem / zero3_high_mem for ZeRO-3
param_sync_every_n_steps: 4        # sync every N optimization steps (respects gradient accumulation)
```
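As a hedged illustration of how `param_sync_every_n_steps` is typically applied, the check below counts optimizer updates rather than gradient-accumulation micro-steps; `maybe_sync` and its arguments are hypothetical stand-ins for the actual logic in `rl.py`.

```python
def maybe_sync(optimizer_step: int, every_n: int, sync_fn) -> None:
    # Count optimizer updates, not micro-batches, so gradient accumulation is respected.
    if optimizer_step > 0 and optimizer_step % every_n == 0:
        sync_fn()
```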
### DeepSpeed ZeRO-3 Considerations
With ZeRO-3, parameters are sharded across GPUs. The sync process:

- Gathers parameters from all GPUs (using `gathered_parameters_for_sync`)
- Assigns each rank a subset of parameters (round-robin by index)
- Packs the tensors and sends them to inference workers via Ray
- Relies on barrier synchronization (implicit on gather) to ensure all ranks complete together
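A minimal sketch of the round-robin split and Ray hand-off described above is shown here; the worker handles and their `update_named_params` method are hypothetical, and the gather is assumed to have happened via the context manager shown earlier.

```python
# Illustrative only: each rank packs every world_size-th parameter and pushes it to the
# inference workers via Ray; the preceding gather acts as the implicit barrier.
import torch.distributed as dist


def push_rank_shard(model, workers):
    rank, world_size = dist.get_rank(), dist.get_world_size()
    packed = {
        name: p.detach().cpu()
        for i, (name, p) in enumerate(model.named_parameters())
        if i % world_size == rank  # round-robin assignment by parameter index
    }
    return [w.update_named_params.remote(packed) for w in workers]
```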
## Usage Examples
### Bootstrapping the manager
```python
import os

import ray
from omegaconf import OmegaConf

from agoge.inference_manager import create_inference_manager, get_inference_manager
from agoge.schema.inference_request import InferenceRequest
from agoge.schema.msgs import SystemMessage, UserMessage
from agoge.schema.trajectories import Chat

# Optional for vLLM cache placement
os.environ.setdefault("SCRATCH_HOME", "/tmp")

ray.init()

cfg = OmegaConf.create({
    "models": {
        "Qwen/Qwen3-0.6B": {
            "backend": "vllm",
            "num_workers": 1,
            "worker_cfg": {
                "_target_": "agoge.vllm_inference.InferenceWorker",
                "engine_args": {
                    "_target_": "vllm.AsyncEngineArgs",
                    "model": "Qwen/Qwen3-0.6B",
                    "tensor_parallel_size": 1,
                    "distributed_executor_backend": "ray",
                    "dtype": "bfloat16",
                    "generation_config": "vllm",
                },
            },
        }
    },
    "clients": {},
})

# Create the singleton actor
_ = create_inference_manager(models=cfg.models, clients=cfg.clients, num_cpus=4)
manager = get_inference_manager()
```
### Dispatching a chat completion
```python
chat = Chat(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="Say 'Hello!' and nothing else."),
    ]
)

request = InferenceRequest(
    messages=chat,
    model="Qwen/Qwen3-0.6B",
    temperature=0.7,
    max_tokens=64,
)

response = ray.get(manager.create_chat_completion.remote(request))
print(response.choices[0].message.content)

# When finished, release GPU memory
ray.get(manager.shutdown.remote())
ray.shutdown()
```
### Switching to an API model
```python
from agoge.schema.inference_request import InferenceRequest

request = InferenceRequest(
    messages=chat,
    model="gpt-4o-mini",
    temperature=0.7,
    max_tokens=64,
)

response = ray.get(manager.create_chat_completion.remote(request))
```
The manager automatically maps `model="gpt-4o-mini"` to the OpenAI client configured above. Anthropic models follow the same pattern.
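Targeting the Anthropic model from the same configuration only requires changing the `model` string; this assumes the `chat` and `manager` handles from the earlier examples are still in scope.

```python
request = InferenceRequest(
    messages=chat,
    model="claude-sonnet-4-5-20250929",  # routed to the Anthropic client from the config
    temperature=0.7,
    max_tokens=64,
)

response = ray.get(manager.create_chat_completion.remote(request))
```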