# Inference Manager
The inference manager routes every model request through a single Ray actor so that agent code only worries about `InferenceRequest` objects, not how to reach OpenAI, Anthropic, or local vLLM workers. The actor launches backends on initialization and keeps them alive until you explicitly shut it down.
> **Heads-up:** Rate limiting and metric sinks are stubs today; they will land in a follow-up pass.
## Configuration
The manager is usually created through Hydra. The snippet below mirrors the integration tests and demonstrates both vLLM-backed and API-backed models:
```yaml
inference:
  _target_: agoge.inference_manager.create_inference_manager
  models:
    Qwen/Qwen3-0.6B:
      backend: vllm
      num_workers: 1
      worker_cfg:
        _target_: agoge.vllm_inference.InferenceWorker
        engine_args:
          _target_: vllm.AsyncEngineArgs
          model: Qwen/Qwen3-0.6B
          tensor_parallel_size: 1
          distributed_executor_backend: ray
          dtype: bfloat16
          generation_config: vllm
    gpt-4o-mini:
      backend: api
      client: openai
      client_config:
        _target_: agoge.client.openai.OpenAIClient
        config:
          _target_: agoge.client.openai.OpenAIClientConfig
          model_id: gpt-4o-mini
          api_key: ${oc.env:OPENAI_API_SECRET_KEY}
    claude-sonnet-4-5-20250929:
      backend: api
      client: anthropic
      client_config:
        _target_: agoge.client.anthropic.AnthropicClient
        config:
          _target_: agoge.client.anthropic.AnthropicClientConfig
          model_id: claude-sonnet-4-5-20250929
          api_key: ${oc.env:ANTHROPIC_KEY}
  clients: {}
```
Ensure GPUs are available when scheduling vLLM workers, and set `OPENAI_API_SECRET_KEY`/`ANTHROPIC_KEY` in the environment before launching API models.
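The entrypoints hand this block to Hydra rather than calling the factory by hand. Below is a minimal sketch of that path; the YAML file path is hypothetical, and `_recursive_=False` is an assumption made so the manager receives the raw model/client configs and builds its own workers, mirroring the bootstrapping example further down.

```python
# Sketch: let Hydra call create_inference_manager via the _target_ in the `inference` block.
# The config path is hypothetical; _recursive_=False is an assumption so nested _target_
# entries (workers, clients) are left for the manager to instantiate itself.
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/inference.yaml")  # hypothetical location of the snippet above
manager_handle = instantiate(cfg.inference, _recursive_=False)
```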
## Composing Providers
In practice, you compose pre-configured model groups using Hydra overrides rather than defining everything inline:
```bash
# Single provider
uv run src/agoge/entrypoints/eval.py inference_manager=openai

# Multiple providers
uv run src/agoge/entrypoints/eval.py inference_manager=[openai,anthropic]

# vLLM-backed model during training
uv run src/agoge/entrypoints/rl.py model=qwen3-0.6B inference_manager=[intraining]
```
The `intraining` configuration automatically keeps the inference model in sync with the training model specified by `model=`, so the training loop can collect trajectories with the same weights via the vLLM workers.
See `configs/inference_manager` for the available provider configurations.
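As a rough illustration of what such a group file might contain, the OpenAI entry from the inline snippet above could live in a file like the following; the filename and layout are assumptions, so check `configs/inference_manager` for the real definitions.

```yaml
# configs/inference_manager/openai.yaml -- illustrative sketch, not the checked-in file
models:
  gpt-4o-mini:
    backend: api
    client: openai
    client_config:
      _target_: agoge.client.openai.OpenAIClient
      config:
        _target_: agoge.client.openai.OpenAIClientConfig
        model_id: gpt-4o-mini
        api_key: ${oc.env:OPENAI_API_SECRET_KEY}
```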
## Parameter Synchronization During Training
The sections below describe how parameters are synchronized when `inference_manager=[intraining]` is active.
### Synchronization Modes
Three modes are available via `param_sync_mode`:

1. **`nonzero3_low_mem`** (non-ZeRO-3, default)
    - Direct parameter access from the training model
    - Use when: not running DeepSpeed ZeRO-3
    - Memory: low overhead
    - Speed: fast
    - Implementation: `_sync_params_nonzero3_one_by_one()`
2. **`zero3_high_mem`** (ZeRO-3)
    - Gathers all parameters at once before iterating
    - Use when: memory allows and you need the fastest sync
    - Memory: high (all parameters in memory simultaneously)
    - Speed: fastest
    - Implementation: `_sync_params_zero3_all_at_once()`
3. **`zero3_low_mem`** (ZeRO-3)
    - Gathers each parameter individually
    - Use when: memory constrained and a slower sync is acceptable
    - Memory: low (only one parameter gathered at a time)
    - Speed: slower
    - Implementation: `_sync_params_zero3_one_by_one()`
See `rl.py` for the implementations; the sketch below illustrates the two ZeRO-3 strategies.
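The difference between the two ZeRO-3 modes comes down to how many sharded parameters are materialized at once. The sketch uses DeepSpeed's `GatheredParameters` context manager; the function names echo the modes above but are illustrative rather than the actual `rl.py` implementations, and `send_fn` is a hypothetical callback that ships a tensor to the inference workers.

```python
# Illustrative only: contrast of the zero3_high_mem and zero3_low_mem strategies.
import deepspeed


def sync_all_at_once(model, send_fn):
    """zero3_high_mem: materialize every sharded parameter, then iterate (fast, memory-heavy)."""
    params = list(model.parameters())
    with deepspeed.zero.GatheredParameters(params):
        for name, p in model.named_parameters():
            send_fn(name, p.detach().cpu())  # all full parameters are resident here


def sync_one_by_one(model, send_fn):
    """zero3_low_mem: gather a single parameter at a time (slower, memory-light)."""
    for name, p in model.named_parameters():
        with deepspeed.zero.GatheredParameters([p]):
            send_fn(name, p.detach().cpu())  # only this parameter is fully materialized
```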
### Synchronization Configuration
```yaml
# configs/rl.yaml
param_sync_mode: nonzero3_low_mem  # or zero3_low_mem / zero3_high_mem for ZeRO-3
param_sync_every_n_steps: 4        # sync every N optimization steps (respects gradient accumulation)
```
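As a hedged illustration of how `param_sync_every_n_steps` is typically applied, the check below counts optimizer updates rather than gradient-accumulation micro-steps; `maybe_sync` and its arguments are hypothetical stand-ins for the actual logic in `rl.py`.

```python
def maybe_sync(optimizer_step: int, every_n: int, sync_fn) -> None:
    # Count optimizer updates, not micro-batches, so gradient accumulation is respected.
    if optimizer_step > 0 and optimizer_step % every_n == 0:
        sync_fn()
```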
### DeepSpeed ZeRO-3 Considerations
With ZeRO-3, parameters are sharded across GPUs. The sync process:

- Gathers parameters from all GPUs (using `gathered_parameters_for_sync`)
- Assigns each rank a subset of parameters (round-robin by index)
- Packs the tensors and sends them to inference workers via Ray
- Relies on barrier synchronization (implicit on gather) to ensure all ranks complete together
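A minimal sketch of the round-robin split and Ray hand-off described above is shown here; the worker handles and their `update_named_params` method are hypothetical, and the gather is assumed to have happened via the context manager shown earlier.

```python
# Illustrative only: each rank packs every world_size-th parameter and pushes it to the
# inference workers via Ray; the preceding gather acts as the implicit barrier.
import torch.distributed as dist


def push_rank_shard(model, workers):
    rank, world_size = dist.get_rank(), dist.get_world_size()
    packed = {
        name: p.detach().cpu()
        for i, (name, p) in enumerate(model.named_parameters())
        if i % world_size == rank  # round-robin assignment by parameter index
    }
    return [w.update_named_params.remote(packed) for w in workers]
```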
## Usage Examples
### Bootstrapping the manager
```python
import os

import ray
from omegaconf import OmegaConf

from agoge.inference_manager import create_inference_manager, get_inference_manager
from agoge.schema.inference_request import InferenceRequest
from agoge.schema.msgs import SystemMessage, UserMessage
from agoge.schema.trajectories import Chat

# Optional for vLLM cache placement
os.environ.setdefault("SCRATCH_HOME", "/tmp")

ray.init()

cfg = OmegaConf.create({
    "models": {
        "Qwen/Qwen3-0.6B": {
            "backend": "vllm",
            "num_workers": 1,
            "worker_cfg": {
                "_target_": "agoge.vllm_inference.InferenceWorker",
                "engine_args": {
                    "_target_": "vllm.AsyncEngineArgs",
                    "model": "Qwen/Qwen3-0.6B",
                    "tensor_parallel_size": 1,
                    "distributed_executor_backend": "ray",
                    "dtype": "bfloat16",
                    "generation_config": "vllm",
                },
            },
        }
    },
    "clients": {},
})

# Create the singleton actor
_ = create_inference_manager(models=cfg.models, clients=cfg.clients, num_cpus=4)
manager = get_inference_manager()
```
### Dispatching a chat completion
```python
chat = Chat(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="Say 'Hello!' and nothing else."),
    ]
)

request = InferenceRequest(
    messages=chat,
    model="Qwen/Qwen3-0.6B",
    temperature=0.7,
    max_tokens=64,
)

response = ray.get(manager.create_chat_completion.remote(request))
print(response.choices[0].message.content)

# When finished, release GPU memory
ray.get(manager.shutdown.remote())
ray.shutdown()
```
### Switching to an API model
```python
from agoge.schema.inference_request import InferenceRequest

request = InferenceRequest(
    messages=chat,
    model="gpt-4o-mini",
    temperature=0.7,
    max_tokens=64,
)

response = ray.get(manager.create_chat_completion.remote(request))
```
The manager automatically maps `model="gpt-4o-mini"` to the OpenAI client configured above. Anthropic models follow the same pattern.
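Targeting the Anthropic model from the same configuration only requires changing the `model` string; this assumes the `chat` and `manager` handles from the earlier examples are still in scope.

```python
request = InferenceRequest(
    messages=chat,
    model="claude-sonnet-4-5-20250929",  # routed to the Anthropic client from the config
    temperature=0.7,
    max_tokens=64,
)

response = ray.get(manager.create_chat_completion.remote(request))
```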