Async Computer-Use RL — Design Overview#

This document describes the minimal viable architecture for a scalable, asynchronous reinforcement learning (RL) system designed for computer-using agents that interact with graphical user interfaces (GUIs), command-line interfaces (CLIs), and web applications.


Core Schemas#

See the Reference Schema for detailed definitions of Chat, TimeStep, and Trajectory.

Chat#

A Chat represents a single LLM interaction and consists of a list[ChatMessage]. For training purposes, chats must include logprobs (log probabilities) to enable gradient computation.
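
A rough sketch of the shape this implies (field names here are assumptions; the Reference Schema is authoritative):

from dataclasses import dataclass

@dataclass
class ChatMessage:
    role: str      # e.g., "system", "user", "assistant", "tool"
    content: str   # text, a serialized tool call, or an observation payload

@dataclass
class Chat:
    messages: list[ChatMessage]
    logprobs: list[float] | None = None   # per-token log probabilities, required for training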

TimeStep#

Each TimeStep consists of one or more Chat objects. Multiple chats per timestep are necessary because a single interaction with the environment might require multiple LLM calls (e.g., first generating a summary, then requesting an action).
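
Continuing the sketch (the action, reward, and done fields are assumptions that mirror the environment interface described below):

@dataclass
class TimeStep:
    chats: list[Chat]      # one or more LLM calls made during this step (e.g., summary, then action)
    action: ChatMessage    # the action message sent to the environment
    reward: float = 0.0
    done: bool = False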

Trajectory#

A Trajectory represents a complete episode and consists of a list of TimeStep objects. When a trajectory is complete (i.e., the last TimeStep.done == True), it is sent to the trainer for processing.
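
In the same illustrative style:

@dataclass
class Trajectory:
    timesteps: list[TimeStep]

    @property
    def done(self) -> bool:
        # Complete once the last timestep is terminal.
        return bool(self.timesteps) and self.timesteps[-1].done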


System Architecture#

Runners (Ray Remotes)#

Multiple Runners operate asynchronously across the Ray cluster, each managing one agent-environment pair. Each runner orchestrates the interaction loop between its agent and environment, producing complete trajectories. The runner is the only Ray actor in this interaction path: the agent and environment live inside the runner process, so per-step communication happens in-process with minimal overhead.

import ray
from ray.util import ActorPool

@ray.remote(num_cpus=1)
class Runner:
    def __init__(self, env_cfg, agent_cfg, inference_mngr):
        # Build the env and agent in-process (constructors illustrative); per-step calls stay local.
        self.env = Environment(env_cfg)
        self.agent = Agent(agent_cfg, inference_mngr)

    def play_episode(self, reset_kwargs: dict) -> Trajectory:
        # Reset the environment, then loop: agent.step() → env.step() → record timestep.
        # Returns the complete trajectory once the episode is done.
        ...

# Multiple runners managed by an ActorPool for parallel execution.
runners = [Runner.remote(env_cfg, agent_cfg, inference_mngr) for _ in range(num_runners)]
pool = ActorPool(runners)
for traj in pool.map_unordered(
        lambda runner, kwargs: runner.play_episode.remote(kwargs),
        [reset_kwargs] * num_episodes):
    traj_in_queue.put(traj)   # push each completed Trajectory to the trainer

Environment#

Each environment provides (see the sketch after this list):

  • Observation: Current state (screenshots, text, etc.)
  • Step(action): Executes action and returns (reward, done, info)
    • action is a single ChatMessage, containing a tool call
  • Reset(): Resets to initial state
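
A minimal interface sketch; the class and method names are illustrative, not prescriptive:

class Environment:
    def observation(self):
        # Current state: screenshots, text, etc.
        ...

    def step(self, action: ChatMessage) -> tuple[float, bool, dict]:
        # Execute the tool call carried by `action` and return (reward, done, info).
        ...

    def reset(self):
        # Restore the environment to its initial state.
        ...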

Agent#

The agent is stateless and uses the InferenceManager for LLM access (see the sketch after this list):

  • Step(observation, inference_manager): Returns (list[Chat], ChatMessage)
    • list[Chat]: Training data with logprobs
    • ChatMessage: Action to send to environment
  • Should interact with the LLM through self.inference_mngr
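
A minimal sketch under those assumptions (whether the inference manager is passed per call, as in the signature above, or stored as self.inference_mngr is an implementation detail; names are illustrative):

class Agent:
    def __init__(self, agent_cfg, inference_mngr):
        self.inference_mngr = inference_mngr   # all LLM access goes through the inference manager

    def step(self, observation) -> tuple[list[Chat], ChatMessage]:
        # Query the LLM via self.inference_mngr, possibly more than once (e.g., summarize the
        # observation, then request an action); keep every logprob-bearing Chat for training
        # and return the chats together with the action ChatMessage.
        ...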

Inference Manager + Workers#

See here

Trainer#

  • Consumes completed trajectories asynchronously
  • Computes gradients and updates model weights
  • Pushes new weights to the inference manager
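
A hypothetical consumption loop, assuming traj_in_queue is a shared queue (e.g., ray.util.queue.Queue) and that update_policy and push_weights are provided elsewhere:

def train_loop(policy, traj_in_queue, inference_mngr):
    while True:
        traj: Trajectory = traj_in_queue.get()      # blocks until a complete trajectory arrives
        new_weights = update_policy(policy, traj)   # gradients from the logprob-bearing chats
        inference_mngr.push_weights(new_weights)    # inference workers pick up the refreshed weights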