NVIDIA researchers have introduced ProRL Agent, a scalable infrastructure designed for reinforcement learning training of multi-turn LLM agents. The system adopts a “Rollout-as-a-Service” philosophy that decouples agentic rollout orchestration from the training loop, addressing resource conflicts between I/O-intensive environment interactions and GPU-intensive policy updates.
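The "Rollout-as-a-Service" idea can be illustrated with a minimal sketch: a standalone HTTP service that accepts rollout jobs and returns results, fully separate from any training process. This is an assumption-laden illustration, not ProRL Agent's actual API; the `/rollout` route, the `job_id` field, and the stub reward are all hypothetical.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class RolloutHandler(BaseHTTPRequestHandler):
    """Toy stand-in for a rollout service; endpoint and payload are hypothetical."""

    def do_POST(self):
        if self.path != "/rollout":
            self.send_error(404)
            return
        length = int(self.headers["Content-Length"])
        job = json.loads(self.rfile.read(length))
        # A real service would run the INIT/RUN/EVAL pipeline here;
        # this stub just echoes the job id with a placeholder reward.
        body = json.dumps({"job_id": job["job_id"], "reward": 0.0}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Serve on an ephemeral port in a background thread (the "service" side).
server = HTTPServer(("127.0.0.1", 0), RolloutHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "trainer" side submits a rollout job over plain HTTP.
url = f"http://127.0.0.1:{server.server_port}/rollout"
req = urllib.request.Request(
    url,
    data=json.dumps({"job_id": 7}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
```

Because the trainer only speaks HTTP to the service, the rollout side can scale, restart, or change internals without touching the training loop.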
Current frameworks, including SkyRL, VeRL-Tool, Agent Lightning, rLLM, and GEM, embed rollout control directly within the training process. This tight coupling creates conflicting system requirements: rollouts are I/O-bound, dominated by sandbox creation and asynchronous coordination, while training is GPU-intensive, centered on forward/backward passes and gradient synchronization.
ProRL Agent operates as a standalone HTTP service managing the full rollout lifecycle through an asynchronous three-stage pipeline. The INIT stage spins up sandbox containers, the RUN stage drives multi-turn agent loops and collects trajectories, and the EVAL stage scores results against ground truth to produce reward signals. By assigning each stage to independent worker pools, the system allows phases to overlap across different jobs, preventing slow evaluations from stalling the rollout process.
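The staged design above can be sketched with asyncio queues: each stage gets its own worker pool, so jobs from different rollouts overlap across stages rather than proceeding in lockstep. The stage handlers, field names, and pool sizes below are illustrative placeholders, not ProRL Agent's implementation.

```python
import asyncio

# Hypothetical stage handlers; sleeps stand in for real sandbox/agent/eval work.
async def init_stage(job):
    await asyncio.sleep(0.01)                      # e.g. spin up a sandbox container
    return {**job, "sandbox": f"sbx-{job['id']}"}

async def run_stage(job):
    await asyncio.sleep(0.02)                      # e.g. drive the multi-turn agent loop
    return {**job, "trajectory": ["turn-1", "turn-2"]}

async def eval_stage(job):
    await asyncio.sleep(0.01)                      # e.g. score against ground truth
    return {**job, "reward": 1.0}

async def worker(handler, inbox, outbox):
    # One worker in a stage's pool: pull a job, process it, hand off downstream.
    while True:
        job = await inbox.get()
        try:
            await outbox.put(await handler(job))
        finally:
            inbox.task_done()

async def pipeline(jobs, pool_sizes=(2, 4, 2)):
    q_init, q_run, q_eval, done = (asyncio.Queue() for _ in range(4))
    stages = [(init_stage, q_init, q_run),
              (run_stage, q_run, q_eval),
              (eval_stage, q_eval, done)]
    # Independent worker pools per stage let stages overlap across jobs.
    workers = [asyncio.create_task(worker(h, i, o))
               for (h, i, o), n in zip(stages, pool_sizes)
               for _ in range(n)]
    for job in jobs:
        q_init.put_nowait(job)
    for q in (q_init, q_run, q_eval):
        await q.join()                             # all jobs have cleared this stage
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return [done.get_nowait() for _ in range(done.qsize())]

results = asyncio.run(pipeline([{"id": i} for i in range(6)]))
```

With separate pools, a slow EVAL worker only occupies its own stage's capacity; INIT and RUN workers keep starting and advancing other jobs in the meantime.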
The infrastructure uses Singularity for sandbox deployment, enabling the rootless execution required on shared HPC clusters managed by Slurm. Several optimizations reduce tool-execution latency: a streamlined Bash tool replaces tmux-based terminal multiplexing with direct ptyprocess pseudo-terminals, cutting shell-command latency from 0.78 seconds to 0.42 seconds. Direct IPython API connections to persistent kernels eliminate networking overhead, while Unix domain sockets replace TCP loopback for communication between agents and execution servers.
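The gain from dropping the terminal-multiplexer layer comes from talking to a pseudo-terminal directly. A minimal sketch of that pattern, using only the Python standard library's `pty` module rather than the ptyprocess package the article mentions (the function name and timeout are illustrative):

```python
import os
import pty
import select
import subprocess
import time

def run_in_pty(command: str, timeout: float = 5.0) -> str:
    """Run a shell command attached directly to a pseudo-terminal.

    Sketch of the direct-pty approach (no tmux session in between);
    not ProRL Agent's actual Bash tool.
    """
    master, slave = pty.openpty()
    proc = subprocess.Popen(
        ["bash", "-c", command],
        stdin=slave, stdout=slave, stderr=slave,
        close_fds=True,
    )
    os.close(slave)  # parent keeps only the master end
    chunks = []
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        ready, _, _ = select.select([master], [], [], 0.1)
        if ready:
            try:
                data = os.read(master, 4096)
            except OSError:      # pty raises EIO once bash exits
                break
            if not data:
                break
            chunks.append(data.decode(errors="replace"))
        elif proc.poll() is not None:
            break                # process exited and nothing left to read
    proc.wait()
    os.close(master)
    return "".join(chunks)

out = run_in_pty("echo rollout-ok")
```

Every tmux round-trip (attach, send-keys, capture-pane) that this avoids is pure overhead removed from each tool call, which is where latency reductions like 0.78 s to 0.42 s come from.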
Multi-turn agent tasks involve interacting with external environments, such as code repositories or operating systems, through iterative tool use. The decoupled architecture makes it easier to migrate to different training backends or to support new runtime environments without re-implementing the execution pipeline, removing the maintenance barriers created when rollout logic is embedded directly in trainers.
