CleanRL Worker¶
The CleanRL worker is MOSAIC’s single-agent RL integration. It wraps the CleanRL library, a collection of single-file, research-friendly algorithm implementations, behind the standard shim pattern, adding subprocess isolation, FastLane telemetry, curriculum learning, and GUI configuration.
| Paradigm | Single-agent (sequential) |
| Algorithms | 40+ (PPO, DQN, SAC, TD3, DDPG, C51, Rainbow, and variants) |
| Environments | Gymnasium, Atari, MiniGrid, BabyAI, Procgen, MuJoCo, DM Control |
| Execution | Subprocess (one OS process per training run) |
| GPU required | No (optional CUDA acceleration) |
| Source | |
Architecture¶
graph TB
subgraph "MOSAIC GUI"
FORM["Training Form<br/>(CleanRL widgets)"]
DAEMON["Trainer Daemon"]
end
subgraph "Worker Subprocess"
CLI["cli.py<br/>entry point"]
CFG["config.py<br/>CleanRLWorkerConfig"]
RT["runtime.py<br/>CleanRLWorkerRuntime"]
FL["fastlane.py<br/>FastLaneTelemetryWrapper"]
SITE["sitecustomize.py<br/>import-time gym.make patch"]
LAUNCH["launcher.py<br/>algorithm dispatch"]
end
subgraph "Upstream CleanRL"
ALGO["ppo.py / dqn.py / sac.py / ...<br/>(unmodified single-file scripts)"]
end
FORM -->|"config JSON"| DAEMON
DAEMON -->|"spawn"| CLI
CLI --> CFG --> RT
RT --> LAUNCH --> ALGO
SITE -.->|"patches gym.make()"| ALGO
FL -.->|"shared-memory frames"| DAEMON
style FORM fill:#4a90d9,stroke:#2e5a87,color:#fff
style DAEMON fill:#50c878,stroke:#2e8b57,color:#fff
style CLI fill:#ff7f50,stroke:#cc5500,color:#fff
style CFG fill:#ff7f50,stroke:#cc5500,color:#fff
style RT fill:#ff7f50,stroke:#cc5500,color:#fff
style FL fill:#ff7f50,stroke:#cc5500,color:#fff
style SITE fill:#ff7f50,stroke:#cc5500,color:#fff
style LAUNCH fill:#ff7f50,stroke:#cc5500,color:#fff
style ALGO fill:#e8e8e8,stroke:#999
Lifecycle of a training run:
1. The GUI form builds a config JSON and hands it to the Trainer Daemon.
2. The daemon spawns python -m cleanrl_worker.cli --config <path>.
3. cli.py loads the config, detects the training mode, and delegates to the appropriate runtime.
4. CleanRLWorkerRuntime.run() resolves the algorithm module from the registry, prepares the run directory, sets FastLane / W&B / TensorBoard environment variables, and launches the algorithm as a subprocess via cleanrl_worker.launcher.
5. sitecustomize.py patches gym.make() at import time so every environment is automatically wrapped with FastLaneTelemetryWrapper.
6. The runtime polls the subprocess, emits heartbeats every 30 seconds, and on completion writes an analytics.json manifest.
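The poll-and-heartbeat step of the lifecycle can be illustrated with a minimal sketch. The function name `run_with_heartbeats`, its parameters, and the callback are hypothetical and are not taken from `runtime.py`; only the 30-second default interval comes from the description above.

```python
import subprocess
import sys
import time


def run_with_heartbeats(cmd, heartbeat_s=30.0, poll_s=0.5, on_heartbeat=print):
    """Run cmd as a child process, emitting a heartbeat while it is alive.

    Hypothetical helper: the real runtime may structure this differently.
    Returns the child's exit code.
    """
    proc = subprocess.Popen(cmd)
    last_beat = time.monotonic()
    while True:
        code = proc.poll()  # None while the child is still running
        if code is not None:
            return code
        now = time.monotonic()
        if now - last_beat >= heartbeat_s:
            on_heartbeat(f"heartbeat pid={proc.pid}")
            last_beat = now
        time.sleep(poll_s)


if __name__ == "__main__":
    # Stand-in child process; a real run would launch the algorithm script.
    exit_code = run_with_heartbeats(
        [sys.executable, "-c", "import time; time.sleep(1)"],
        heartbeat_s=0.3,
        poll_s=0.05,
    )
    print("exit code:", exit_code)
```

Using `time.monotonic()` rather than wall-clock time keeps the heartbeat interval stable if the system clock is adjusted during a long run.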
Supported Algorithms¶
The DEFAULT_ALGO_REGISTRY in runtime.py maps algorithm names to
importable modules. The first entry (ppo) points to the
MOSAIC-patched version; all others delegate to upstream CleanRL.
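As a rough illustration of the name-to-module lookup, the sketch below resolves an algorithm name through a registry dict with `importlib`. The `DEMO_REGISTRY` contents and the `resolve_algo` helper are assumptions made for this example: standard-library modules stand in for the real CleanRL modules so the snippet runs without CleanRL installed.

```python
import importlib
from types import ModuleType

# Stand-in registry for illustration only. The real DEFAULT_ALGO_REGISTRY
# lives in runtime.py; its module paths are not reproduced here.
DEMO_REGISTRY: dict[str, str] = {
    "ppo": "json",  # placeholder for the MOSAIC-patched PPO module
    "dqn": "csv",   # placeholder for an upstream CleanRL module
}


def resolve_algo(name: str, registry: dict[str, str]) -> ModuleType:
    """Import and return the module registered under name (hypothetical helper)."""
    try:
        module_path = registry[name]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise ValueError(f"unknown algorithm {name!r}; known: {known}") from None
    return importlib.import_module(module_path)


module = resolve_algo("ppo", DEMO_REGISTRY)
print(module.__name__)
```

Failing early with the list of known names is also what a dry run would need in order to reject a misspelled algorithm before any training starts.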
| Family | Algorithms | Notes |
|---|---|---|
| PPO | | Primary algorithm family; |
| Policy Optimization Variants | | Phasic Policy Gradient, Parallel Q-Network, Robust Policy Optimization |
| Q-Learning | | Deep Q-Network and extensions |
| Distributional RL | | Categorical DQN (C51) |
| Continuous Control | | DDPG, TD3, and SAC for continuous action spaces |
Agent Architectures¶
The worker ships with two built-in neural network architectures used by the MOSAIC-patched PPO and curriculum training modes.
MinigridCNN¶
Defined in agents/minigrid.py. Designed for 7x7x3 partially
observable grid-world images (MiniGrid / BabyAI environments).
Input: (B, 7, 7, 3) uint8
-> permute to (B, 3, 7, 7), normalize to [0, 1]
-> Conv2d(3, 32, 3, padding=1) + ReLU
-> Conv2d(32, 64, 3, padding=1) + ReLU
-> Conv2d(64, 64, 3, padding=1) + ReLU
-> Flatten
-> Linear(3136, 128)
MinigridAgent pairs this backbone with separate actor and critic
heads (each Linear(128, 128) -> ReLU -> Linear(128, out)), using
orthogonal weight initialization.
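The `Linear(3136, 128)` input size follows from the convolution arithmetic: each 3x3 convolution with padding 1 preserves the 7x7 spatial size, so the flattened feature vector has 64 x 7 x 7 = 3136 elements. The helper names below are illustrative and are not part of `agents/minigrid.py`.

```python
def conv2d_out(size: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Output length of one spatial dimension after a Conv2d layer."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(height: int, width: int, layers) -> int:
    """Flattened size after a conv stack; layers holds (out_channels, kernel, padding)."""
    channels = 0
    for out_channels, kernel, padding in layers:
        height = conv2d_out(height, kernel, padding)
        width = conv2d_out(width, kernel, padding)
        channels = out_channels
    return channels * height * width


# The three conv layers listed above: 3 -> 32 -> 64 -> 64 channels.
MINIGRID_LAYERS = [(32, 3, 1), (64, 3, 1), (64, 3, 1)]

print(flattened_features(7, 7, MINIGRID_LAYERS))  # 3136
```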
MLPAgent¶
Defined in agents/mlp.py. Used for flat observation spaces
(CartPole, MountainCar, LunarLander, etc.).
Input: (B, obs_dim)
-> Linear(obs_dim, 64) + Tanh
-> Linear(64, 64) + Tanh
Separate actor and critic heads branch from the shared trunk. Hidden size is 64 with Tanh activations and orthogonal initialization.
Training Modes¶
| Mode | Trigger | Description |
|---|---|---|
| Standard training | Default (no special mode flag) | Spawns the algorithm as a subprocess via |
| Curriculum training | | Runs |
| Resume training | | Loads a |
| Policy evaluation | | In-process batched evaluation using |
| Interactive | | Stdin/stdout JSON-lines IPC protocol. The GUI sends |
| Dry run | | Resolves the algorithm module and validates the config, then exits without launching training. Useful for pre-flight checks. |
Curriculum Training¶
Curriculum training uses Syllabus-RL
to progressively advance through a sequence of environments. The
schedule is a list of stages, each specifying an env_id and an
optional stopping condition:
{
  "curriculum_schedule": [
    {"env_id": "BabyAI-GoToRedBallNoDists-v0", "steps": 200000},
    {"env_id": "BabyAI-GoToRedBall-v0", "steps": 200000},
    {"env_id": "BabyAI-GoToObj-v0", "steps": 200000},
    {"env_id": "BabyAI-GoToLocal-v0"}
  ]
}
Stopping conditions per stage: steps>=N, episodes>=N,
episode_return>=X. Multiple conditions can be combined with |
(OR logic). If no condition is specified, the default is
steps>=100000.
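To make the OR semantics concrete, here is a minimal sketch of how a condition string such as `steps>=200000|episode_return>=0.9` could be parsed and evaluated. The string syntax, the function names, and the stats dict are assumptions for illustration; the worker's actual parser may differ.

```python
DEFAULT_CONDITION = "steps>=100000"  # default stated in the text above
KNOWN_METRICS = {"steps", "episodes", "episode_return"}


def parse_stop_condition(spec=None):
    """Parse 'metric>=value|metric>=value' into a list of (metric, threshold)."""
    clauses = []
    for part in (spec or DEFAULT_CONDITION).split("|"):
        metric, sep, value = part.strip().partition(">=")
        metric = metric.strip()
        if not sep or metric not in KNOWN_METRICS:
            raise ValueError(f"invalid stopping condition: {part!r}")
        clauses.append((metric, float(value)))
    return clauses


def should_advance(clauses, stats) -> bool:
    """OR logic: the stage ends as soon as any single clause is satisfied."""
    return any(
        stats.get(metric, float("-inf")) >= threshold
        for metric, threshold in clauses
    )


clauses = parse_stop_condition("steps>=200000|episode_return>=0.9")
print(should_advance(clauses, {"steps": 50_000, "episode_return": 0.95}))  # True
```

A metric that is missing from the stats dict is treated as negative infinity, so it never satisfies a clause by accident.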
The BabyAITaskWrapper (a ReinitTaskWrapper subclass) handles
environment switching at runtime. The training loop (PPO) requires
no modification; curriculum learning operates entirely at the
environment level.
Built-in preset schedules are available in wrappers/curriculum.py:
- BABYAI_GOTO_CURRICULUM: four-stage GoTo progression
- BABYAI_DOORKEY_CURRICULUM: four-stage DoorKey progression (5x5 to 16x16)
FastLane Telemetry¶
FastLane provides real-time frame streaming from the training subprocess to the MOSAIC GUI via shared memory.
How it works:
1. sitecustomize.py patches gym.make() at import time.
2. Every environment created by the training script is automatically wrapped with FastLaneTelemetryWrapper.
3. On each step(), the wrapper calls env.render() to grab an RGB frame and publishes it through FastLaneWriter.
4. The GUI reads frames from shared memory and displays them in the training dashboard.
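The patching mechanism described above can be sketched without Gymnasium installed. Everything here is a stand-in: `TelemetryWrapper` imitates the role of `FastLaneTelemetryWrapper`, `fake_gym` imitates the `gym` module, and a plain list replaces the shared-memory writer. It is a sketch of the technique, not the code in `sitecustomize.py`.

```python
import functools
import types


class TelemetryWrapper:
    """Stand-in wrapper: publishes a rendered frame after every step."""

    def __init__(self, env, publish):
        self.env = env
        self._publish = publish

    def step(self, action):
        result = self.env.step(action)
        self._publish(self.env.render())
        return result

    def __getattr__(self, name):
        # Delegate everything else (reset, close, spaces, ...) to the env.
        return getattr(self.env, name)


def patch_make(module, publish):
    """Replace module.make so every created env is wrapped exactly once."""
    original_make = module.make
    if getattr(original_make, "_telemetry_patched", False):
        return  # already patched; avoid double wrapping

    @functools.wraps(original_make)
    def make(*args, **kwargs):
        return TelemetryWrapper(original_make(*args, **kwargs), publish)

    make._telemetry_patched = True
    module.make = make


class DummyEnv:
    def step(self, action):
        return ("obs", 1.0, False, False, {})

    def render(self):
        return "frame"


fake_gym = types.SimpleNamespace(make=lambda env_id: DummyEnv())
frames = []
patch_make(fake_gym, frames.append)

env = fake_gym.make("CartPole-v1")  # the training script calls make() as usual
env.step(0)
print(frames)  # ['frame']
```

Because the patch is applied to the factory function, the upstream training script needs no changes, which matches the "unmodified single-file scripts" goal in the architecture diagram.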
Video modes:
- single: only the probe environment (selected by fastlane_slot) emits frames.
- grid: multiple environments contribute frames; slot 0 coordinates tiling via _GridCoordinator. The fastlane_grid_limit parameter controls how many environments participate.
- off: no frame emission.
Metrics published alongside each frame:
- last_reward: reward from the most recent step
- rolling_return: exponentially smoothed episode return
- step_rate_hz: current training throughput
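An exponentially smoothed return like `rolling_return` can be computed with a one-line update. The class name and the smoothing factor `alpha` below are assumptions; the value the worker actually uses is not stated here.

```python
class RollingReturn:
    """Exponential moving average of episode returns (illustrative sketch)."""

    def __init__(self, alpha: float = 0.1):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = None  # no episodes seen yet

    def update(self, episode_return: float) -> float:
        if self.value is None:
            # Seed with the first return so the average does not start at zero.
            self.value = float(episode_return)
        else:
            self.value += self.alpha * (episode_return - self.value)
        return self.value


tracker = RollingReturn(alpha=0.5)
print(tracker.update(10.0))  # 10.0
print(tracker.update(20.0))  # 15.0
```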
Tuning parameters (environment variables):
- CLEANRL_FASTLANE_INTERVAL_MS: minimum milliseconds between frames (throttling)
- CLEANRL_FASTLANE_MAX_DIM: maximum pixel dimension before downscaling
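The two tuning parameters can be pictured as a time-based throttle plus a bounding-box resize. The sketch below is an assumption-laden illustration: the fallback defaults (33 ms, 640 px), the helper names, and the aspect-ratio-preserving resize are choices made for this example rather than documented behaviour.

```python
import os
import time


def read_int_env(name: str, default: int) -> int:
    """Read an integer environment variable, falling back on bad or missing values."""
    try:
        return int(os.environ[name])
    except (KeyError, ValueError):
        return default


class FrameThrottle:
    """Allow at most one frame per interval_ms (hypothetical helper)."""

    def __init__(self, interval_ms: int, clock=time.monotonic):
        self._interval_s = max(interval_ms, 0) / 1000.0
        self._clock = clock
        self._last = None

    def ready(self) -> bool:
        now = self._clock()
        if self._last is None or now - self._last >= self._interval_s:
            self._last = now
            return True
        return False


def fit_within(width: int, height: int, max_dim: int):
    """Scale (width, height) so the longest side is at most max_dim."""
    longest = max(width, height)
    if max_dim <= 0 or longest <= max_dim:
        return width, height
    scale = max_dim / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


# Assumed fallback values, used only when the variables are unset.
interval_ms = read_int_env("CLEANRL_FASTLANE_INTERVAL_MS", 33)
max_dim = read_int_env("CLEANRL_FASTLANE_MAX_DIM", 640)
throttle = FrameThrottle(interval_ms)
print(throttle.ready(), fit_within(1280, 720, 640))
```

Injecting the clock makes the throttle easy to test deterministically without sleeping.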
GUI Integration¶
The CleanRL worker provides four dedicated form widgets for experiment
configuration, all located in gym_gui/ui/widgets/:
| Form | Purpose |
|---|---|
| | Primary training dialog. Algorithm and environment selection, hyperparameter tuning (dynamically generated from schema files), FastLane settings, TensorBoard/W&B tracking, GPU toggle. |
| | Custom shell script launcher for multi-phase training. Reads |
| | Resume training from a |
| | Policy evaluation dialog. Loads a trained checkpoint, configures evaluation episodes, gamma, and optional video capture. |
Worker Discovery¶
The worker registers itself via the mosaic.workers entry point
group in pyproject.toml:
[project.entry-points."mosaic.workers"]
cleanrl = "cleanrl_worker:get_worker_metadata"
get_worker_metadata() returns a WorkerCapabilities descriptor:
WorkerCapabilities(
    worker_type="cleanrl",
    supported_paradigms=("sequential",),
    env_families=(
        "gymnasium", "atari", "procgen",
        "mujoco", "dm_control", "minigrid", "babyai",
    ),
    action_spaces=("discrete", "continuous"),
    observation_spaces=("vector", "image"),
    max_agents=1,
    supports_checkpointing=True,
    requires_gpu=False,
    estimated_memory_mb=512,
)
Configuration¶
The CleanRLWorkerConfig dataclass (config.py) is the single
source of truth for all run parameters:
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True)
class CleanRLWorkerConfig:
    run_id: str                   # ULID-format unique run identifier
    algo: str                     # Algorithm name (e.g. "ppo", "dqn")
    env_id: str                   # Gymnasium environment ID
    total_timesteps: int          # Training budget
    seed: Optional[int] = None
    extras: dict[str, Any] = ...  # All additional config
    worker_id: Optional[str] = None
    raw: dict[str, Any] = ...     # Full raw payload (for debugging)
The config loader accepts two JSON formats:
- Nested (GUI): the config lives at metadata.worker.config inside the full job descriptor.
- Flat (standalone): the JSON maps directly to CleanRLWorkerConfig fields.
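Handling both layouts comes down to checking for the nested path first and falling back to the top-level object. The following sketch assumes the four non-default dataclass fields are required; the function name and error handling are illustrative, not the actual loader in `config.py`.

```python
import json
from typing import Any

REQUIRED_FIELDS = ("run_id", "algo", "env_id", "total_timesteps")


def extract_worker_config(payload: dict[str, Any]) -> dict[str, Any]:
    """Return the worker config dict from a nested or flat payload."""
    node: Any = payload
    for key in ("metadata", "worker", "config"):
        if not isinstance(node, dict) or key not in node:
            node = None
            break
        node = node[key]
    # Nested (GUI) format if the full path exists, otherwise treat as flat.
    config = node if isinstance(node, dict) else payload
    missing = [name for name in REQUIRED_FIELDS if name not in config]
    if missing:
        raise ValueError(f"config is missing required fields: {missing}")
    return config


flat = json.loads(
    '{"run_id": "demo-run", "algo": "ppo", "env_id": "CartPole-v1", "total_timesteps": 1000}'
)
nested = {"metadata": {"worker": {"config": flat}}}
print(extract_worker_config(nested) == extract_worker_config(flat))  # True
```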
Key extras fields:
- mode: "train" (default), "policy_eval", "resume_training", "interactive"
- cuda / use_cuda: enable GPU acceleration
- tensorboard_dir: relative path for TensorBoard logs
- track_wandb: enable Weights & Biases logging
- algo_params: dict of algorithm-specific hyperparameters passed as CLI flags to the upstream script
- curriculum_schedule: list of stage dicts (triggers curriculum mode)
- fastlane_video_mode: "single", "grid", or "off"
- policy_path: path to trained model (for eval/resume)
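One way to picture how these extras select a training mode is shown below. The precedence (an explicit non-default mode wins, and a non-empty `curriculum_schedule` upgrades plain training to curriculum training) and the `"curriculum"` label are assumptions made for this sketch; the actual detection logic in `cli.py` may differ.

```python
from typing import Any

KNOWN_MODES = {"train", "policy_eval", "resume_training", "interactive"}


def detect_mode(extras: dict[str, Any]) -> str:
    """Pick a training mode from the extras dict (hypothetical helper)."""
    mode = extras.get("mode", "train")
    if mode not in KNOWN_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    # Assumed rule: a non-empty schedule turns default training into curriculum mode.
    if mode == "train" and extras.get("curriculum_schedule"):
        return "curriculum"
    return mode


print(detect_mode({}))  # train
print(detect_mode({"curriculum_schedule": [{"env_id": "BabyAI-GoToLocal-v0"}]}))  # curriculum
print(detect_mode({"mode": "policy_eval", "policy_path": "model.pt"}))  # policy_eval
```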