CleanRL Worker¶

The CleanRL worker is MOSAIC’s single-agent RL integration. It wraps the CleanRL library. A collection of single-file, research-friendly algorithm implementations behind the standard shim pattern, adding subprocess isolation, FastLane telemetry, curriculum learning, and GUI configuration.

Paradigm	Single-agent (sequential)
Algorithms	40+ (PPO, DQN, SAC, TD3, DDPG, C51, Rainbow, and variants)
Environments	Gymnasium, Atari, MiniGrid, BabyAI, Procgen, MuJoCo, DM Control
Execution	Subprocess (one OS process per training run)
GPU required	No (optional CUDA acceleration)
Source	`3rd_party/workers/cleanrl_worker/cleanrl_worker/`

Architecture¶

        graph TB
    subgraph "MOSAIC GUI"
        FORM["Training Form<br/>(CleanRL widgets)"]
        DAEMON["Trainer Daemon"]
    end

    subgraph "Worker Subprocess"
        CLI["cli.py<br/>entry point"]
        CFG["config.py<br/>CleanRLWorkerConfig"]
        RT["runtime.py<br/>CleanRLWorkerRuntime"]
        FL["fastlane.py<br/>FastLaneTelemetryWrapper"]
        SITE["sitecustomize.py<br/>import-time gym.make patch"]
        LAUNCH["launcher.py<br/>algorithm dispatch"]
    end

    subgraph "Upstream CleanRL"
        ALGO["ppo.py / dqn.py / sac.py / ...<br/>(unmodified single-file scripts)"]
    end

    FORM -->|"config JSON"| DAEMON
    DAEMON -->|"spawn"| CLI
    CLI --> CFG --> RT
    RT --> LAUNCH --> ALGO
    SITE -.->|"patches gym.make()"| ALGO
    FL -.->|"shared-memory frames"| DAEMON

    style FORM fill:#4a90d9,stroke:#2e5a87,color:#fff
    style DAEMON fill:#50c878,stroke:#2e8b57,color:#fff
    style CLI fill:#ff7f50,stroke:#cc5500,color:#fff
    style CFG fill:#ff7f50,stroke:#cc5500,color:#fff
    style RT fill:#ff7f50,stroke:#cc5500,color:#fff
    style FL fill:#ff7f50,stroke:#cc5500,color:#fff
    style SITE fill:#ff7f50,stroke:#cc5500,color:#fff
    style LAUNCH fill:#ff7f50,stroke:#cc5500,color:#fff
    style ALGO fill:#e8e8e8,stroke:#999

Lifecycle of a training run:

The GUI form builds a config JSON and hands it to the Trainer Daemon.
The daemon spawns python -m cleanrl_worker.cli --config <path>.
cli.py loads the config, detects the training mode, and delegates to the appropriate runtime.
CleanRLWorkerRuntime.run() resolves the algorithm module from the registry, prepares the run directory, sets FastLane / W&B / TensorBoard environment variables, and launches the algorithm as a subprocess via cleanrl_worker.launcher.
sitecustomize.py patches gym.make() at import time so every environment is automatically wrapped with FastLaneTelemetryWrapper.
The runtime polls the subprocess, emits heartbeats every 30 seconds, and on completion writes an analytics.json manifest.

Supported Algorithms¶

The DEFAULT_ALGO_REGISTRY in runtime.py maps algorithm names to importable modules. The first entry (ppo) points to the MOSAIC-patched version; all others delegate to upstream CleanRL.

Family	Algorithms	Notes
PPO	`ppo` (MOSAIC-patched), `ppo_continuous_action`, `ppo_atari`, `ppo_atari_multigpu`, `ppo_atari_lstm`, `ppo_atari_envpool`, `ppo_atari_envpool_xla_jax`, `ppo_atari_envpool_xla_jax_scan`, `ppo_procgen`, `ppo_rnd_envpool`, `ppo_pettingzoo_ma_atari`	Primary algorithm family; `ppo` is the default
Policy Optimization Variants	`ppg_procgen`, `pqn`, `pqn_atari_envpool`, `pqn_atari_envpool_lstm`, `rpo_continuous_action`	Phasic Policy Gradient, Periodic Q-Network, Reward-Policy Optimization
Q-Learning	`dqn`, `dqn_atari`, `dqn_jax`, `dqn_atari_jax`, `rainbow_atari`, `qdagger_dqn_atari_impalacnn`, `qdagger_dqn_atari_jax_impalacnn`	Deep Q-Network and extensions
Distributional RL	`c51`, `c51_jax`, `c51_atari`, `c51_atari_jax`	Categorical DQN (C51)
Continuous Control	`ddpg_continuous_action`, `ddpg_continuous_action_jax`, `td3_continuous_action`, `td3_continuous_action_jax`, `sac_continuous_action`, `sac_atari`	DDPG, TD3, and SAC for continuous action spaces

Agent Architectures¶

The worker ships with two built-in neural network architectures used by the MOSAIC-patched PPO and curriculum training modes.

MinigridCNN¶

Defined in agents/minigrid.py. Designed for 7x7x3 partially observable grid-world images (MiniGrid / BabyAI environments).

Input: (B, 7, 7, 3) uint8
  -> permute to (B, 3, 7, 7), normalize to [0, 1]
  -> Conv2d(3, 32, 3, padding=1) + ReLU
  -> Conv2d(32, 64, 3, padding=1) + ReLU
  -> Conv2d(64, 64, 3, padding=1) + ReLU
  -> Flatten
  -> Linear(3136, 128)

MinigridAgent pairs this backbone with separate actor and critic heads (each Linear(128, 128) -> ReLU -> Linear(128, out)), using orthogonal weight initialization.

MLPAgent¶

Defined in agents/mlp.py. Used for flat observation spaces (CartPole, MountainCar, LunarLander, etc.).

Input: (B, obs_dim)
  -> Linear(obs_dim, 64) + Tanh
  -> Linear(64, 64) + Tanh

Separate actor and critic heads branch from the shared trunk. Hidden size is 64 with Tanh activations and orthogonal initialization.

Training Modes¶

Mode	Trigger	Description
Standard training	Default (no special mode flag)	Spawns the algorithm as a subprocess via `cleanrl_worker.launcher`. Full FastLane telemetry, TensorBoard, and W&B support.
Curriculum training	`extras.curriculum_schedule` present	Runs `run_curriculum_training()` with Syllabus-RL environment switching. Environments are swapped according to the schedule without restarting the training loop.
Resume training	`extras.mode == "resume_training"`	Loads a `.cleanrl_model` checkpoint from `extras.checkpoint_path` and continues training for additional timesteps.
Policy evaluation	`extras.mode == "policy_eval"`	In-process batched evaluation using `run_batched_evaluation()`. Supports configurable episodes, gamma, capture video, and repeat.
Interactive	`--interactive` CLI flag	Stdin/stdout JSON-lines IPC protocol. The GUI sends `reset`, `step`, and `stop` commands; the worker responds with observations, rewards, and render frames.
Dry run	`--dry-run` CLI flag	Resolves the algorithm module and validates the config, then exits without launching training. Useful for pre-flight checks.

Curriculum Training¶

Curriculum training uses Syllabus-RL to progressively advance through a sequence of environments. The schedule is a list of stages, each specifying an env_id and an optional stopping condition:

{
  "curriculum_schedule": [
    {"env_id": "BabyAI-GoToRedBallNoDists-v0", "steps": 200000},
    {"env_id": "BabyAI-GoToRedBall-v0", "steps": 200000},
    {"env_id": "BabyAI-GoToObj-v0", "steps": 200000},
    {"env_id": "BabyAI-GoToLocal-v0"}
  ]
}

Stopping conditions per stage: steps>=N, episodes>=N, episode_return>=X. Multiple conditions can be combined with | (OR logic). If no condition is specified, the default is steps>=100000.

The BabyAITaskWrapper (a ReinitTaskWrapper subclass) handles environment switching at runtime. The training loop (PPO) requires no modification, curriculum learning operates entirely at the environment level.

Built-in preset schedules are available in wrappers/curriculum.py:

BABYAI_GOTO_CURRICULUM: four-stage GoTo progression
BABYAI_DOORKEY_CURRICULUM: four-stage DoorKey progression (5x5 to 16x16)

FastLane Telemetry¶

FastLane provides real-time frame streaming from the training subprocess to the MOSAIC GUI via shared memory.

How it works:

sitecustomize.py patches gym.make() at import time.
Every environment created by the training script is automatically wrapped with FastLaneTelemetryWrapper.
On each step(), the wrapper calls env.render() to grab an RGB frame and publishes it through FastLaneWriter.
The GUI reads frames from shared memory and displays them in the training dashboard.

Video modes:

single: only the probe environment (selected by fastlane_slot) emits frames.
grid: multiple environments contribute frames; slot 0 coordinates tiling via _GridCoordinator. The fastlane_grid_limit parameter controls how many environments participate.
off: no frame emission.

Metrics published alongside each frame:

last_reward: reward from the most recent step
rolling_return: exponentially smoothed episode return
step_rate_hz: current training throughput

Tuning parameters (environment variables):

CLEANRL_FASTLANE_INTERVAL_MS: minimum milliseconds between frames (throttling)
CLEANRL_FASTLANE_MAX_DIM: maximum pixel dimension before downscaling

GUI Integration¶

The CleanRL worker provides four dedicated form widgets for experiment configuration, all located in gym_gui/ui/widgets/:

Form	Purpose
`cleanrl_train_form.py`	Primary training dialog. Algorithm and environment selection, hyperparameter tuning (dynamically generated from schema files), FastLane settings, TensorBoard/W&B tracking, GPU toggle.
`cleanrl_script_form.py`	Custom shell script launcher for multi-phase training. Reads `*.sh` scripts and parses inline metadata (`@description`, `@phases`, `@total_timesteps`).
`cleanrl_resume_form.py`	Resume training from a `.cleanrl_model` checkpoint. Auto-discovers checkpoints under `var/trainer/runs/`.
`cleanrl_policy_form.py`	Policy evaluation dialog. Loads a trained checkpoint, configures evaluation episodes, gamma, and optional video capture.

Worker Discovery¶

The worker registers itself via the mosaic.workers entry point group in pyproject.toml:

[project.entry-points."mosaic.workers"]
cleanrl = "cleanrl_worker:get_worker_metadata"

get_worker_metadata() returns a WorkerCapabilities descriptor:

WorkerCapabilities(
    worker_type="cleanrl",
    supported_paradigms=("sequential",),
    env_families=(
        "gymnasium", "atari", "procgen",
        "mujoco", "dm_control", "minigrid", "babyai",
    ),
    action_spaces=("discrete", "continuous"),
    observation_spaces=("vector", "image"),
    max_agents=1,
    supports_checkpointing=True,
    requires_gpu=False,
    estimated_memory_mb=512,
)

Configuration¶

The CleanRLWorkerConfig dataclass (config.py) is the single source of truth for all run parameters:

@dataclass(frozen=True)
class CleanRLWorkerConfig:
    run_id: str                    # ULID-format unique run identifier
    algo: str                      # Algorithm name (e.g. "ppo", "dqn")
    env_id: str                    # Gymnasium environment ID
    total_timesteps: int           # Training budget
    seed: Optional[int] = None
    extras: dict[str, Any] = ...   # All additional config
    worker_id: Optional[str] = None
    raw: dict[str, Any] = ...      # Full raw payload (for debugging)

The config loader accepts two JSON formats:

Nested (GUI): the config lives at metadata.worker.config inside the full job descriptor.
Flat (standalone): the JSON maps directly to CleanRLWorkerConfig fields.

Key extras fields:

mode: "train" (default), "policy_eval", "resume_training", "interactive"
cuda / use_cuda: enable GPU acceleration
tensorboard_dir: relative path for TensorBoard logs
track_wandb: enable Weights & Biases logging
algo_params: dict of algorithm-specific hyperparameters passed as CLI flags to the upstream script
curriculum_schedule: list of stage dicts (triggers curriculum mode)
fastlane_video_mode: "single", "grid", or "off"
policy_path: path to trained model (for eval/resume)