Tianshou Worker

The Tianshou worker integrates the Tianshou deep reinforcement learning platform into MOSAIC using the standard shim pattern. Tianshou (v2.0) is a modular, type-safe PyTorch framework that cleanly separates the Algorithm and Policy abstractions and supports online (on- and off-policy), offline, and imitation learning.

Paradigm	Single-agent (sequential)
Algorithms	PPO, DQN (integrated); 30+ available upstream (SAC, TD3, DDPG, A2C, TRPO, C51, Rainbow, IQN, FQF, BCQ, CQL, GAIL, ICM, and more)
Environments	Gymnasium, Atari, MuJoCo, Classic Control, Box2D, MiniGrid, Toy Text
Execution	Subprocess (one OS process per training run)
GPU required	No (optional CUDA acceleration)
Upstream version	2.0.0 (integrated as git submodule)
Source	`3rd_party/workers/tianshou_worker/tianshou_worker/`

Note

Early integration. The Tianshou worker currently has PPO and DQN wired end-to-end. The remaining algorithms from Tianshou’s catalog are available in the submodule but have not yet been connected to the MOSAIC launcher and GUI forms. See Current Limitations for details.

About Tianshou

Tianshou (meaning “divinely ordained” in Chinese) is developed by Tsinghua University and the appliedAI Institute. Version 2.0 is a complete overhaul that introduces:

Clear separation between Algorithm (learning logic) and Policy (action selection), replacing the monolithic BasePolicy of v1.
Renamed, more intuitive parameters (e.g. n_step_return_horizon instead of n_step).
Type-level separation between on-policy, off-policy, and offline algorithms in the class hierarchy.
High-level ExperimentBuilder API for declarative experiment setup alongside the low-level procedural API for maximum control.
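
To make the Algorithm/Policy split concrete, here is a toy sketch in plain Python. It deliberately does not use Tianshou's classes: `EpsilonGreedyPolicy` and `TabularQLearning` are hypothetical names invented for this illustration. The policy object only selects actions, while the algorithm object owns the update rule and holds a reference to the policy.

```python
import random
from dataclasses import dataclass, field


@dataclass
class EpsilonGreedyPolicy:
    """Action selection only (hypothetical toy class, not a Tianshou type)."""

    n_actions: int
    eps: float = 0.1
    q_table: dict = field(default_factory=dict)  # state -> list of Q-values

    def q_values(self, state):
        # Lazily create a zero-initialised row for unseen states.
        return self.q_table.setdefault(state, [0.0] * self.n_actions)

    def act(self, state, rng=random):
        # Explore with probability eps, otherwise pick the greedy action.
        if rng.random() < self.eps:
            return rng.randrange(self.n_actions)
        q = self.q_values(state)
        return q.index(max(q))


@dataclass
class TabularQLearning:
    """Learning logic only (hypothetical toy class, not a Tianshou type)."""

    policy: EpsilonGreedyPolicy
    lr: float = 0.5
    gamma: float = 0.99

    def update(self, state, action, reward, next_state, done):
        # One-step Q-learning target; bootstrap only on non-terminal steps.
        q = self.policy.q_values(state)
        target = reward
        if not done:
            target += self.gamma * max(self.policy.q_values(next_state))
        q[action] += self.lr * (target - q[action])
        return q[action]
```

Swapping the update rule means replacing only the algorithm class; the policy's `act` method is untouched, which is the design benefit the split is meant to provide.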

Tianshou’s full algorithm catalog includes:

| Family | Algorithms | Notes |
| --- | --- | --- |
| Q-Learning | DQN, Double DQN, Dueling DQN, Branching DQN, C51, Rainbow, QRDQN, IQN, FQF | Discrete action spaces |
| Policy Gradient | PG (REINFORCE), NPG, A2C, TRPO, PPO | On-policy, discrete and continuous |
| Continuous Control | DDPG, TD3, SAC, REDQ, Discrete SAC | Off-policy actor-critic |
| Offline RL | BCQ, CQL, TD3+BC, CRR, Discrete BCQ/CQL/CRR | Learning from static datasets |
| Imitation Learning | IL (vanilla), GAIL | Learning from demonstrations |
| Exploration | ICM, PER, HER, PSRL | Curiosity, prioritized replay, hindsight |

Architecture

graph TB
    subgraph "MOSAIC GUI"
        FORM["Training Form<br/>(Tianshou widgets)"]
        DAEMON["Trainer Daemon"]
    end

    subgraph "Worker Subprocess"
        CLI["cli.py<br/>entry point"]
        CFG["config.py<br/>TianshouWorkerConfig"]
        RT["runtime.py<br/>TianshouWorkerRuntime"]
        LAUNCH["launcher.py<br/>algorithm dispatch"]
    end

    subgraph "Upstream Tianshou (v2.0)"
        ALGO["PPO / DQN / ...<br/>Algorithm + Policy + Collector + Trainer"]
    end

    FORM -->|"config JSON"| DAEMON
    DAEMON -->|"spawn"| CLI
    CLI --> CFG --> RT
    RT -->|"subprocess"| LAUNCH
    LAUNCH --> ALGO

    style FORM fill:#4a90d9,stroke:#2e5a87,color:#fff
    style DAEMON fill:#50c878,stroke:#2e8b57,color:#fff
    style CLI fill:#ff7f50,stroke:#cc5500,color:#fff
    style CFG fill:#ff7f50,stroke:#cc5500,color:#fff
    style RT fill:#ff7f50,stroke:#cc5500,color:#fff
    style LAUNCH fill:#ff7f50,stroke:#cc5500,color:#fff
    style ALGO fill:#e8e8e8,stroke:#999

Lifecycle of a training run:

The GUI form (TianshouTrainForm) builds a TianshouWorkerConfig and hands it to the Trainer Daemon as JSON.
The daemon spawns python -m tianshou_worker.launcher --config-file <path>.
launcher.py loads the config, looks up the algorithm in ALGO_MAP, and calls the corresponding runner function (run_ppo or run_dqn).
The runner creates a SubprocVectorEnv, builds the Tianshou 2.0 component stack (Net -> Actor/Critic -> Policy -> Algorithm), sets up a Collector with a VectorReplayBuffer, and starts training via algorithm.run_training().
TensorBoard metrics are written to var/trainer/runs/{run_id}/.
Throughout, the FastLane environment variables are those that runtime.py set before spawning the subprocess (see FastLane Telemetry).
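
The launcher's load-and-dispatch step can be sketched as follows. This is a minimal illustration rather than the worker's actual launcher.py: the runner bodies are placeholders, and the `dispatch` and `main` helper names are assumptions. Only `ALGO_MAP`, `run_ppo`, `run_dqn`, and the `--config-file` flag come from the description above.

```python
import argparse
import json
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional


def run_ppo(config: Dict[str, Any]) -> str:
    # Placeholder: the real runner builds the Tianshou stack and trains.
    return f"ppo:{config['env_id']}"


def run_dqn(config: Dict[str, Any]) -> str:
    # Placeholder, as above.
    return f"dqn:{config['env_id']}"


# Static mapping from algorithm name to runner function.
ALGO_MAP: Dict[str, Callable[[Dict[str, Any]], str]] = {
    "ppo": run_ppo,
    "dqn": run_dqn,
}


def dispatch(config: Dict[str, Any]) -> str:
    """Look up the runner for config['algo'] and call it (hypothetical helper)."""
    algo = str(config.get("algo", "")).lower()
    if algo not in ALGO_MAP:
        raise ValueError(
            f"Unsupported algorithm {algo!r}; expected one of {sorted(ALGO_MAP)}"
        )
    return ALGO_MAP[algo](config)


def main(argv: Optional[List[str]] = None) -> str:
    """Parse --config-file, load the JSON config, and dispatch (hypothetical)."""
    parser = argparse.ArgumentParser(prog="tianshou_worker.launcher")
    parser.add_argument("--config-file", type=Path, required=True)
    args = parser.parse_args(argv)
    config = json.loads(args.config_file.read_text(encoding="utf-8"))
    return dispatch(config)
```

Failing fast on an unknown algorithm name keeps a misconfigured run from getting as far as environment creation.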

Tianshou 2.0 Component Stack

Tianshou 2.0 introduces a clean separation of concerns. The MOSAIC launcher constructs the following component stack for each algorithm:

Environment (gymnasium.Env)
  -> SubprocVectorEnv (parallelized)
    -> Collector (data collection)
      -> VectorReplayBuffer (storage)
        -> Algorithm (learning logic)
          -> Policy (action selection)
            -> Network (neural network)
              -> Net / Actor / Critic

PPO stack (on-policy):

# Network
net = Net(state_shape, hidden_sizes=[64, 64])
actor = Actor(net, action_shape)
critic = Critic(net)

# Policy
policy = ProbabilisticActorPolicy(actor, dist_fn, action_space)

# Algorithm
algorithm = PPO(policy, critic, optim, eps_clip=0.2, ...)

# Training
algorithm.run_training(OnPolicyTrainerParams(...))

DQN stack (off-policy):

# Network
net = Net(state_shape, action_shape, hidden_sizes=[128, 128])

# Policy
policy = DiscreteQLearningPolicy(net, action_space, eps_training=0.1)

# Algorithm
algorithm = DQN(policy, optim, gamma=0.99, target_update_freq=320)

# Training
algorithm.run_training(OffPolicyTrainerParams(...))

Configuration

TianshouWorkerConfig (config.py) is a frozen dataclass that implements the MOSAIC WorkerConfig protocol:

@dataclass(frozen=True)
class TianshouWorkerConfig:
    run_id: str                     # ULID-format unique run identifier
    algo: str                       # Algorithm name ("ppo", "dqn")
    env_id: str                     # Gymnasium environment ID
    total_timesteps: int            # Training budget
    seed: Optional[int] = None      # Random seed
    extras: dict[str, Any] = ...    # Algorithm-specific hyperparameters
    worker_id: Optional[str] = None
    raw: dict[str, Any] = ...       # Full raw payload

Key extras fields:

lr: learning rate
hidden_sizes: network architecture (e.g. [64, 64])
batch_size: optimization batch size
epoch: number of training epochs
buffer_size: replay buffer capacity (off-policy)
num_envs: number of parallel environments
step_per_collect: steps per collection phase (on-policy)
eps_train / eps_test: exploration rates (DQN)
fastlane_enabled: enable real-time frame streaming
video_mode: FastLane video mode ("single")
eval_only: run evaluation instead of training
eval_episodes: number of evaluation episodes
policy_path: path to trained policy checkpoint
resume_from: path to checkpoint for resume

The config supports:

to_dict() / from_dict(): JSON serialization
with_overrides(): create a new config with selective field updates
Nested format loading: extracts config from metadata.worker.config
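
The three capabilities listed above can be sketched with the standard `dataclasses` module. This is an illustrative approximation, not the contents of config.py: the class name `WorkerConfigSketch` is hypothetical, the `raw` field is omitted for brevity, and the method bodies are assumptions about one reasonable implementation.

```python
import dataclasses
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class WorkerConfigSketch:
    run_id: str
    algo: str
    env_id: str
    total_timesteps: int
    seed: Optional[int] = None
    extras: Dict[str, Any] = field(default_factory=dict)
    worker_id: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        # asdict() deep-copies nested containers, so the result is JSON-ready
        # and independent of the frozen instance.
        return dataclasses.asdict(self)

    @classmethod
    def from_dict(cls, payload: Dict[str, Any]) -> "WorkerConfigSketch":
        # Nested format: prefer metadata.worker.config when it is present.
        nested = payload.get("metadata", {}).get("worker", {}).get("config")
        data = nested if isinstance(nested, dict) else payload
        known = {f.name for f in dataclasses.fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})

    def with_overrides(self, **changes: Any) -> "WorkerConfigSketch":
        # Frozen instances cannot be mutated; replace() builds a new one.
        return dataclasses.replace(self, **changes)
```

Because the dataclass is frozen, `with_overrides(seed=7)` leaves the original config unchanged and returns a modified copy.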

FastLane Telemetry

FastLane environment variables are set by runtime.py via apply_fastlane_environment() before spawning the subprocess:

GYM_GUI_FASTLANE_ONLY: 1 to stream, 0 to disable
GYM_GUI_FASTLANE_SLOT: which parallel env to probe
GYM_GUI_FASTLANE_VIDEO_MODE: "single" (default)
GYM_GUI_FASTLANE_GRID_LIMIT: max envs to tile
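
A minimal sketch of how such an environment could be assembled before spawning the subprocess. The real `apply_fastlane_environment()` signature is not shown in this page, so the helper below uses a different, hypothetical name (`build_fastlane_env`). The `fastlane_enabled` and `video_mode` keys come from the extras list above; `fastlane_slot` and `fastlane_grid_limit` are assumed key names used only for illustration.

```python
import os
from typing import Any, Dict, Mapping, Optional


def build_fastlane_env(
    extras: Mapping[str, Any],
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with FastLane variables applied."""
    env = dict(os.environ if base_env is None else base_env)
    enabled = bool(extras.get("fastlane_enabled", False))
    env["GYM_GUI_FASTLANE_ONLY"] = "1" if enabled else "0"
    if enabled:
        # Assumed extras keys; defaults mirror the values listed above.
        env["GYM_GUI_FASTLANE_SLOT"] = str(extras.get("fastlane_slot", 0))
        env["GYM_GUI_FASTLANE_VIDEO_MODE"] = str(extras.get("video_mode", "single"))
        if "fastlane_grid_limit" in extras:
            env["GYM_GUI_FASTLANE_GRID_LIMIT"] = str(extras["fastlane_grid_limit"])
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.Popen`, which leaves the parent process environment untouched.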

GUI Integration

The Tianshou worker provides four dedicated form widgets in gym_gui/ui/widgets/ and a presenter in gym_gui/ui/presenters/workers/:

| Module | Purpose |
| --- | --- |
| `tianshou_train_form.py` | Primary training dialog. Algorithm selection (PPO, DQN), environment family and ID selection, hyperparameter tuning (dynamically generated from `_ALGO_PARAM_SPECS`), seed, timesteps, FastLane toggle. |
| `tianshou_script_form.py` | Custom Python script launcher. Discovers `.py` scripts in `TIANSHOU_SCRIPTS_DIR` and lets the user select and run one. |
| `tianshou_resume_form.py` | Resume training from a checkpoint. Browses for `.pth`/`.pt` files and auto-populates algorithm and environment by reading `config-*.json` from the checkpoint directory. |
| `tianshou_policy_form.py` | Policy evaluation dialog. Loads a trained checkpoint, configures evaluation episodes, and optionally enables FastLane rendering. |
| `tianshou_worker_presenter.py` | Presenter. Creates a `FastLaneTab` for live video streaming when `fastlane_enabled` is set in the run config. |

All four forms self-register with the WorkerFormFactory at import time via the factory pattern at the bottom of each module.
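
Import-time self-registration can be sketched as below. The actual `WorkerFormFactory` API is not shown in this page, so this toy version uses hypothetical names (`FormFactorySketch`, `register`, `create`, `TianshouTrainFormSketch`) and a plain class in place of a GUI widget.

```python
from typing import Callable, Dict


class FormFactorySketch:
    """Toy registry mapping a form key to a callable that builds the form."""

    def __init__(self) -> None:
        self._builders: Dict[str, Callable[[], object]] = {}

    def register(self, key: str, builder: Callable[[], object]) -> None:
        # Reject duplicates so two modules cannot silently claim the same key.
        if key in self._builders:
            raise ValueError(f"form {key!r} is already registered")
        self._builders[key] = builder

    def create(self, key: str) -> object:
        return self._builders[key]()


FACTORY = FormFactorySketch()


class TianshouTrainFormSketch:
    """Stand-in for the training dialog; the real form is a GUI widget."""


# Placed at the bottom of the module: runs once, when the module is imported.
FACTORY.register("tianshou_train", TianshouTrainFormSketch)
```

With this pattern, importing the form module is enough to make the form available through the factory; no central list of forms has to be maintained.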

Worker Discovery

The worker registers itself via the mosaic.workers entry point in pyproject.toml:

[project.entry-points."mosaic.workers"]
tianshou = "tianshou_worker:get_worker_metadata"

get_worker_metadata() returns:

WorkerCapabilities(
    worker_type="tianshou",
    supported_paradigms=("sequential",),
    env_families=("gymnasium", "atari", "mujoco", "pettingzoo"),
    action_spaces=("discrete", "continuous"),
    observation_spaces=("vector", "image"),
    max_agents=1,
    supports_checkpointing=True,
    supports_pause_resume=False,
    requires_gpu=False,
    estimated_memory_mb=512,
)

Current Limitations

The Tianshou worker is an early integration with known gaps compared to the CleanRL and XuanCe workers:

| Gap | Description |
| --- | --- |
| Algorithm coverage | Only PPO and DQN are wired; Tianshou provides 30+ algorithms upstream. The launcher uses a static `ALGO_MAP` dict instead of dynamic registry-based lookup. |
| No sitecustomize.py | Missing import-time patches for `gym.make()` wrapping, TensorBoard redirect, `torch.save()` auto-mkdir, and checkpoint resume hooks. |
| No dedicated fastlane.py | No `FastLaneTelemetryWrapper`, no frame throttling, no grid mode. Only basic environment variable setup via `apply_fastlane_environment`. |
| No analytics manifest | No `analytics.json` written on run completion (TensorBoard paths, checkpoint locations, etc.). |
| No dry-run validation | CLI does not support `--dry-run`. No `validation_tianshou_worker_form.py` for pre-flight config checks. |
| No interactive runtime | No step-by-step JSON IPC protocol for GUI-driven policy evaluation. |
| No curriculum training | No environment switching with weight preservation across phases. |
| No WANDB integration | No Weights & Biases logging support. |
| No telemetry emitter | No `run_started` / `heartbeat` / `run_completed` lifecycle events. |
| Limited test coverage | 9 test cases (vs 1000+ in XuanCe worker). |

See the development progress report at docs/Development_Progress/1.0_DAY_70/TASK_2/TIANSHOU_WORKER_TECHNICAL_REPORT.md for the full gap analysis and implementation roadmap.

Metadata & Schemas

Algorithm hyperparameter schemas are defined in metadata/tianshou/2.0.0/schemas.json:

{
  "algorithms": {
    "ppo": {
      "fields": [
        {"name": "lr", "type": "float", "default": 3e-4, "help": "Learning rate"},
        {"name": "hidden_sizes", "type": "list[int]", "default": [64, 64]},
        {"name": "eps_clip", "type": "float", "default": 0.2},
        {"name": "gae_lambda", "type": "float", "default": 0.95},
        {"name": "batch_size", "type": "int", "default": 64}
      ]
    },
    "dqn": {
      "fields": [
        {"name": "lr", "type": "float", "default": 1e-3},
        {"name": "hidden_sizes", "type": "list[int]", "default": [128, 128]},
        {"name": "gamma", "type": "float", "default": 0.99},
        {"name": "target_update_freq", "type": "int", "default": 320},
        {"name": "is_double", "type": "bool", "default": true}
      ]
    }
  }
}

The TianshouTrainForm uses _ALGO_PARAM_SPECS (hardcoded in the form widget) to dynamically generate hyperparameter input fields when the algorithm selection changes. Future work will drive this from schemas.json instead.
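
As a rough illustration of what driving defaults from schemas.json could look like, the sketch below reads per-algorithm defaults from a schema in the format shown above and merges user overrides on top. The helper names `default_extras` and `resolve_extras` are hypothetical, and the embedded JSON is an abbreviated copy of the example schema rather than the real file.

```python
import json
from typing import Any, Dict

# Abbreviated copy of the schema format shown above (not the real file).
SCHEMAS_JSON = """
{
  "algorithms": {
    "ppo": {"fields": [
      {"name": "lr", "type": "float", "default": 3e-4},
      {"name": "eps_clip", "type": "float", "default": 0.2}
    ]},
    "dqn": {"fields": [
      {"name": "lr", "type": "float", "default": 1e-3},
      {"name": "target_update_freq", "type": "int", "default": 320}
    ]}
  }
}
"""


def default_extras(schemas: Dict[str, Any], algo: str) -> Dict[str, Any]:
    """Collect {field name: default} for one algorithm."""
    fields = schemas["algorithms"][algo]["fields"]
    return {f["name"]: f["default"] for f in fields}


def resolve_extras(
    schemas: Dict[str, Any], algo: str, overrides: Dict[str, Any]
) -> Dict[str, Any]:
    """Start from schema defaults, then apply user-supplied values."""
    merged = default_extras(schemas, algo)
    merged.update(overrides)
    return merged
```

For example, `resolve_extras(json.loads(SCHEMAS_JSON), "ppo", {"lr": 1e-3})` keeps the schema default for `eps_clip` while replacing `lr`.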

Dependencies

The worker depends on:

Tianshou v2.0.0 (git submodule at 3rd_party/workers/tianshou_worker/tianshou)
PyTorch (deep learning backend)
Gymnasium (environment API)
NumPy (numerical operations)
ULID (time-sortable unique run identifiers)

Install with:

pip install -e ".[tianshou]"
pip install -e 3rd_party/workers/tianshou_worker/tianshou