Tianshou Worker¶
The Tianshou worker is MOSAIC’s integration of the
Tianshou deep reinforcement learning
platform. Tianshou (v2.0) is a modular, type-safe PyTorch framework
that separates the Algorithm and Policy abstractions and supports
online (on- and off-policy), offline, and imitation learning. MOSAIC
exposes it behind the standard shim pattern.
| Paradigm | Single-agent (sequential) |
| Algorithms | PPO, DQN (integrated); 30+ available upstream (SAC, TD3, DDPG, A2C, TRPO, C51, Rainbow, IQN, FQF, BCQ, CQL, GAIL, ICM, and more) |
| Environments | Gymnasium, Atari, MuJoCo, Classic Control, Box2D, MiniGrid, Toy Text |
| Execution | Subprocess (one OS process per training run) |
| GPU required | No (optional CUDA acceleration) |
| Upstream version | 2.0.0 (integrated as git submodule) |
| Source | |
Note
Early integration. The Tianshou worker currently has PPO and DQN wired end-to-end. The remaining algorithms from Tianshou’s catalog are available in the submodule but have not yet been connected to the MOSAIC launcher and GUI forms. See Current Limitations for details.
About Tianshou¶
Tianshou (meaning “divinely ordained” in Chinese) is developed by Tsinghua University and the appliedAI Institute. Version 2.0 is a complete overhaul that introduces:
Clear separation between Algorithm (learning logic) and Policy (action selection), replacing the monolithic BasePolicy of v1.
Renamed, more intuitive parameters (e.g. n_step_return_horizon instead of n_step).
Type-level separation between on-policy, off-policy, and offline algorithms in the class hierarchy.
High-level ExperimentBuilder API for declarative experiment setup alongside the low-level procedural API for maximum control.
Tianshou’s full algorithm catalog includes:
| Family | Algorithms | Notes |
|---|---|---|
| Q-Learning | DQN, Double DQN, Dueling DQN, Branching DQN, C51, Rainbow, QRDQN, IQN, FQF | Discrete action spaces |
| Policy Gradient | PG (REINFORCE), NPG, A2C, TRPO, PPO | On-policy, discrete and continuous |
| Continuous Control | DDPG, TD3, SAC, REDQ, Discrete SAC | Off-policy actor-critic |
| Offline RL | BCQ, CQL, TD3+BC, CRR, Discrete BCQ/CQL/CRR | Learning from static datasets |
| Imitation Learning | IL (vanilla), GAIL | Learning from demonstrations |
| Exploration | ICM, PER, HER, PSRL | Curiosity, prioritized replay, hindsight |
Architecture¶
graph TB
subgraph "MOSAIC GUI"
FORM["Training Form<br/>(Tianshou widgets)"]
DAEMON["Trainer Daemon"]
end
subgraph "Worker Subprocess"
CLI["cli.py<br/>entry point"]
CFG["config.py<br/>TianshouWorkerConfig"]
RT["runtime.py<br/>TianshouWorkerRuntime"]
LAUNCH["launcher.py<br/>algorithm dispatch"]
end
subgraph "Upstream Tianshou (v2.0)"
ALGO["PPO / DQN / ...<br/>Algorithm + Policy + Collector + Trainer"]
end
FORM -->|"config JSON"| DAEMON
DAEMON -->|"spawn"| CLI
CLI --> CFG --> RT
RT -->|"subprocess"| LAUNCH
LAUNCH --> ALGO
style FORM fill:#4a90d9,stroke:#2e5a87,color:#fff
style DAEMON fill:#50c878,stroke:#2e8b57,color:#fff
style CLI fill:#ff7f50,stroke:#cc5500,color:#fff
style CFG fill:#ff7f50,stroke:#cc5500,color:#fff
style RT fill:#ff7f50,stroke:#cc5500,color:#fff
style LAUNCH fill:#ff7f50,stroke:#cc5500,color:#fff
style ALGO fill:#e8e8e8,stroke:#999
Lifecycle of a training run:
The GUI form (TianshouTrainForm) builds a TianshouWorkerConfig and hands it to the Trainer Daemon as JSON.
The daemon spawns python -m tianshou_worker.launcher --config-file <path>.
launcher.py loads the config, looks up the algorithm in ALGO_MAP, and calls the corresponding runner function (run_ppo or run_dqn).
The runner creates SubprocVectorEnv, builds the Tianshou 2.0 component stack (Net -> Actor/Critic -> Policy -> Algorithm), sets up Collector + VectorReplayBuffer, and launches training via algorithm.run_training().
TensorBoard metrics are written to var/trainer/runs/{run_id}/.
FastLane environment variables are configured by runtime.py before spawning the subprocess.
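The dispatch step in the lifecycle above can be sketched roughly as follows. This is a minimal illustration, not the actual launcher.py: the runner bodies are hypothetical stubs that only return a label, and the `dispatch` and `main` helpers are assumed names. Only `ALGO_MAP`, `run_ppo`, `run_dqn`, and the `--config-file` flag come from the description above.

```python
import argparse
import json
from typing import Any, Callable, Dict, List, Optional


def run_ppo(config: Dict[str, Any]) -> str:
    # Hypothetical stub: the real runner would build the PPO stack and train.
    return f"ppo:{config['env_id']}"


def run_dqn(config: Dict[str, Any]) -> str:
    # Hypothetical stub: the real runner would build the DQN stack and train.
    return f"dqn:{config['env_id']}"


# Static mapping from algorithm name to runner function.
ALGO_MAP: Dict[str, Callable[[Dict[str, Any]], str]] = {
    "ppo": run_ppo,
    "dqn": run_dqn,
}


def dispatch(config: Dict[str, Any]) -> str:
    """Look up the runner for config["algo"] and call it."""
    algo = str(config.get("algo", "")).lower()
    if algo not in ALGO_MAP:
        raise ValueError(
            f"Unsupported algorithm {algo!r}; expected one of {sorted(ALGO_MAP)}"
        )
    return ALGO_MAP[algo](config)


def main(argv: Optional[List[str]] = None) -> str:
    """Parse --config-file, load the JSON config, and dispatch."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-file", required=True)
    args = parser.parse_args(argv)
    with open(args.config_file, encoding="utf-8") as fh:
        return dispatch(json.load(fh))
```

A static map keeps the failure mode explicit: an algorithm that is present upstream but not yet wired raises a clear error instead of failing somewhere inside the training stack.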
Tianshou 2.0 Component Stack¶
Tianshou 2.0 introduces a clean separation of concerns. The MOSAIC launcher constructs the following component stack for each algorithm:
Environment (gymnasium.Env)
-> SubprocVectorEnv (parallelized)
-> Collector (data collection)
-> VectorReplayBuffer (storage)
-> Algorithm (learning logic)
-> Policy (action selection)
-> Network (neural network)
-> Net / Actor / Critic
PPO stack (on-policy):
# Network
net = Net(state_shape, hidden_sizes=[64, 64])
actor = Actor(net, action_shape)
critic = Critic(net)
# Policy
policy = ProbabilisticActorPolicy(actor, dist_fn, action_space)
# Algorithm
algorithm = PPO(policy, critic, optim, eps_clip=0.2, ...)
# Training
algorithm.run_training(OnPolicyTrainerParams(...))
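To make the role of eps_clip concrete, the standard PPO clipped surrogate for a single sample can be written in a few lines. This is a plain-Python sketch of the textbook objective, not Tianshou's implementation (which operates on batched tensors); the function name `clipped_surrogate` is an illustrative assumption.

```python
def clipped_surrogate(ratio: float, advantage: float, eps_clip: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio is pi_new(a|s) / pi_old(a|s); advantage is the estimated advantage.
    """
    clipped_ratio = max(1.0 - eps_clip, min(ratio, 1.0 + eps_clip))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With eps_clip=0.2, a ratio of 1.5 and a positive advantage is capped as if the ratio were 1.2, so the policy gains nothing from moving further in that direction. With a negative advantage the unclipped, more pessimistic term is kept.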
DQN stack (off-policy):
# Network
net = Net(state_shape, action_shape, hidden_sizes=[128, 128])
# Policy
policy = DiscreteQLearningPolicy(net, action_space, eps_training=0.1)
# Algorithm
algorithm = DQN(policy, optim, gamma=0.99, target_update_freq=320)
# Training
algorithm.run_training(OffPolicyTrainerParams(...))
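The eps_training=0.1 argument above controls epsilon-greedy exploration: with probability eps the agent picks a uniformly random action, otherwise the action with the highest Q-value. The sketch below shows that rule in plain Python. It is an illustration of the general technique, not the code path inside DiscreteQLearningPolicy, and `epsilon_greedy` is an assumed helper name.

```python
import random
from typing import Optional, Sequence


def epsilon_greedy(
    q_values: Sequence[float],
    eps: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Return a random action index with probability eps, else the greedy one."""
    if len(q_values) == 0:
        raise ValueError("q_values must not be empty")
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    rng = rng or random.Random()
    if rng.random() < eps:
        # Explore: uniform over all actions.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value (first one on ties).
    return max(range(len(q_values)), key=lambda i: q_values[i])
```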
Configuration¶
TianshouWorkerConfig (config.py) is a frozen dataclass that implements
the MOSAIC WorkerConfig protocol:
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True)
class TianshouWorkerConfig:
    run_id: str                   # ULID-format unique run identifier
    algo: str                     # Algorithm name ("ppo", "dqn")
    env_id: str                   # Gymnasium environment ID
    total_timesteps: int          # Training budget
    seed: Optional[int] = None    # Random seed
    extras: dict[str, Any] = ...  # Algorithm-specific hyperparameters
    worker_id: Optional[str] = None
    raw: dict[str, Any] = ...     # Full raw payload
Key extras fields:
lr: learning rate
hidden_sizes: network architecture (e.g. [64, 64])
batch_size: optimization batch size
epoch: number of training epochs
buffer_size: replay buffer capacity (off-policy)
num_envs: number of parallel environments
step_per_collect: steps per collection phase (on-policy)
eps_train / eps_test: exploration rates (DQN)
fastlane_enabled: enable real-time frame streaming
video_mode: FastLane video mode ("single")
eval_only: run evaluation instead of training
eval_episodes: number of evaluation episodes
policy_path: path to trained policy checkpoint
resume_from: path to checkpoint for resume
The config supports:
to_dict() / from_dict(): JSON serialization
with_overrides(): create a new config with selective field updates
Nested format loading: extracts config from metadata.worker.config
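The three capabilities listed above could be implemented along the following lines. This is a reduced sketch under stated assumptions: `SketchWorkerConfig` is a stand-in class with only a subset of the fields, the `extras` default and the handling of unknown keys are guesses, and the real TianshouWorkerConfig may differ in all of these details.

```python
from dataclasses import asdict, dataclass, field, fields, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class SketchWorkerConfig:
    """Illustrative stand-in for TianshouWorkerConfig (subset of fields)."""

    run_id: str
    algo: str
    env_id: str
    total_timesteps: int
    seed: Optional[int] = None
    extras: Dict[str, Any] = field(default_factory=dict)  # assumed default

    def to_dict(self) -> Dict[str, Any]:
        # asdict() deep-copies nested containers, so callers cannot mutate us.
        return asdict(self)

    @classmethod
    def from_dict(cls, payload: Dict[str, Any]) -> "SketchWorkerConfig":
        # Nested format: prefer payload["metadata"]["worker"]["config"] if present.
        nested = payload.get("metadata", {}).get("worker", {}).get("config")
        data = nested if isinstance(nested, dict) else payload
        known = {f.name for f in fields(cls)}
        # Assumption: keys that are not dataclass fields are silently ignored.
        return cls(**{k: v for k, v in data.items() if k in known})

    def with_overrides(self, **changes: Any) -> "SketchWorkerConfig":
        # Frozen instances cannot be mutated; replace() builds a new one.
        return replace(self, **changes)
```

For example, `cfg.with_overrides(seed=7)` returns a new config and leaves `cfg` unchanged, which is the point of freezing the dataclass.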
FastLane Telemetry¶
FastLane environment variables are set by runtime.py via
apply_fastlane_environment() before spawning the subprocess:
GYM_GUI_FASTLANE_ONLY: 1 to stream, 0 to disable
GYM_GUI_FASTLANE_SLOT: which parallel env to probe
GYM_GUI_FASTLANE_VIDEO_MODE: "single" (default)
GYM_GUI_FASTLANE_GRID_LIMIT: max envs to tile
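One way to prepare these variables for a child process is sketched below. It is not the real apply_fastlane_environment(), whose signature is not shown here; the helper name `build_fastlane_env` and the defaults for `slot` and `grid_limit` are assumptions. Only the four variable names and the "single" default come from the list above.

```python
import os
from typing import Dict, Mapping, Optional


def build_fastlane_env(
    enabled: bool,
    slot: int = 0,                 # assumed default
    video_mode: str = "single",
    grid_limit: int = 4,           # assumed default
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with the FastLane variables set."""
    env = dict(os.environ if base is None else base)
    env["GYM_GUI_FASTLANE_ONLY"] = "1" if enabled else "0"
    env["GYM_GUI_FASTLANE_SLOT"] = str(slot)
    env["GYM_GUI_FASTLANE_VIDEO_MODE"] = video_mode
    env["GYM_GUI_FASTLANE_GRID_LIMIT"] = str(grid_limit)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.Popen`, so the parent process's own environment is left untouched.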
GUI Integration¶
The Tianshou worker provides four dedicated form widgets in
gym_gui/ui/widgets/ and a presenter in gym_gui/ui/presenters/workers/:
Form |
Purpose |
|---|---|
|
Primary training dialog. Algorithm selection (PPO, DQN),
environment family and ID selection, hyperparameter tuning
(dynamically generated from |
|
Custom Python script launcher. Discovers |
|
Resume training from a checkpoint. Browses for |
|
Policy evaluation dialog. Loads a trained checkpoint, configures evaluation episodes, and optionally enables FastLane rendering. |
|
Creates |
All four forms self-register with the WorkerFormFactory at import
time via the factory pattern at the bottom of each module.
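Import-time self-registration generally looks like the sketch below. The `FormRegistry` class, its methods, and the `"tianshou.train"` key are hypothetical and only illustrate the pattern; the actual WorkerFormFactory API is not documented here and may look different.

```python
from typing import Any, Callable, Dict


class FormRegistry:
    """Hypothetical stand-in for a form factory keyed by form name."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[..., Any]] = {}

    def register(self, key: str, factory: Callable[..., Any]) -> None:
        # Refuse duplicates so two modules cannot silently shadow each other.
        if key in self._factories:
            raise ValueError(f"Form {key!r} is already registered")
        self._factories[key] = factory

    def create(self, key: str, *args: Any, **kwargs: Any) -> Any:
        if key not in self._factories:
            raise KeyError(f"No form registered under {key!r}")
        return self._factories[key](*args, **kwargs)


registry = FormRegistry()


class ExampleTrainForm:
    """Placeholder widget class used only for this illustration."""

    def __init__(self, title: str = "Train") -> None:
        self.title = title


# At the bottom of the widget module: runs once, when the module is imported.
registry.register("tianshou.train", ExampleTrainForm)
```

Because registration happens as a side effect of import, a form only becomes available once something imports its module.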
Worker Discovery¶
The worker registers itself via the mosaic.workers entry point
in pyproject.toml:
[project.entry-points."mosaic.workers"]
tianshou = "tianshou_worker:get_worker_metadata"
get_worker_metadata() returns:
WorkerCapabilities(
    worker_type="tianshou",
    supported_paradigms=("sequential",),
    env_families=("gymnasium", "atari", "mujoco", "pettingzoo"),
    action_spaces=("discrete", "continuous"),
    observation_spaces=("vector", "image"),
    max_agents=1,
    supports_checkpointing=True,
    supports_pause_resume=False,
    requires_gpu=False,
    estimated_memory_mb=512,
)
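A host application can enumerate workers registered under this entry-point group with the standard library alone. The sketch below shows one way to do it; `discover_workers` is an assumed helper name, and skipping workers that fail to load is a design choice of this example rather than documented MOSAIC behaviour.

```python
from importlib.metadata import entry_points
from typing import Any, Dict


def discover_workers(group: str = "mosaic.workers") -> Dict[str, Any]:
    """Load every entry point in the group and call it to get its metadata."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry points.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a plain dict of groups.
        selected = eps.get(group, [])

    found: Dict[str, Any] = {}
    for ep in selected:
        try:
            factory = ep.load()          # e.g. tianshou_worker:get_worker_metadata
            found[ep.name] = factory()   # call it to obtain the metadata object
        except Exception as exc:         # a broken worker should not break discovery
            print(f"Skipping worker {ep.name!r}: {exc}")
    return found
```

With the worker package installed, the returned mapping would contain a `"tianshou"` key; with nothing installed under the group it is simply empty.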
Current Limitations¶
The Tianshou worker is an early integration with known gaps compared to the CleanRL and XuanCe workers:
| Gap | Description |
|---|---|
| Algorithm coverage | Only PPO and DQN are wired; Tianshou provides 30+ algorithms upstream. The launcher uses a static |
| No sitecustomize.py | Missing import-time patches for |
| No dedicated fastlane.py | No |
| No analytics manifest | No |
| No dry-run validation | CLI does not support |
| No interactive runtime | No step-by-step JSON IPC protocol for GUI-driven policy evaluation. |
| No curriculum training | No environment switching with weight preservation across phases. |
| No WANDB integration | No Weights & Biases logging support. |
| No telemetry emitter | No |
| Limited test coverage | 9 test cases (vs 1000+ in XuanCe worker). |
See the development progress report at
docs/Development_Progress/1.0_DAY_70/TASK_2/TIANSHOU_WORKER_TECHNICAL_REPORT.md
for the full gap analysis and implementation roadmap.
Metadata & Schemas¶
Algorithm hyperparameter schemas are defined in
metadata/tianshou/2.0.0/schemas.json:
{
  "algorithms": {
    "ppo": {
      "fields": [
        {"name": "lr", "type": "float", "default": 3e-4, "help": "Learning rate"},
        {"name": "hidden_sizes", "type": "list[int]", "default": [64, 64]},
        {"name": "eps_clip", "type": "float", "default": 0.2},
        {"name": "gae_lambda", "type": "float", "default": 0.95},
        {"name": "batch_size", "type": "int", "default": 64}
      ]
    },
    "dqn": {
      "fields": [
        {"name": "lr", "type": "float", "default": 1e-3},
        {"name": "hidden_sizes", "type": "list[int]", "default": [128, 128]},
        {"name": "gamma", "type": "float", "default": 0.99},
        {"name": "target_update_freq", "type": "int", "default": 320},
        {"name": "is_double", "type": "bool", "default": true}
      ]
    }
  }
}
The TianshouTrainForm uses _ALGO_PARAM_SPECS (hardcoded in the
form widget) to dynamically generate hyperparameter input fields when the
algorithm selection changes. Future work will drive this from
schemas.json instead.
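A schema-driven form would need little more than a function that turns the schema into default values and validates user overrides. The sketch below illustrates that idea with an inline excerpt of the schema shown above. The helper names `default_hyperparameters` and `merge_overrides` are hypothetical, and rejecting unknown keys is an assumption about the desired behaviour.

```python
import json
from typing import Any, Dict

# Inline excerpt of the schema structure, so the example is self-contained.
SCHEMA_JSON = """
{
  "algorithms": {
    "ppo": {"fields": [
      {"name": "lr", "type": "float", "default": 3e-4},
      {"name": "eps_clip", "type": "float", "default": 0.2}
    ]},
    "dqn": {"fields": [
      {"name": "lr", "type": "float", "default": 1e-3},
      {"name": "target_update_freq", "type": "int", "default": 320},
      {"name": "is_double", "type": "bool", "default": true}
    ]}
  }
}
"""


def default_hyperparameters(schema: Dict[str, Any], algo: str) -> Dict[str, Any]:
    """Collect {field name: default} for one algorithm."""
    algorithms = schema.get("algorithms", {})
    if algo not in algorithms:
        raise ValueError(f"No schema for algorithm {algo!r}")
    return {f["name"]: f["default"] for f in algorithms[algo]["fields"]}


def merge_overrides(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Apply user overrides, rejecting names the schema does not declare."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise ValueError(f"Unknown hyperparameters: {unknown}")
    return {**defaults, **overrides}


schema = json.loads(SCHEMA_JSON)
dqn_params = merge_overrides(default_hyperparameters(schema, "dqn"), {"lr": 5e-4})
```

The merged dictionary has the same shape as the extras field of the worker config, so it could be passed along without further translation.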
Dependencies¶
The worker depends on:
Tianshou v2.0.0 (git submodule at 3rd_party/workers/tianshou_worker/tianshou)
PyTorch (deep learning backend)
Gymnasium (environment API)
NumPy (numerical operations)
ULID (time-sortable unique run identifiers)
Install with:
pip install -e ".[tianshou]"
pip install -e 3rd_party/workers/tianshou_worker/tianshou