MOSAIC¶

A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent RL, LLM, VLM, and Human Decision-Makers

MOSAIC is a visual-first platform that enables researchers to configure, run, and compare experiments across RL, LLM, VLM, and human decision-makers in the same multi-agent environment. Different paradigms like tiles in a mosaic come together to form a complete picture of agent performance.

MOSAIC Platform Overview — The architecture shows the Evaluation Phase (operators containing workers), Training Phase (TrainerClient, TrainerService, Workers), Daemon Process (gRPC Server, RunRegistry, Dispatcher, Broadcasters), and Worker Processes (CleanRL, XuanCe, Ray RLlib, BALROG).¶

MOSAIC provides two evaluation modes designed for reproducibility:

Manual Mode Side-by-side lock-step evaluation with shared seeds. See Operators & Evaluation Modes and Slow Lane (Render View).

Manual Mode: side-by-side comparison where multiple operators step through the same environment with shared seeds, letting researchers visually inspect decision-making differences between paradigms in real time.

Script Mode: Automated batch evaluation with deterministic seed sequences. See IPC Architecture and Runtime Logging.

Script Mode: automated, long-running evaluation driven by Python scripts that define operator configurations, worker assignments, seed sequences, and episode counts. Scripts execute deterministically with no manual intervention, producing reproducible telemetry logs (JSONL) for every step and episode.

All evaluation runs share identical conditions: same environment seeds, same observations, and unified telemetry. Script Mode additionally supports procedural seeds (different seed per episode to test generalization) and fixed seeds (same seed every episode to isolate agent behaviour), with configurable step pacing for visual inspection or headless batch execution.

GitHub: https://github.com/Abdulhamid97Mousa/MOSAIC

Why MOSAIC?¶

Today’s AI landscape offers powerful but fragmented tools: RL frameworks (CleanRL, RLlib, XuanCe), language models (GPT, Claude), and robotics simulators (MuJoCo). Each excels in isolation, but no platform bridges them together under a unified, visual-first interface.

MOSAIC provides:

Visual-First Design: Configure experiments through an intuitive PyQt6 interface, Almost no code required.
Heterogeneous Agent Mixing: Deploy Human(Agent), RL, and LLM agents in the same environment
Resource Management & Quotas: GPU allocation, queue limits, credit-based backpressure, health monitoring.
Per-Agent Policy Binding: Route each agent to different workers via PolicyMappingService.
Worker Lifecycle Orchestration: Subprocess management with heartbeat monitoring and graceful termination.

Human vs Human: Two human players competing via dedicated USB keyboards. See Human Control and Multi-Keyboard Support (Evdev).

Random Agents: Baseline agents across 26 environment families. See MOSAIC Random Worker and Supported Environments.

Heterogeneous Multi-Agent Ad-Hoc Teamwork in Adversarial Settings: Different decision-making paradigms (RL, LLM, Random) competing head-to-head in the same multi-agent environment. See Heterogeneous Decision-Maker.

Homogeneous Teams: Random vs LLM: Two homogeneous teams (all-Random vs all-LLM) competing in the same multi-agent environment. See Homogeneous Decision-Makers.

Agent-Level Interface and Cross-Paradigm Evaluation¶

Agent-Level Interface. Existing infrastructure lacks the ability to deploy agents from different decision-making paradigms within the same environment. The root cause is an interface mismatch: RL agents expect tensor observations and produce integer actions, while LLM agents expect text prompts and produce text responses. MOSAIC addresses this through an operator abstraction that forms an agent-level interface by mapping workers to agents: each operator, regardless of whether it is backed by an RL policy, an LLM, or a human, conforms to a minimal unified interface (select_action(obs) → action). The environment never needs to know what kind of decision-maker it is communicating with. This is the agent-side counterpart to what Gymnasium did for environments: Gymnasium standardized the environment interface (reset() / step()), so any algorithm can interact with any environment; MOSAIC’s Operator Protocol standardizes the agent interface, so any decision-maker can be plugged into any compatible environment without modifying either side.

Cross-Paradigm Evaluation. Cross-paradigm evaluation is the ability to deploy decision-makers from different paradigms (RL, LLM, VLM, Human, scripted baselines) within the same multi-agent environment under identical conditions, and to produce directly comparable results. Both evaluation modes described above (Manual Mode and Script Mode) guarantee that all decision-makers face the same environment states, observations, and shared seeds, making this the first infrastructure to enable fair, reproducible cross-paradigm evaluation.

See Operator Concept for the full Agent-Level Interface specification, Heterogeneous Decision-Maker for the research gap and design rationale, and IPC Architecture for Manual Mode and Script Mode implementation details.

Comparison with Existing Frameworks¶

Existing frameworks are paradigm-siloed. No prior framework allowed fair, reproducible, head-to-head comparison between RL agents and LLM agents in the same multi-agent environment.

Agent Paradigms: which decision-maker types are supported. Framework: algorithms can be integrated without source-code modifications. Platform GUI: real-time visualization during execution. Cross-Paradigm: infrastructure for comparing different agent types (e.g., RL vs. LLM) on identical environment instances with shared random seeds for reproducible head-to-head evaluation. Legend: ✓ Supported, ✗ Not supported, ◉ Partial.

System	Agent Paradigms				Infrastructure		Evaluation
System	RL	LLM	VLM	Human	Framework	Platform GUI	Cross-Paradigm
RL Frameworks
RLlib [1]	✓	✗	✗	✗	✓	✗	✗
CleanRL [2]	✓	✗	✗	✗	✓	✗	✗
Tianshou [3]	✓	✗	✗	✗	✓	✗	✗
Acme [4]	✓	✗	✗	✗	✓	✗	✗
XuanCe [5]	✓	✗	✗	✗	✓	✗	✗
OpenRL [6]	✓	✗	✗	✗	✓	✗	✗
Stable-Baselines3 [7]	✓	✗	✗	✗	✓	✗	✗
Coach [8]	✓	✗	✗	✗	✓	✓	✗
BenchMARL [15]	✓	✗	✗	✗	✓	✗	✗
LLM/VLM Benchmarks
BALROG [9]	✗	✓	✓	✗	✓	✗	✗
TextArena [10]	✗	✓	✗	✓	✓	✗	✗
GameBench [11]	✗	✓	✗	✗	✓	✗	✗
lmgame-Bench [12]	✗	✓	✗	✗	✓	✗	✗
LLM Chess [13]	✓	✓	✗	✗	✓	✗	✗
LLM-Game-Bench [14]	✗	✓	✗	✗	✓	◉	✗
AgentBench [16]	✗	✓	✗	✗	✓	✗	✗
MultiAgentBench [17]	✗	✓	✗	✗	✓	✗	✗
GAMEBoT [18]	✗	✓	✗	✗	✓	✗	✗
Collab-Overcooked [19]	◉	✓	✗	✗	✓	✗	✗
BotzoneBench [20]	✗	✓	✗	✗	✓	✗	✗
AgentGym [21]	✗	✓	✗	✗	✓	✗	✗
Cross-Paradigm Frameworks
Game Reasoning Arena [22]	✓	✓	◉	◉	✓	✗	✗
CREW [23]	✓	✗	✗	✓	✓	✗	✗
LLM-PySC2 [24]	✓	✓	✗	✗	✓	✗	✗
MOSAIC (Ours)	✓	✓	✓	✓	✓	✓	✓

✓ Supported ✗ Not supported ◉ Partial

Experimental Configurations¶

Heterogeneous decision-making enables a systematic ablation matrix for cross-paradigm research. The following configurations illustrate the design using MOSAIC MultiGrid.

Formal Notation¶

Summary of notation for cross-paradigm multi-agent systems.¶
Symbol	Description
Agent Paradigms
\(\pi^{\text{RL}}_i\)	RL policy trained via reinforcement learning
\(\bar{\pi}^{\text{RL}}_i\)	Frozen RL policy (parameters \(\theta_i\) fixed; no further learning)
\(\lambda^{\text{LLM}}_j\)	LLM agent (large language model, text-only observations)
\(\psi^{\text{VLM}}_k\)	VLM agent (vision-language model, multimodal observations)
\(h_m\)	Human operator (interactive GUI control)
\(\rho\)	Uniform random baseline policy
\(\nu\)	No-op baseline policy (null action at every step)
Agent Sets and Cardinalities
\(\Pi^{\text{RL}} = \{\pi^{\text{RL}}_i\}_{i=1}^{n_{\text{RL}}}\)	Set of RL policies with cardinality \(n_{\text{RL}}\)
\(\Lambda^{\text{LLM}} = \{\lambda^{\text{LLM}}_j\}_{j=1}^{n_{\text{LLM}}}\)	Set of LLM agents with cardinality \(n_{\text{LLM}}\)
\(\Psi^{\text{VLM}} = \{\psi^{\text{VLM}}_k\}_{k=1}^{n_{\text{VLM}}}\)	Set of VLM agents with cardinality \(n_{\text{VLM}}\)
\(\mathcal{H} = \{h_m\}_{m=1}^{n_{\text{H}}}\)	Set of human operators with cardinality \(n_{\text{H}}\)
\(N = n_{\text{RL}} + n_{\text{LLM}} + n_{\text{VLM}} + n_{\text{H}}\)	Total number of agents in the system
Team Partitions
\(\mathcal{T}_A, \mathcal{T}_B\)	Disjoint team partitions: \(\mathcal{T}_A \cap \mathcal{T}_B = \emptyset\), \(\mathcal{T}_A \cup \mathcal{T}_B = \{1,\ldots,N\}\)
\(n_A, n_B\)	Team sizes: \(n_A = \|\mathcal{T}_A\|\), \(n_B = \|\mathcal{T}_B\|\), \(n_A + n_B = N\)
Observation and Action Spaces
\(\mathcal{O}^{\text{RL}} = \mathbb{R}^d\)	RL observation space (continuous tensor)
\(\mathcal{O}^{\text{LLM}} = \Sigma^{*}\)	LLM observation space (strings over alphabet \(\Sigma\))
\(\mathcal{O}^{\text{VLM}} = \Sigma^{*} \times \mathbb{R}^{H \times W \times C}\)	VLM observation space (multimodal: text and RGB image)
\(\mathcal{O}^{\text{H}} = \mathbb{R}^{H \times W \times C}\)	Human observation space (rendered RGB image)
\(\mathcal{A} = \{1,2,\dots,K\}\)	Discrete action space (shared after paradigm-specific parsing)
\(\phi: \Sigma^{*} \to \mathcal{A}\)	Deterministic parsing function mapping LLM/VLM text to actions

Standard Self-Play vs Cross-Paradigm Transfer¶

Comparison: Standard Self-Play vs Cross-Paradigm Transfer¶
Aspect	Standard Self-Play (Baseline)	Cross-Paradigm Transfer (MOSAIC)
Training	Co-training via self-play (\(N \geq 2\))	Solo training (\(N=1\))
Partner Model	Implicit partner model (overfitted to training partner)	Zero partner expectations
Generalization (RL)	Fails with unseen RL partners (ZSC failure)	Generalizes to unseen solo-trained RL partners
Generalization (Cross-Paradigm)	Fails when swapping RL ↔ LLM (Interface mismatch)	Succeeds across paradigm boundaries
Deployment	Requires same-paradigm, familiar partners	Supports RL, LLM, human, scripted agents

Adversarial Cross‑Paradigm Matchups¶

The first set of configurations establishes single-paradigm baselines before introducing cross-paradigm matchups to measure relative performance. Let \(\mathcal{T}_A\) and \(\mathcal{T}_B\) denote disjoint team partitions with \(|\mathcal{T}_A| = n_A\) and \(|\mathcal{T}_B| = n_B\). For each team \(\mathcal{T}_k\) (\(k \in \{A,B\}\)), we define its paradigm composition as \((\Pi^{\text{RL}}_k, \Lambda^{\text{LLM}}_k, \Psi^{\text{VLM}}_k, \mathcal{H}_k)\) where \(|\Pi^{\text{RL}}_k| + |\Lambda^{\text{LLM}}_k| + |\Psi^{\text{VLM}}_k| + |\mathcal{H}_k| = n_k\).

Adversarial configurations for \(N=4\) agents with \(n_A = n_B = 2\)¶
Config	Team A Composition	Team B Composition	Purpose
A1	\(\|\Pi^{\text{RL}}_A\| = 2\)	\(\|\Pi^{\text{RL}}_B\| = 2\)	Homogeneous RL baseline
A2	\(\|\Lambda^{\text{LLM}}_A\| = 2\)	\(\|\Lambda^{\text{LLM}}_B\| = 2\)	Homogeneous LLM baseline
A3	\(\|\Psi^{\text{VLM}}_A\| = 2\)	\(\|\Psi^{\text{VLM}}_B\| = 2\)	Homogeneous VLM baseline
A4	\(\|\Pi^{\text{RL}}_A\| = 2\)	\(\|\Lambda^{\text{LLM}}_B\| = 2\)	Cross-paradigm (RL vs LLM)
A5	\(\|\Pi^{\text{RL}}_A\| = 2\)	\(\|\Psi^{\text{VLM}}_B\| = 2\)	Cross-paradigm (RL vs VLM)
A6	\(\|\Lambda^{\text{LLM}}_A\| = 2\)	\(\|\Psi^{\text{VLM}}_B\| = 2\)	Cross-paradigm (LLM vs VLM)
A7	\(\|\Pi^{\text{RL}}_A\| = 2\)	\(\rho\) baseline (\(n_B = 2\))	Sanity check (trained vs random)

Configurations A1-A3 measure the performance ceiling for homogeneous teams within each paradigm: RL policies trained via MARL, LLM agents reasoning via text-based decision-making, and VLM agents processing multimodal observations. Configurations A4-A6 address the central cross-paradigm research questions: under identical environmental conditions and shared random seeds, does a team of RL policies outperform teams of LLM or VLM agents, and how do LLM and VLM agents compare head-to-head? A7 serves as a sanity check, confirming that trained agents significantly outperform uniform-random baseline policies.

Cooperative Heterogeneous Teams¶

The second set of configurations examines intra-team heterogeneity by mixing paradigms within a team. These configurations test whether LLM or VLM agents (\(\lambda^{\text{LLM}}\) or \(\psi^{\text{VLM}}\)) can effectively cooperate with a frozen RL policy \(\bar{\pi}^{\text{RL}}\) that was trained without any partner model.

Cooperative configurations for \(N=4\) agents with \(n_A = n_B = 2\)¶
Config	Team A Composition	Team B Composition	Research Question
C1	\(\|\Pi^{\text{RL}}_A\| = 1\), \(\|\Lambda^{\text{LLM}}_A\| = 1\)	\(\|\Pi^{\text{RL}}_B\| = 1\), \(\rho\) baseline	Does \(\lambda^{\text{LLM}}\) outperform \(\rho\) as teammate?
C2	\(\|\Pi^{\text{RL}}_A\| = 1\), \(\|\Lambda^{\text{LLM}}_A\| = 1\)	\(\|\Pi^{\text{RL}}_B\| = 1\), \(\nu\) baseline	Does \(\lambda^{\text{LLM}}\) actively contribute?
C3	\(\|\Pi^{\text{RL}}_A\| = 1\), \(\|\Psi^{\text{VLM}}_A\| = 1\)	\(\|\Pi^{\text{RL}}_B\| = 1\), \(\rho\) baseline	Does \(\psi^{\text{VLM}}\) outperform \(\rho\) as teammate?
C4	\(\|\Pi^{\text{RL}}_A\| = 1\), \(\|\Psi^{\text{VLM}}_A\| = 1\)	\(\|\Pi^{\text{RL}}_B\| = 1\), \(\nu\) baseline	Does \(\psi^{\text{VLM}}\) actively contribute?
C5	\(\|\Pi^{\text{RL}}_A\| = 2\)	\(\|\Pi^{\text{RL}}_B\| = 2\)	Solo-pair baseline (no co-training)
C6	\(\|\Pi^{\text{RL}}_A\| = 1\), \(\|\Lambda^{\text{LLM}}_A\| = 1\)	\(\|\Pi^{\text{RL}}_B\| = 2\) (co-trained)	Can zero-shot LLM teaming match co-training?
C7	\(\|\Pi^{\text{RL}}_A\| = 1\), \(\|\Psi^{\text{VLM}}_A\| = 1\)	\(\|\Pi^{\text{RL}}_B\| = 2\) (co-trained)	Can zero-shot VLM teaming match co-training?
C8	\(\|\Pi^{\text{RL}}_A\| = 1\), \(\|\Lambda^{\text{LLM}}_A\| = 1\)	\(\|\Pi^{\text{RL}}_B\| = 1\), \(\|\Psi^{\text{VLM}}_B\| = 1\)	LLM vs VLM as heterogeneous teammates

All RL policies are trained solo (\(N=1\)) and frozen before deployment; LLM/VLM agents are zero-shot. Configurations C1-C2 and C3-C4 test whether LLM and VLM agents can serve as effective teammates for frozen RL policies. C5 serves as the fair comparison baseline: two independently trained solo experts paired at evaluation time. C6-C7 compare zero-shot cross-paradigm teaming against co-trained RL teams. C8 directly compares LLM and VLM agents as teammates within heterogeneous teams.

Solo‑to‑Team Transfer Design – Why Solo Training?

RL agents are trained as solo experts in single-agent environments (\(N=1\)), then deployed as teammates in multi-agent settings without any fine‑tuning. This design eliminates the co-training confound and avoids the failure modes of standard self-play.

In standard self-play, agents develop implicit partner models calibrated against other RL agents sharing the same observation space (\(\mathcal{O} = \mathbb{R}^d\)). This creates two failure modes: (1) ZSC Failure: The agent overfits to its training partner’s conventions, failing to coordinate with unseen RL agents. (2) Cross-Paradigm Failure: As shown in the figure’s “Swap Attempt” panel, replacing an RL partner with an LLM agent causes a breakdown due to observation space mismatches (\(\mathcal{O}^{\text{RL}} \neq \mathcal{O}^{\text{LLM}}\)).

By training agents in isolation (\(N=1\)), the RL policy carries zero partner expectations. This cleanly isolates the paradigm variable as the sole experimental factor, allowing true cross-paradigm coordination where the challenge is not just an unknown policy, but a fundamentally different way of perceiving and acting in the world.

For full mathematical details and further configurations, see the companion paper.

Supported Environment Families¶

MOSAIC supports 26 environment families spanning single-agent, multi-agent, and cooperative/competitive paradigms. See the full Environment Families reference for installation instructions, environment lists, and academic citations.

Family	Description	Example Environments
Gymnasium	Standard single-agent RL (Toy Text, Classic Control, Box2D, MuJoCo)
Atari / ALE	128 classic Atari 2600 games
MiniGrid	Procedural grid-world navigation
BabyAI	Language-grounded instruction following
ViZDoom	Doom-based first-person visual RL
MiniHack / NetHack	Roguelike dungeon crawling (NLE)
Crafter	Open-world survival benchmark
Procgen	16 procedurally generated environments
BabaIsAI	Rule-manipulation puzzles
Jumanji	JAX-accelerated logic/routing/packing (25 envs)
PyBullet Drones	Quadcopter physics simulation
PettingZoo Classic	Turn-based board games (AEC)
MOSAIC MultiGrid	Competitive team sports (view_size=3)
INI MultiGrid	Cooperative exploration (view_size=7)
Melting Pot	Social multi-agent scenarios (up to 16 agents)
Overcooked	Cooperative cooking (2 agents)
SMAC	StarCraft Multi-Agent Challenge (hand-designed maps)
SMACv2	StarCraft Multi-Agent Challenge v2 (procedural units)
RWARE	Cooperative warehouse delivery
MuJoCo	Continuous-control robotics tasks

Supported Workers (8)¶

CleanRL: Single-file RL implementations (PPO, DQN, SAC, TD3, DDPG, C51)
XuanCe: Modular RL framework with flexible algorithm composition and custom environments. Multi-agent algorithms (MAPPO, QMIX, MADDPG, VDN, COMA)
Ray RLlib: RL with distributed training and large-batch optimization (PPO, IMPALA, APPO)
BALROG: LLM/VLM agentic evaluation (GPT-4o, Claude 3, Gemini · NetHack, BabyAI, Crafter)
MOSAIC LLM: Multi-agent LLM with coordination strategies and Theory of Mind (MultiGrid, BabyAI, MeltingPot, PettingZoo)
Chess LLM: LLM chess play with multi-turn dialog (PettingZoo Chess)
MOSAIC Human Worker: Human-in-the-loop play via keyboard for any Gymnasium-compatible environment (MiniGrid, Crafter, Chess, NetHack)
MOSAIC Random Worker: Baseline agents with random, no-op, and cycling action behaviours across all 26 environment families

Citing MOSAIC¶

If you use MOSAIC in your research, please cite the following paper:

@article{mousa2026mosaic,
  author  = {Abdulhamid M. Mousa and Yu Fu and Rakhmonberdi Khajiev and Jalaledin M. Azzabi and Abdulkarim M. Mousa and Peng Yang and Ming Liu},
  title   = {{MOSAIC}: A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent {RL}, {LLM}, {VLM}, and Human Decision-Makers},
  year    = {2026},
  url     = {https://github.com/Abdulhamid97Mousa/MOSAIC}
}

References¶

[1] E. Liang et al., "RLlib: Abstractions for Distributed Reinforcement Learning," ICML, 2018.
[2] S. Huang et al., "CleanRL: High-quality Single-file Implementations of Deep RL Algorithms," JMLR, 2022.
[3] J. Weng et al., "Tianshou: A Highly Modularized Deep RL Library," JMLR, 2022.
[4] M. Hoffman et al., "Acme: A Research Framework for Distributed RL," arXiv:2006.00979, 2020.
[5] W. Liu et al., "XuanCe: A Comprehensive and Unified Deep RL Library," arXiv:2312.16248, 2023.
[6] S. Huang et al., "OpenRL: A Unified Reinforcement Learning Framework," arXiv:2312.16189, 2023.
[7] A. Raffin et al., "Stable-Baselines3: Reliable RL Implementations," JMLR, 2021.
[8] I. Caspi et al., "Reinforcement Learning Coach," 2017.
[9] D. Paglieri et al., "BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games," arXiv:2411.13543, 2024.
[10] G. De Magistris et al., "TextArena," 2025.
[11] D. Costarelli et al., "GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents," arXiv:2406.06613, 2024.
[12] Y. Huang et al., "lmgame-Bench: Evaluating LLMs on Game-Theoretic Decision-Making," 2025.
[13] M. Saplin, "LLM Chess," 2025.
[14] J. Guo et al., "LLM-Game-Bench: Evaluating LLM Reasoning through Game-Playing," 2024.
[15] M. Bettini et al., "BenchMARL: Benchmarking Multi-Agent Reinforcement Learning," JMLR, 2024. arXiv:2312.01472.
[16] X. Liu et al., "AgentBench: Evaluating LLMs as Agents," ICLR, 2024. arXiv:2308.03688.
[17] K. Zhu et al., "MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents," ACL, 2025. arXiv:2503.01935.
[18] Y. Lin et al., "GAMEBoT: Transparent Assessment of LLM Reasoning in Games," ACL, 2025. arXiv:2412.13602.
[19] H. Sun et al., "Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents," EMNLP, 2025. arXiv:2502.20073.
[20] L. Li et al., "BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors," arXiv:2602.13214, 2026.
[21] Z. Xi et al., "AgentGym: Evolving Large Language Model-based Agents across Diverse Environments," ACL, 2025. arXiv:2406.04151.
[22] Cipolina et al., "Game Reasoning Arena: A Comprehensive Evaluation Framework for Large Language Models," arXiv:2501.00363, 2025.
[23] Y. Wang et al., "CREW: A Benchmark for Collaborative Multi-Step Reasoning and Planning," NeurIPS, 2024.
[24] X. Ma et al., "LLM-PySC2: A Benchmark for Large Language Models in StarCraft II," arXiv:2412.19668, 2024.