MOSAIC


A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent RL, LLM, VLM, and Human Decision-Makers

MOSAIC is a visual-first platform that enables researchers to configure, run, and compare experiments across RL, LLM, VLM, and human decision-makers in the same multi-agent environment. Like tiles in a mosaic, different paradigms come together to form a complete picture of agent performance.

MOSAIC Platform Overview

The architecture shows the Evaluation Phase (operators containing workers), Training Phase (TrainerClient, TrainerService, Workers), Daemon Process (gRPC Server, RunRegistry, Dispatcher, Broadcasters), and Worker Processes (CleanRL, XuanCe, Ray RLlib, BALROG).


MOSAIC provides two evaluation modes designed for reproducibility:

Manual Mode: Side-by-side lock-step evaluation with shared seeds. See Operators & Evaluation Modes and Slow Lane (Render View).

  • Manual Mode: side-by-side comparison where multiple operators step through the same environment with shared seeds, letting researchers visually inspect decision-making differences between paradigms in real time.

Script Mode: Automated batch evaluation with deterministic seed sequences. See IPC Architecture and Runtime Logging.

  • Script Mode: automated, long-running evaluation driven by Python scripts that define operator configurations, worker assignments, seed sequences, and episode counts. Scripts execute deterministically with no manual intervention, producing reproducible telemetry logs (JSONL) for every step and episode.

All evaluation runs share identical conditions: same environment seeds, same observations, and unified telemetry. Script Mode additionally supports procedural seeds (different seed per episode to test generalization) and fixed seeds (same seed every episode to isolate agent behaviour), with configurable step pacing for visual inspection or headless batch execution.
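The two seed schedules and the per-episode JSONL telemetry described above can be sketched as follows. This is a minimal illustration, not MOSAIC's script API: the function names (`fixed_seeds`, `procedural_seeds`, `log_episode`), the record fields, and the operator labels are all assumptions made for the example.

```python
import io
import json
from typing import List, TextIO


def fixed_seeds(seed: int, episodes: int) -> List[int]:
    """Fixed schedule: every episode replays the same seed."""
    return [seed] * episodes


def procedural_seeds(base_seed: int, episodes: int) -> List[int]:
    """Procedural schedule: a distinct but reproducible seed per episode."""
    return [base_seed + episode for episode in range(episodes)]


def log_episode(stream: TextIO, operator: str, episode: int,
                seed: int, episode_return: float) -> None:
    """Write one JSON object per line (JSONL). Field names are illustrative."""
    record = {"operator": operator, "episode": episode,
              "seed": seed, "return": episode_return}
    stream.write(json.dumps(record) + "\n")


# Both (hypothetical) operators are logged against the same seed schedule.
log = io.StringIO()  # a real run would open a .jsonl file instead
for episode, seed in enumerate(procedural_seeds(base_seed=100, episodes=3)):
    for operator in ("rl_operator", "llm_operator"):
        log_episode(log, operator, episode, seed, episode_return=0.0)

records = [json.loads(line) for line in log.getvalue().splitlines()]
```

Because the schedule is a pure function of the base seed and the episode count, rerunning the script yields the same seed for the same episode index, which is what makes per-episode comparisons between operators meaningful.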

Why MOSAIC?

Today’s AI landscape offers powerful but fragmented tools: RL frameworks (CleanRL, RLlib, XuanCe), language models (GPT, Claude), and robotics simulators (MuJoCo). Each excels in isolation, but no platform bridges them together under a unified, visual-first interface.

MOSAIC provides:

  • Visual-First Design: Configure experiments through an intuitive PyQt6 interface; almost no code required.

  • Heterogeneous Agent Mixing: Deploy human, RL, and LLM agents in the same environment.

  • Resource Management & Quotas: GPU allocation, queue limits, credit-based backpressure, health monitoring.

  • Per-Agent Policy Binding: Route each agent to different workers via PolicyMappingService.

  • Worker Lifecycle Orchestration: Subprocess management with heartbeat monitoring and graceful termination.

Human vs Human: Two human players competing via dedicated USB keyboards. See Human Control and Multi-Keyboard Support (Evdev).

Random Agents: Baseline agents across 26 environment families. See MOSAIC Random Worker and Supported Environments.

Heterogeneous Multi-Agent Ad-Hoc Teamwork in Adversarial Settings: Different decision-making paradigms (RL, LLM, Random) competing head-to-head in the same multi-agent environment. See Heterogeneous Decision-Maker.

Homogeneous Teams: Random vs LLM: Two homogeneous teams (all-Random vs all-LLM) competing in the same multi-agent environment. See Homogeneous Decision-Makers.

Agent-Level Interface and Cross-Paradigm Evaluation

Agent-Level Interface. Existing infrastructure lacks the ability to deploy agents from different decision-making paradigms within the same environment. The root cause is an interface mismatch: RL agents expect tensor observations and produce integer actions, while LLM agents expect text prompts and produce text responses. MOSAIC addresses this through an operator abstraction that forms an agent-level interface by mapping workers to agents: each operator, regardless of whether it is backed by an RL policy, an LLM, or a human, conforms to a minimal unified interface (select_action(obs) → action). The environment never needs to know what kind of decision-maker it is communicating with. This is the agent-side counterpart to what Gymnasium did for environments: Gymnasium standardized the environment interface (reset() / step()), so any algorithm can interact with any environment; MOSAIC’s Operator Protocol standardizes the agent interface, so any decision-maker can be plugged into any compatible environment without modifying either side.
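As a rough illustration of the idea, the sketch below defines a `select_action(obs)` interface and two adapters that hide the paradigm-specific encoding behind it. Only the method name `select_action` comes from the text above; the class names, the toy observation dictionary, the stub policy, and the stub language model are assumptions for the example and do not reproduce MOSAIC's actual Operator Protocol. Actions are 0-based here for brevity.

```python
from typing import Callable, Dict, List, Protocol, runtime_checkable

Observation = Dict[str, int]   # toy raw observation shared by all operators
ACTIONS = ["left", "right"]    # shared discrete action space (0-based here)


@runtime_checkable
class Operator(Protocol):
    """Hypothetical agent-level interface: one observation in, one action out."""

    def select_action(self, obs: Observation) -> int: ...


class RLOperator:
    """Encodes the observation as a numeric vector for a policy function."""

    def __init__(self, policy: Callable[[List[float]], int]) -> None:
        self.policy = policy

    def select_action(self, obs: Observation) -> int:
        features = [float(obs["agent_pos"]), float(obs["goal_pos"])]
        return self.policy(features)


class LLMOperator:
    """Encodes the observation as a prompt and parses the text reply."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self.generate = generate

    def select_action(self, obs: Observation) -> int:
        prompt = (f"You are at cell {obs['agent_pos']}; the goal is at cell "
                  f"{obs['goal_pos']}. Reply with 'left' or 'right'.")
        reply = self.generate(prompt).strip().lower()
        # Unparseable replies fall back to action 0 (an arbitrary choice).
        return ACTIONS.index(reply) if reply in ACTIONS else 0


def stub_policy(features: List[float]) -> int:
    """Stand-in for a trained policy: move toward the goal."""
    return 1 if features[1] > features[0] else 0


def stub_llm(prompt: str) -> str:
    """Stand-in for a language model call; always answers 'Right'."""
    return "Right"


operators = [RLOperator(stub_policy), LLMOperator(stub_llm)]
obs = {"agent_pos": 2, "goal_pos": 5}
actions = [op.select_action(obs) for op in operators]
```

The caller iterates over `operators` without branching on their type, which mirrors the claim that the environment side never needs to know which kind of decision-maker it is talking to.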

Cross-Paradigm Evaluation. Cross-paradigm evaluation is the ability to deploy decision-makers from different paradigms (RL, LLM, VLM, Human, scripted baselines) within the same multi-agent environment under identical conditions, and to produce directly comparable results. Both evaluation modes described above (Manual Mode and Script Mode) guarantee that all decision-makers face the same environment states, observations, and shared seeds, making this the first infrastructure to enable fair, reproducible cross-paradigm evaluation.
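A minimal sketch of what "identical conditions" means in practice: each decision-maker gets its own copy of the environment, every copy is reset with the same seed, and all copies are stepped in lock-step. The `LineWorld` environment, the `run_paired` helper, and both decision functions are hypothetical stand-ins written for this example; they are not part of MOSAIC.

```python
import random
from typing import Callable, Dict, List, Tuple

Obs = Dict[str, int]


class LineWorld:
    """Toy 1-D environment; the seed fully determines start and goal cells."""

    def __init__(self, size: int = 8, max_steps: int = 20) -> None:
        self.size, self.max_steps = size, max_steps
        self.pos = self.goal = self.t = 0

    def reset(self, seed: int) -> Obs:
        rng = random.Random(seed)
        self.pos, self.goal, self.t = rng.randrange(self.size), rng.randrange(self.size), 0
        return {"pos": self.pos, "goal": self.goal}

    def step(self, action: int) -> Tuple[Obs, float, bool]:
        move = 1 if action == 1 else -1
        self.pos = max(0, min(self.size - 1, self.pos + move))
        self.t += 1
        reached = self.pos == self.goal
        done = reached or self.t >= self.max_steps
        return {"pos": self.pos, "goal": self.goal}, float(reached), done


def run_paired(deciders: Dict[str, Callable[[Obs], int]],
               seeds: List[int]) -> Dict[str, List[float]]:
    """Evaluate every decision-maker on the same seeds, stepping in lock-step."""
    returns: Dict[str, List[float]] = {name: [] for name in deciders}
    for seed in seeds:
        envs = {name: LineWorld() for name in deciders}
        obs = {name: env.reset(seed) for name, env in envs.items()}
        # Shared seed: every decision-maker starts from the same state.
        assert len({tuple(sorted(o.items())) for o in obs.values()}) == 1
        totals = {name: 0.0 for name in deciders}
        done = {name: False for name in deciders}
        while not all(done.values()):
            for name, decide in deciders.items():
                if not done[name]:
                    obs[name], reward, done[name] = envs[name].step(decide(obs[name]))
                    totals[name] += reward
        for name in deciders:
            returns[name].append(totals[name])
    return returns


_rng = random.Random(0)
results = run_paired(
    {"greedy": lambda o: 1 if o["goal"] > o["pos"] else 0,
     "random": lambda o: _rng.randrange(2)},
    seeds=[0, 1, 2, 3, 4],
)
```

Since both entries in `results` are indexed by the same seed list, the i-th return of one decision-maker can be compared directly with the i-th return of the other.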

See Operator Concept for the full Agent-Level Interface specification, Heterogeneous Decision-Maker for the research gap and design rationale, and IPC Architecture for Manual Mode and Script Mode implementation details.

Comparison with Existing Frameworks

Existing frameworks are paradigm-siloed. No prior framework allowed fair, reproducible, head-to-head comparison between RL agents and LLM agents in the same multi-agent environment.

Agent Paradigms: which decision-maker types are supported. Framework: algorithms can be integrated without source-code modifications. Platform GUI: real-time visualization during execution. Cross-Paradigm: infrastructure for comparing different agent types (e.g., RL vs. LLM) on identical environment instances with shared random seeds for reproducible head-to-head evaluation. Legend: ✓ Supported, ✗ Not supported, ◉ Partial.

System | Agent Paradigms (RL, LLM, VLM, Human) | Infrastructure (Framework, Platform GUI) | Evaluation (Cross-Paradigm)
RL Frameworks
RLlib [1]
CleanRL [2]
Tianshou [3]
Acme [4]
XuanCe [5]
OpenRL [6]
Stable-Baselines3 [7]
Coach [8]
BenchMARL [15]
LLM/VLM Benchmarks
BALROG [9]
TextArena [10]
GameBench [11]
lmgame-Bench [12]
LLM Chess [13]
LLM-Game-Bench [14]
AgentBench [16]
MultiAgentBench [17]
GAMEBoT [18]
Collab-Overcooked [19]
BotzoneBench [20]
AgentGym [21]
Cross-Paradigm Frameworks
Game Reasoning Arena [22]
CREW [23]
LLM-PySC2 [24]
MOSAIC (Ours)


Experimental Configurations

Heterogeneous decision-making enables a systematic ablation matrix for cross-paradigm research. The following configurations illustrate the design using MOSAIC MultiGrid.

Formal Notation

Summary of notation for cross-paradigm multi-agent systems.

| Symbol | Description |
| --- | --- |
| Agent Paradigms | |
| \(\pi^{\text{RL}}_i\) | RL policy trained via reinforcement learning |
| \(\bar{\pi}^{\text{RL}}_i\) | Frozen RL policy (parameters \(\theta_i\) fixed; no further learning) |
| \(\lambda^{\text{LLM}}_j\) | LLM agent (large language model, text-only observations) |
| \(\psi^{\text{VLM}}_k\) | VLM agent (vision-language model, multimodal observations) |
| \(h_m\) | Human operator (interactive GUI control) |
| \(\rho\) | Uniform random baseline policy |
| \(\nu\) | No-op baseline policy (null action at every step) |
| Agent Sets and Cardinalities | |
| \(\Pi^{\text{RL}} = \{\pi^{\text{RL}}_i\}_{i=1}^{n_{\text{RL}}}\) | Set of RL policies with cardinality \(n_{\text{RL}}\) |
| \(\Lambda^{\text{LLM}} = \{\lambda^{\text{LLM}}_j\}_{j=1}^{n_{\text{LLM}}}\) | Set of LLM agents with cardinality \(n_{\text{LLM}}\) |
| \(\Psi^{\text{VLM}} = \{\psi^{\text{VLM}}_k\}_{k=1}^{n_{\text{VLM}}}\) | Set of VLM agents with cardinality \(n_{\text{VLM}}\) |
| \(\mathcal{H} = \{h_m\}_{m=1}^{n_{\text{H}}}\) | Set of human operators with cardinality \(n_{\text{H}}\) |
| \(N = n_{\text{RL}} + n_{\text{LLM}} + n_{\text{VLM}} + n_{\text{H}}\) | Total number of agents in the system |
| Team Partitions | |
| \(\mathcal{T}_A, \mathcal{T}_B\) | Disjoint team partitions: \(\mathcal{T}_A \cap \mathcal{T}_B = \emptyset\), \(\mathcal{T}_A \cup \mathcal{T}_B = \{1,\ldots,N\}\) |
| \(n_A, n_B\) | Team sizes: \(n_A = \lvert\mathcal{T}_A\rvert\), \(n_B = \lvert\mathcal{T}_B\rvert\), \(n_A + n_B = N\) |
| Observation and Action Spaces | |
| \(\mathcal{O}^{\text{RL}} = \mathbb{R}^d\) | RL observation space (continuous tensor) |
| \(\mathcal{O}^{\text{LLM}} = \Sigma^{*}\) | LLM observation space (strings over alphabet \(\Sigma\)) |
| \(\mathcal{O}^{\text{VLM}} = \Sigma^{*} \times \mathbb{R}^{H \times W \times C}\) | VLM observation space (multimodal: text and RGB image) |
| \(\mathcal{O}^{\text{H}} = \mathbb{R}^{H \times W \times C}\) | Human observation space (rendered RGB image) |
| \(\mathcal{A} = \{1,2,\dots,K\}\) | Discrete action space (shared after paradigm-specific parsing) |
| \(\phi: \Sigma^{*} \to \mathcal{A}\) | Deterministic parsing function mapping LLM/VLM text to actions |
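One way to realize a deterministic parsing function \(\phi: \Sigma^{*} \to \mathcal{A}\) is a keyword match over the model's reply. The sketch below is an assumption-laden illustration: the helper name `make_parser`, the action names, the "first mentioned action wins" rule, and the fallback action are all choices made for the example, not the parser MOSAIC ships.

```python
import re
from typing import Callable, Sequence


def make_parser(action_names: Sequence[str], fallback: int) -> Callable[[str], int]:
    """Build phi: free text -> action index in {1, ..., K}."""
    # Action k (1-based, matching A = {1, ..., K}) is keyed by its lowercase name.
    lookup = {name.lower(): index for index, name in enumerate(action_names, start=1)}
    pattern = re.compile(
        "|".join(r"\b" + re.escape(name) + r"\b" for name in lookup)
    )

    def phi(text: str) -> int:
        # The first action name mentioned in the reply wins; no match -> fallback.
        match = pattern.search(text.lower())
        return lookup[match.group(0)] if match else fallback

    return phi


# Hypothetical action vocabulary with K = 4; unparseable replies map to "noop".
phi = make_parser(["left", "right", "forward", "noop"], fallback=4)
```

With this vocabulary, `phi("I will move FORWARD now.")` returns 3 and `phi("I am not sure")` returns the fallback 4. Because the mapping has no randomness, the same reply always yields the same action, which is what the "deterministic" qualifier requires.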

Standard Self-Play vs Cross-Paradigm Transfer


Standard Self-Play and Cross-Paradigm Transfer. (a) Standard Self-Play (Baseline): Agents \(\pi^{\text{RL}}_1\) and \(\pi^{\text{RL}}_2\) are co-trained, learning implicit partner models that overfit to the specific environment. This approach fails the Zero-Shot Coordination (ZSC) challenge because the agents struggle to coordinate with unseen RL partners (who may have learned different features). It also collapses when a partner is swapped across paradigms (e.g., \(\pi^{\text{RL}}\) paired with \(\lambda^{\text{LLM}}\)) due to observation space mismatches (\(\mathcal{O}^{\text{RL}} \neq \mathcal{O}^{\text{LLM}}\)) and violated behavioral expectations. (b) Cross-Paradigm Transfer (MOSAIC): Agent \(\pi^{\text{RL}}\) is trained solo (\(N=1\), zero partner expectations), then deployed in multi-agent teams alongside heterogeneous partners such as LLM agents \(\lambda^{\text{LLM}}\), human players \(h\), or random baselines. By eliminating co-training dependencies, agents can cooperate across paradigm boundaries using a unified action interface.

Comparison: Standard Self-Play vs Cross-Paradigm Transfer

| Aspect | Standard Self-Play (Baseline) | Cross-Paradigm Transfer (MOSAIC) |
| --- | --- | --- |
| Training | Co-training via self-play (\(N \geq 2\)) | Solo training (\(N=1\)) |
| Partner Model | Implicit partner model (overfitted to training partner) | Zero partner expectations |
| Generalization (RL) | Fails with unseen RL partners (ZSC failure) | Generalizes to unseen solo-trained RL partners |
| Generalization (Cross-Paradigm) | Fails when swapping RL ↔ LLM (interface mismatch) | Succeeds across paradigm boundaries |
| Deployment | Requires same-paradigm, familiar partners | Supports RL, LLM, human, scripted agents |

Adversarial Cross‑Paradigm Matchups

The first set of configurations establishes single-paradigm baselines before introducing cross-paradigm matchups to measure relative performance. Let \(\mathcal{T}_A\) and \(\mathcal{T}_B\) denote disjoint team partitions with \(|\mathcal{T}_A| = n_A\) and \(|\mathcal{T}_B| = n_B\). For each team \(\mathcal{T}_k\) (\(k \in \{A,B\}\)), we define its paradigm composition as \((\Pi^{\text{RL}}_k, \Lambda^{\text{LLM}}_k, \Psi^{\text{VLM}}_k, \mathcal{H}_k)\) where \(|\Pi^{\text{RL}}_k| + |\Lambda^{\text{LLM}}_k| + |\Psi^{\text{VLM}}_k| + |\mathcal{H}_k| = n_k\).

Adversarial configurations for \(N=4\) agents with \(n_A = n_B = 2\)

| Config | Team A Composition | Team B Composition | Purpose |
| --- | --- | --- | --- |
| A1 | \(\lvert\Pi^{\text{RL}}_A\rvert = 2\) | \(\lvert\Pi^{\text{RL}}_B\rvert = 2\) | Homogeneous RL baseline |
| A2 | \(\lvert\Lambda^{\text{LLM}}_A\rvert = 2\) | \(\lvert\Lambda^{\text{LLM}}_B\rvert = 2\) | Homogeneous LLM baseline |
| A3 | \(\lvert\Psi^{\text{VLM}}_A\rvert = 2\) | \(\lvert\Psi^{\text{VLM}}_B\rvert = 2\) | Homogeneous VLM baseline |
| A4 | \(\lvert\Pi^{\text{RL}}_A\rvert = 2\) | \(\lvert\Lambda^{\text{LLM}}_B\rvert = 2\) | Cross-paradigm (RL vs LLM) |
| A5 | \(\lvert\Pi^{\text{RL}}_A\rvert = 2\) | \(\lvert\Psi^{\text{VLM}}_B\rvert = 2\) | Cross-paradigm (RL vs VLM) |
| A6 | \(\lvert\Lambda^{\text{LLM}}_A\rvert = 2\) | \(\lvert\Psi^{\text{VLM}}_B\rvert = 2\) | Cross-paradigm (LLM vs VLM) |
| A7 | \(\lvert\Pi^{\text{RL}}_A\rvert = 2\) | \(\rho\) baseline (\(n_B = 2\)) | Sanity check (trained vs random) |

Configurations A1-A3 measure the performance ceiling for homogeneous teams within each paradigm: RL policies trained via MARL, LLM agents reasoning via text-based decision-making, and VLM agents processing multimodal observations. Configurations A4-A6 address the central cross-paradigm research questions: under identical environmental conditions and shared random seeds, does a team of RL policies outperform teams of LLM or VLM agents, and how do LLM and VLM agents compare head-to-head? A7 serves as a sanity check, confirming that trained agents significantly outperform uniform-random baseline policies.
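The adversarial matrix can be written down as data and checked against the team-size constraint \(n_A = n_B = 2\). The sketch below is illustrative only: the `Team` dataclass, its field names, and the separate `random` count for \(\rho\) baseline members are assumptions introduced for this example.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class Team:
    """Per-paradigm member counts for one team (hypothetical encoding)."""
    rl: int = 0
    llm: int = 0
    vlm: int = 0
    human: int = 0
    random: int = 0  # assumption: rho baseline members are counted separately

    @property
    def size(self) -> int:
        return sum(asdict(self).values())

    def paradigms(self) -> Set[str]:
        return {name for name, count in asdict(self).items() if count > 0}


# Configurations A1-A7 as (Team A, Team B) pairs.
ADVERSARIAL: Dict[str, Tuple[Team, Team]] = {
    "A1": (Team(rl=2), Team(rl=2)),
    "A2": (Team(llm=2), Team(llm=2)),
    "A3": (Team(vlm=2), Team(vlm=2)),
    "A4": (Team(rl=2), Team(llm=2)),
    "A5": (Team(rl=2), Team(vlm=2)),
    "A6": (Team(llm=2), Team(vlm=2)),
    "A7": (Team(rl=2), Team(random=2)),
}


def validate(configs: Dict[str, Tuple[Team, Team]], team_size: int) -> int:
    """Check n_A = n_B = team_size for every config and return N = n_A + n_B."""
    for name, (team_a, team_b) in configs.items():
        if team_a.size != team_size or team_b.size != team_size:
            raise ValueError(f"{name}: expected {team_size} agents per team")
    return 2 * team_size


total_agents = validate(ADVERSARIAL, team_size=2)
mirror_matches = [name for name, (a, b) in ADVERSARIAL.items() if a == b]
```

Here `mirror_matches` picks out the configurations where both teams have the same composition, which are exactly the homogeneous baselines A1 to A3.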

Cooperative Heterogeneous Teams

The second set of configurations examines intra-team heterogeneity by mixing paradigms within a team. These configurations test whether LLM or VLM agents (\(\lambda^{\text{LLM}}\) or \(\psi^{\text{VLM}}\)) can effectively cooperate with a frozen RL policy \(\bar{\pi}^{\text{RL}}\) that was trained without any partner model.

Cooperative configurations for \(N=4\) agents with \(n_A = n_B = 2\)

| Config | Team A Composition | Team B Composition | Research Question |
| --- | --- | --- | --- |
| C1 | \(\lvert\Pi^{\text{RL}}_A\rvert = 1\), \(\lvert\Lambda^{\text{LLM}}_A\rvert = 1\) | \(\lvert\Pi^{\text{RL}}_B\rvert = 1\), \(\rho\) baseline | Does \(\lambda^{\text{LLM}}\) outperform \(\rho\) as teammate? |
| C2 | \(\lvert\Pi^{\text{RL}}_A\rvert = 1\), \(\lvert\Lambda^{\text{LLM}}_A\rvert = 1\) | \(\lvert\Pi^{\text{RL}}_B\rvert = 1\), \(\nu\) baseline | Does \(\lambda^{\text{LLM}}\) actively contribute? |
| C3 | \(\lvert\Pi^{\text{RL}}_A\rvert = 1\), \(\lvert\Psi^{\text{VLM}}_A\rvert = 1\) | \(\lvert\Pi^{\text{RL}}_B\rvert = 1\), \(\rho\) baseline | Does \(\psi^{\text{VLM}}\) outperform \(\rho\) as teammate? |
| C4 | \(\lvert\Pi^{\text{RL}}_A\rvert = 1\), \(\lvert\Psi^{\text{VLM}}_A\rvert = 1\) | \(\lvert\Pi^{\text{RL}}_B\rvert = 1\), \(\nu\) baseline | Does \(\psi^{\text{VLM}}\) actively contribute? |
| C5 | \(\lvert\Pi^{\text{RL}}_A\rvert = 2\) | \(\lvert\Pi^{\text{RL}}_B\rvert = 2\) | Solo-pair baseline (no co-training) |
| C6 | \(\lvert\Pi^{\text{RL}}_A\rvert = 1\), \(\lvert\Lambda^{\text{LLM}}_A\rvert = 1\) | \(\lvert\Pi^{\text{RL}}_B\rvert = 2\) (co-trained) | Can zero-shot LLM teaming match co-training? |
| C7 | \(\lvert\Pi^{\text{RL}}_A\rvert = 1\), \(\lvert\Psi^{\text{VLM}}_A\rvert = 1\) | \(\lvert\Pi^{\text{RL}}_B\rvert = 2\) (co-trained) | Can zero-shot VLM teaming match co-training? |
| C8 | \(\lvert\Pi^{\text{RL}}_A\rvert = 1\), \(\lvert\Lambda^{\text{LLM}}_A\rvert = 1\) | \(\lvert\Pi^{\text{RL}}_B\rvert = 1\), \(\lvert\Psi^{\text{VLM}}_B\rvert = 1\) | LLM vs VLM as heterogeneous teammates |

Except for the co-trained Team B pairs in C6-C7, all RL policies are trained solo (\(N=1\)) and frozen before deployment; LLM/VLM agents are zero-shot. Configurations C1-C2 and C3-C4 test whether LLM and VLM agents can serve as effective teammates for frozen RL policies. C5 serves as the fair comparison baseline: two independently trained solo experts paired at evaluation time. C6-C7 compare zero-shot cross-paradigm teaming against co-trained RL teams. C8 directly compares LLM and VLM agents as teammates within heterogeneous teams.

Solo‑to‑Team Transfer Design – Why Solo Training?

RL agents are trained as solo experts in single-agent environments (\(N=1\)), then deployed as teammates in multi-agent settings without any fine‑tuning. This design eliminates the co-training confound and avoids the failure modes of standard self-play.

In standard self-play, agents develop implicit partner models calibrated against other RL agents sharing the same observation space (\(\mathcal{O}^{\text{RL}} = \mathbb{R}^d\)). This creates two failure modes: (1) ZSC Failure: The agent overfits to its training partner’s conventions, failing to coordinate with unseen RL agents. (2) Cross-Paradigm Failure: As shown in the figure’s “Swap Attempt” panel, replacing an RL partner with an LLM agent causes a breakdown due to observation space mismatches (\(\mathcal{O}^{\text{RL}} \neq \mathcal{O}^{\text{LLM}}\)).

By training agents in isolation (\(N=1\)), the RL policy carries zero partner expectations. This cleanly isolates the paradigm variable as the sole experimental factor, allowing true cross-paradigm coordination where the challenge is not just an unknown policy, but a fundamentally different way of perceiving and acting in the world.
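The deployment step can be pictured as follows: a solo-trained policy is frozen, placed on a team next to a partner from another paradigm, and the team's joint action is assembled through the shared discrete action space. This is a simplified sketch under stated assumptions: `FrozenPolicy`, `joint_action`, the linear scoring rule, the agent identifiers, and the text-based stub partner are all hypothetical and chosen only to make the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class FrozenPolicy:
    """Solo-trained policy whose parameters can no longer be reassigned."""
    weights: Tuple[Tuple[float, ...], ...]  # one weight row per action

    def select_action(self, obs: Tuple[float, ...]) -> int:
        # Linear scoring: pick the action whose weight row best matches obs.
        scores = [sum(w * x for w, x in zip(row, obs)) for row in self.weights]
        return scores.index(max(scores))


def joint_action(team: Dict[str, Callable[[object], int]],
                 observations: Dict[str, object]) -> Dict[str, int]:
    """Each agent sees its own paradigm-specific observation; actions share one space."""
    return {agent: decide(observations[agent]) for agent, decide in team.items()}


def stub_text_partner(text: object) -> int:
    """Stand-in for an LLM teammate that reads a text observation."""
    return 0 if "left" in str(text) else 1


policy = FrozenPolicy(weights=((1.0, 0.0), (0.0, 1.0)))
team = {"agent_0": policy.select_action, "agent_1": stub_text_partner}
observations = {"agent_0": (0.2, 0.9),                     # vector observation
                "agent_1": "The ball is to your left."}    # text observation
actions = joint_action(team, observations)
```

The RL policy never receives information about its partner, so swapping `stub_text_partner` for a human or a random baseline changes nothing on the RL side; only the joint outcome differs.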

For full mathematical details and further configurations, see the companion paper.

Supported Environment Families

MOSAIC supports 26 environment families spanning single-agent, multi-agent, and cooperative/competitive paradigms. See the full Environment Families reference for installation instructions, environment lists, and academic citations.

| Family | Description | Example Environments |
| --- | --- | --- |
| Gymnasium | Standard single-agent RL (Toy Text, Classic Control, Box2D, MuJoCo) | _images/cartpole.gif |
| Atari / ALE | 128 classic Atari 2600 games | _images/atari.gif |
| MiniGrid | Procedural grid-world navigation | _images/minigrid.gif |
| BabyAI | Language-grounded instruction following | _images/GoTo.gif |
| ViZDoom | Doom-based first-person visual RL | _images/vizdoom.gif |
| MiniHack / NetHack | Roguelike dungeon crawling (NLE) | _images/minihack.gif |
| Crafter | Open-world survival benchmark | _images/crafter.gif |
| Procgen | 16 procedurally generated environments | _images/coinrun.gif |
| BabaIsAI | Rule-manipulation puzzles | _images/babaisai.png |
| Jumanji | JAX-accelerated logic/routing/packing (25 envs) | _images/jumanji.gif |
| PyBullet Drones | Quadcopter physics simulation | _images/pybullet_drones.gif |
| PettingZoo Classic | Turn-based board games (AEC) | _images/pettingzoo.gif |
| MOSAIC MultiGrid | Competitive team sports (view_size=3) | _images/mosaic_multigrid.gif |
| INI MultiGrid | Cooperative exploration (view_size=7) | _images/multigrid_ini.gif |
| Melting Pot | Social multi-agent scenarios (up to 16 agents) | _images/meltingpot.gif |
| Overcooked | Cooperative cooking (2 agents) | _images/overcooked_layouts.gif |
| SMAC | StarCraft Multi-Agent Challenge (hand-designed maps) | _images/smac.gif |
| SMACv2 | StarCraft Multi-Agent Challenge v2 (procedural units) | _images/smacv2.png |
| RWARE | Cooperative warehouse delivery | _images/rware.gif |
| MuJoCo | Continuous-control robotics tasks | _images/ant.gif |

Supported Workers (8)

  • CleanRL: Single-file RL implementations (PPO, DQN, SAC, TD3, DDPG, C51)

  • XuanCe: Modular RL framework with flexible algorithm composition, custom environments, and multi-agent algorithms (MAPPO, QMIX, MADDPG, VDN, COMA)

  • Ray RLlib: RL with distributed training and large-batch optimization (PPO, IMPALA, APPO)

  • BALROG: LLM/VLM agentic evaluation (models: GPT-4o, Claude 3, Gemini; environments: NetHack, BabyAI, Crafter)

  • MOSAIC LLM: Multi-agent LLM with coordination strategies and Theory of Mind (MultiGrid, BabyAI, MeltingPot, PettingZoo)

  • Chess LLM: LLM chess play with multi-turn dialog (PettingZoo Chess)

  • MOSAIC Human Worker: Human-in-the-loop play via keyboard for any Gymnasium-compatible environment (MiniGrid, Crafter, Chess, NetHack)

  • MOSAIC Random Worker: Baseline agents with random, no-op, and cycling action behaviours across all 26 environment families

Citing MOSAIC

If you use MOSAIC in your research, please cite the following paper:

@article{mousa2026mosaic,
  author  = {Abdulhamid M. Mousa and Yu Fu and Rakhmonberdi Khajiev and Jalaledin M. Azzabi and Abdulkarim M. Mousa and Peng Yang and Ming Liu},
  title   = {{MOSAIC}: A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent {RL}, {LLM}, {VLM}, and Human Decision-Makers},
  year    = {2026},
  url     = {https://github.com/Abdulhamid97Mousa/MOSAIC}
}

References

[1] E. Liang et al., "RLlib: Abstractions for Distributed Reinforcement Learning," ICML, 2018.
[2] S. Huang et al., "CleanRL: High-quality Single-file Implementations of Deep RL Algorithms," JMLR, 2022.
[3] J. Weng et al., "Tianshou: A Highly Modularized Deep RL Library," JMLR, 2022.
[4] M. Hoffman et al., "Acme: A Research Framework for Distributed RL," arXiv:2006.00979, 2020.
[5] W. Liu et al., "XuanCe: A Comprehensive and Unified Deep RL Library," arXiv:2312.16248, 2023.
[6] S. Huang et al., "OpenRL: A Unified Reinforcement Learning Framework," arXiv:2312.16189, 2023.
[7] A. Raffin et al., "Stable-Baselines3: Reliable RL Implementations," JMLR, 2021.
[8] I. Caspi et al., "Reinforcement Learning Coach," 2017.
[9] D. Paglieri et al., "BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games," arXiv:2411.13543, 2024.
[10] G. De Magistris et al., "TextArena," 2025.
[11] D. Costarelli et al., "GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents," arXiv:2406.06613, 2024.
[12] Y. Huang et al., "lmgame-Bench: Evaluating LLMs on Game-Theoretic Decision-Making," 2025.
[13] M. Saplin, "LLM Chess," 2025.
[14] J. Guo et al., "LLM-Game-Bench: Evaluating LLM Reasoning through Game-Playing," 2024.
[15] M. Bettini et al., "BenchMARL: Benchmarking Multi-Agent Reinforcement Learning," JMLR, 2024. arXiv:2312.01472.
[16] X. Liu et al., "AgentBench: Evaluating LLMs as Agents," ICLR, 2024. arXiv:2308.03688.
[17] K. Zhu et al., "MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents," ACL, 2025. arXiv:2503.01935.
[18] Y. Lin et al., "GAMEBoT: Transparent Assessment of LLM Reasoning in Games," ACL, 2025. arXiv:2412.13602.
[19] H. Sun et al., "Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents," EMNLP, 2025. arXiv:2502.20073.
[20] L. Li et al., "BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors," arXiv:2602.13214, 2026.
[21] Z. Xi et al., "AgentGym: Evolving Large Language Model-based Agents across Diverse Environments," ACL, 2025. arXiv:2406.04151.
[22] Cipolina et al., "Game Reasoning Arena: A Comprehensive Evaluation Framework for Large Language Models," arXiv:2501.00363, 2025.
[23] Y. Wang et al., "CREW: A Benchmark for Collaborative Multi-Step Reasoning and Planning," NeurIPS, 2024.
[24] X. Ma et al., "LLM-PySC2: A Benchmark for Large Language Models in StarCraft II," arXiv:2412.19668, 2024.