Heterogeneous Decision-Maker¶

Heterogeneous Multi-Agent Ad-Hoc Teamwork: Different decision-making paradigms (RL, LLM, Human, Random) competing head-to-head in the same multi-agent environment. See Multi-Keyboard Support and IPC Architecture.

A heterogeneous setup is one where agents in the same experiment use different decision-making paradigms. For example, an RL-trained policy and an LLM can play side by side as teammates, or an RL agent can compete against an LLM agent.

This cross-paradigm capability is MOSAIC’s key innovation and what distinguishes it from the RL-only and LLM-only frameworks compared below.

The Research Gap¶

Existing frameworks are paradigm-siloed:

Framework	RL	LLM	Cross-Paradigm
RLlib, CleanRL, XuanCe	Yes	No	No
BALROG, AgentBench	No	Yes	No
TextArena	No	Yes (vs Human)	No
MOSAIC	Yes	Yes	Yes

No prior framework allowed fair, reproducible, head-to-head comparison between RL agents and LLM agents in the same multi-agent environment. The root cause is an interface mismatch. RL agents expect tensor observations and produce integer actions, while LLM agents expect text prompts and produce text responses.
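To make the mismatch concrete, here is a minimal sketch of the two native interfaces and the adapters needed to put a text model behind an RL-style call. The action names, the observation encoding, and the `fake_llm` stand-in are assumptions made for illustration; none of them are MOSAIC APIs.

```python
from typing import Callable, List, Sequence

# Hypothetical action set, for illustration only.
ACTIONS: List[str] = ["left", "right", "forward", "pickup"]


def rl_policy(obs: Sequence[float]) -> int:
    """RL-style interface: numeric observation in, integer action out."""
    # Stand-in for a neural network: pick the index of the largest feature.
    return max(range(len(obs)), key=lambda i: obs[i]) % len(ACTIONS)


def obs_to_prompt(obs: Sequence[float]) -> str:
    """Render a numeric observation as text so a language model can read it."""
    features = ", ".join(f"f{i}={v:.1f}" for i, v in enumerate(obs))
    return f"Observation: {features}. Choose one of: {', '.join(ACTIONS)}."


def parse_action(reply: str, default: int = 0) -> int:
    """Map a free-form text reply back to an integer action."""
    lowered = reply.lower()
    for index, name in enumerate(ACTIONS):
        if name in lowered:
            return index
    return default  # fall back when the reply names no valid action


def llm_as_policy(llm: Callable[[str], str]) -> Callable[[Sequence[float]], int]:
    """Wrap a text-in/text-out model so it matches the RL-style signature."""
    return lambda obs: parse_action(llm(obs_to_prompt(obs)))


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real LLM call."""
    return "I will move forward."


obs = [0.1, 0.9, 0.3]
rl_action = rl_policy(obs)                  # integer from the numeric path
llm_action = llm_as_policy(fake_llm)(obs)   # integer from the text path
```

Once both paths return the same action type, the environment no longer needs to know which paradigm produced it.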

The Gymnasium Analogy¶

Gymnasium (Towers et al., 2024) standardized the environment interface: every environment implements reset() and step(), so any algorithm can interact with any environment without modification.

No equivalent standardization existed for the agent side. MOSAIC’s Operator Protocol fills this gap:

        %%{init: {"flowchart": {"curve": "linear"}} }%%
graph LR
    subgraph "Gymnasium (Environments)"
        E1["MultiGrid Soccer"]
        E2["MiniGrid"]
        E3["Chess"]
    end

    EPROTO["reset() / step()<br/>Unified Env Interface"]

    subgraph "MOSAIC (Agents)"
        A1["RL Policy"]
        A2["LLM Agent"]
        A3["Human Player"]
    end

    APROTO["select_action(obs)<br/>Unified Agent Interface"]

    E1 --> EPROTO
    E2 --> EPROTO
    E3 --> EPROTO
    A1 --> APROTO
    A2 --> APROTO
    A3 --> APROTO

    style EPROTO fill:#4a90d9,stroke:#2e5a87,color:#fff
    style APROTO fill:#50c878,stroke:#2e8b57,color:#fff
    style E1 fill:#ddd,stroke:#999,color:#333
    style E2 fill:#ddd,stroke:#999,color:#333
    style E3 fill:#ddd,stroke:#999,color:#333
    style A1 fill:#ddd,stroke:#999,color:#333
    style A2 fill:#ddd,stroke:#999,color:#333
    style A3 fill:#ddd,stroke:#999,color:#333

Just as Gymnasium made environments interchangeable, the Operator Protocol makes agents interchangeable. Any decision-maker can be plugged into any compatible environment without modifying either side.
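The interchangeability claim can be illustrated with a small sketch. The `Agent` protocol, the two agents, and the `LineWorld` environment below are hypothetical stand-ins written for this example; only the `reset()`/`step()` shape follows Gymnasium, and the real Operator Protocol may differ in its details.

```python
import random
from typing import Any, Dict, Optional, Protocol, Tuple


class Agent(Protocol):
    """Hypothetical agent-side contract: one method, any decision-maker."""

    def select_action(self, obs: Any) -> int: ...


class RandomAgent:
    """Baseline that samples uniformly from the action space."""

    def __init__(self, n_actions: int, seed: int) -> None:
        self._rng = random.Random(seed)
        self._n_actions = n_actions

    def select_action(self, obs: Any) -> int:
        return self._rng.randrange(self._n_actions)


class AlwaysRightAgent:
    """Scripted stand-in for a trained policy."""

    def select_action(self, obs: Any) -> int:
        return 1  # action 1 means "move right" in LineWorld


class LineWorld:
    """Toy environment with a Gymnasium-style reset()/step() signature."""

    def __init__(self, goal: int = 3, max_steps: int = 20) -> None:
        self.goal = goal
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self, seed: Optional[int] = None) -> Tuple[int, Dict[str, Any]]:
        self.pos, self.steps = 0, 0  # deterministic start, so the seed is unused
        return self.pos, {}

    def step(self, action: int) -> Tuple[int, float, bool, bool, Dict[str, Any]]:
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        self.steps += 1
        terminated = self.pos == self.goal
        truncated = not terminated and self.steps >= self.max_steps
        return self.pos, (1.0 if terminated else 0.0), terminated, truncated, {}


def run_episode(env: LineWorld, agent: Agent, seed: int) -> float:
    """The loop never inspects what kind of agent it is driving."""
    obs, _ = env.reset(seed=seed)
    total = 0.0
    while True:
        obs, reward, terminated, truncated, _ = env.step(agent.select_action(obs))
        total += reward
        if terminated or truncated:
            return total


scripted_return = run_episode(LineWorld(), AlwaysRightAgent(), seed=0)
random_return = run_episode(LineWorld(), RandomAgent(n_actions=2, seed=0), seed=0)
```

Swapping `AlwaysRightAgent` for `RandomAgent` changes nothing in `run_episode`, which is the property the analogy is pointing at.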

How Heterogeneous Teams Work¶

The WorkerAssignment system maps each agent slot in a multi-agent environment to a specific worker subprocess. A single OperatorConfig can freely mix RL, LLM, human, and baseline workers across agent slots:

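# Heterogeneous team: RL + LLM in 2v2 soccer.
# OperatorConfig, WorkerAssignment, and LinkGroup are MOSAIC classes; import
# them from wherever your MOSAIC installation exposes them.
# Note: All RL agents use one-to-many policy mapping via link groups
# (all agents share the same MAPPO checkpoint with agent-specific weights)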
config = OperatorConfig.multi_agent(
    operator_id="heterogeneous_team",
    display_name="RL + LLM Heterogeneous vs RL + Random",
    env_name="mosaic_multigrid",
    task="MosaicMultiGrid-Soccer-2vs2-IndAgObs-v0",
    player_workers={
        # Green team: heterogeneous (RL + LLM)
        "agent_0": WorkerAssignment(
            worker_id="xuance_worker",
            worker_type="rl",
            settings={
                # This checkpoint was trained in an adversarial setting:
                # green agent_0 against blue agent_1.
                "algorithm": "mappo",
                "policy_path": "/path/to/mappo_agent_0_green_VS_agent_1_blue_checkpoint.pth",
            },
        ),
        "agent_1": WorkerAssignment(
            worker_id="mosaic_llm_worker",
            worker_type="llm",
            settings={
                "client_name": "openrouter",
                "model_id": "gpt-4o",
                "temperature": 0,
                "coordination_level": 2,
                "observation_mode": "visible_teammates",
            },
        ),
        # Blue team: RL + Random
        "agent_2": WorkerAssignment(
            worker_id="xuance_worker",
            worker_type="rl",
            settings={
                # The same checkpoint can be linked and deployed to agent_2 or
                # agent_3, which keeps the observation dimension unchanged.
                "algorithm": "mappo",
                "policy_path": "/path/to/mappo_agent_0_green_VS_agent_1_blue_checkpoint.pth",
            },
        ),
        "agent_3": WorkerAssignment(
            worker_id="random_worker",
            worker_type="random",
        ),
    },
    # Link groups for one-to-many policy mapping
    # (agents 0 and 2 share the same MAPPO checkpoint)
    link_groups={
        "operator_0_link_0": LinkGroup(
            group_id="operator_0_link_0",
            primary_agent="agent_0",
            linked_agents=["agent_2"],
            policy_path="/path/to/mappo_agent_0_green_VS_agent_1_blue_checkpoint.pth",
            algorithm="mappo",
            worker_type="rl",
        ),
    },
)

This creates four agent slots, each backed by a different subprocess:

        %%{init: {"flowchart": {"curve": "linear"}} }%%
graph TB
    ENV["MultiGrid Soccer 2v2<br/>(PettingZoo AEC)"]

    subgraph "Green Team (Heterogeneous)"
        G0["green_0 (agent_0): RL<br/>xuance_worker<br/>MAPPO checkpoint"]
        G1["green_1 (agent_1): LLM<br/>mosaic_llm_worker<br/>GPT-4o"]
    end

    subgraph "Blue Team (Crippled)"
        B0["blue_0 (agent_2): RL<br/>xuance_worker<br/>MAPPO checkpoint"]
        B1["blue_1 (agent_3): Random<br/>random_worker<br/>Random actions"]
    end

    ENV -- "obs" --> G0
    ENV -- "obs" --> G1
    ENV -- "obs" --> B0
    ENV -- "obs" --> B1
    G0 -- "action" --> ENV
    G1 -- "action" --> ENV
    B0 -- "action" --> ENV
    B1 -- "action" --> ENV

    style ENV fill:#4a90d9,stroke:#2e5a87,color:#fff
    style G0 fill:#50c878,stroke:#2e8b57,color:#fff
    style G1 fill:#50c878,stroke:#2e8b57,color:#fff
    style B0 fill:#ff7f50,stroke:#cc5500,color:#fff
    style B1 fill:#ff7f50,stroke:#cc5500,color:#fff

All four agents receive observations from the same environment, with the same seed, on the same timestep, yet the slots are driven by three different decision-making mechanisms: a MAPPO policy (shared by agent_0 and agent_2), an LLM, and a random baseline. The environment only sees select_action(obs) -> action, regardless of what runs inside.

Multi-Worker Pattern¶

The heterogeneous setup uses the multi-worker pattern: one Operator wraps N Worker subprocesses via the OperatorController protocol:

from typing import Any, Dict, Protocol


class OperatorController(Protocol):
    """Multi-agent extension of the Operator Protocol."""

    def select_action(
        self, agent_id: str, observation: Any, info: Any = None,
    ) -> Any:
        """AEC mode: one agent acts at a time."""
        ...

    def select_actions(
        self, observations: Dict[str, Any],
    ) -> Dict[str, Any]:
        """Parallel mode: all agents act simultaneously."""
        ...
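As a rough illustration of how a class might satisfy this protocol, the sketch below routes each agent ID to its own callable. `DictController` and the lambda policies are hypothetical and run in-process; in MOSAIC each entry would instead forward to a worker subprocess.

```python
from typing import Any, Callable, Dict

Policy = Callable[[Any], Any]


class DictController:
    """Hypothetical controller that maps each agent_id to its own policy."""

    def __init__(self, policies: Dict[str, Policy]) -> None:
        self._policies = dict(policies)

    def select_action(self, agent_id: str, observation: Any, info: Any = None) -> Any:
        # AEC mode: look up the one agent whose turn it is.
        return self._policies[agent_id](observation)

    def select_actions(self, observations: Dict[str, Any]) -> Dict[str, Any]:
        # Parallel mode: ask every agent that received an observation.
        return {
            agent_id: self.select_action(agent_id, obs)
            for agent_id, obs in observations.items()
        }


controller = DictController(
    {
        "agent_0": lambda obs: 2,              # stand-in for an RL policy
        "agent_1": lambda obs: len(obs) % 3,   # stand-in for an LLM-backed policy
    }
)
single = controller.select_action("agent_0", [0, 1, 0])
joint = controller.select_actions({"agent_0": [0, 1, 0], "agent_1": [1, 1, 1, 1]})
```

Both methods share one lookup path, so AEC and parallel environments can reuse the same per-agent assignments.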

Each worker runs as a separate OS process, communicating via JSONL-over-stdout. This process isolation means:

A crashed worker never takes down the GUI or other workers
Each worker can use different Python dependencies, GPU allocations, or even different Python versions
Integration effort is minimal and non-invasive to upstream libraries

Worker	LOC Added	Modifications to Original Library
CleanRL DQN (~300 LOC)	~50 LOC (harness)	Zero
BALROG Agent (~500 LOC)	~80 LOC (runtime.py)	Zero
XuanCe MAPPO (~2000 LOC)	~120 LOC (wrapper)	Zero
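The process-isolation pattern described above can be sketched with the standard library alone. The message fields (`agent_id`, `observation`, `action`), the `JsonlWorker` class, and the use of stdin for requests are assumptions for illustration; MOSAIC's actual wire format is not shown in this section.

```python
import json
import subprocess
import sys

# Source of the child process: one JSON object per line in, one per line out.
WORKER_SOURCE = r'''
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    # Hypothetical policy: always answer with action 0.
    reply = {"agent_id": request["agent_id"], "action": 0}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()  # flush per message so the parent is never left waiting
'''


class JsonlWorker:
    """Parent-side handle for a worker that speaks JSONL over its pipes."""

    def __init__(self) -> None:
        self._proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def request(self, message: dict) -> dict:
        self._proc.stdin.write(json.dumps(message) + "\n")
        self._proc.stdin.flush()
        return json.loads(self._proc.stdout.readline())

    def close(self) -> int:
        self._proc.stdin.close()  # EOF ends the child's read loop
        return self._proc.wait(timeout=5)


worker = JsonlWorker()
reply = worker.request({"agent_id": "agent_0", "observation": [0, 1, 0]})
exit_code = worker.close()
```

Because the child is a separate interpreter, it could be launched from a different virtual environment, and an exception inside it surfaces in the parent as a closed pipe, not as a crash.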

Experimental Configurations¶

Heterogeneous decision-making enables a systematic ablation matrix for cross-paradigm research. Here are examples using 2v2 soccer:

Adversarial Cross-Paradigm¶

Testing how paradigms perform against each other:

Configuration	Team A	Team B	Purpose
RL vs RL	MAPPO + MAPPO	MAPPO + MAPPO	Homogeneous RL baseline
LLM vs LLM	GPT-4o + GPT-4o	GPT-4o + GPT-4o	Homogeneous LLM baseline
RL vs LLM	MAPPO + MAPPO	GPT-4o + GPT-4o	Cross-paradigm matchup
RL vs Random	MAPPO + MAPPO	Random + Random	Sanity check
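One way to generate such a matrix systematically is to enumerate team pairings programmatically. The sketch below is illustrative only: the dictionary layout is an assumption, not a MOSAIC config format, and it produces every unordered pairing of homogeneous teams, of which the table above lists four.

```python
from itertools import combinations_with_replacement
from typing import Dict, List

PARADIGMS: List[str] = ["MAPPO", "GPT-4o", "Random"]


def homogeneous_matchups(paradigms: List[str]) -> List[Dict[str, List[str]]]:
    """Enumerate unordered Team A vs Team B pairings of two-agent homogeneous teams."""
    return [
        {"team_a": [a, a], "team_b": [b, b]}
        for a, b in combinations_with_replacement(paradigms, 2)
    ]


matchups = homogeneous_matchups(PARADIGMS)
for matchup in matchups:
    print(" + ".join(matchup["team_a"]), "vs", " + ".join(matchup["team_b"]))
```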

Cooperative Heterogeneous Teams¶

Testing how paradigms work together as teammates:

Configuration	Green Team	Blue Team
Heterogeneous vs Crippled	RL + LLM	RL + Random
Heterogeneous vs Solo	RL + LLM	RL + NoOp
Solo-pair vs Solo-pair	RL + RL	RL + RL
Heterogeneous vs Co-trained	RL + LLM	RL(2v2) + RL(2v2)

Important

The 1v1-to-2v2 transfer design is critical: RL agents are trained as solo experts (1v1), then deployed as teammates alongside an LLM in 2v2. This eliminates the co-training confound – the RL agent has no partner expectations because it never had a partner.

Deterministic Cross-Paradigm Evaluation¶

Shared seed schedules are distributed to all operators via OperatorService.seed(), and full trajectories are logged under unified telemetry. Because every operator sees the same seeds and writes to the same log format, results are directly comparable across paradigms, which the paradigm-siloed frameworks above do not support.

        %%{init: {"flowchart": {"curve": "linear"}} }%%
graph LR
    SEEDS["Seed Schedule<br/>[42, 43, 44, ..., 141]"]

    SEEDS --> RL["RL Operator<br/>same 100 seeds"]
    SEEDS --> LLM["LLM Operator<br/>same 100 seeds"]
    SEEDS --> HYB["Heterogeneous Team<br/>same 100 seeds"]
    SEEDS --> BASE["Baseline<br/>same 100 seeds"]

    RL --> TEL["Unified Telemetry<br/>JSONL logs"]
    LLM --> TEL
    HYB --> TEL
    BASE --> TEL

    style SEEDS fill:#4a90d9,stroke:#2e5a87,color:#fff
    style RL fill:#9370db,stroke:#6a0dad,color:#fff
    style LLM fill:#9370db,stroke:#6a0dad,color:#fff
    style HYB fill:#50c878,stroke:#2e8b57,color:#fff
    style BASE fill:#ddd,stroke:#999,color:#333
    style TEL fill:#ff7f50,stroke:#cc5500,color:#fff

The MultiOperatorService manages N operators in parallel, each with its own environment instance but sharing the same seed:

class MultiOperatorService:
    def add_operator(self, config: OperatorConfig) -> None: ...
    def remove_operator(self, operator_id: str) -> None: ...
    def get_active_operators(self) -> list[OperatorConfig]: ...
    def start_all(self) -> None: ...
    def stop_all(self) -> None: ...
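The shared-seed idea can be sketched independently of these classes. In the example below, `initial_state`, `evaluate`, the record fields, and the two toy policies are hypothetical; the point is only that every operator is evaluated on identical seeds and writes to one JSONL stream.

```python
import io
import json
import random
from typing import Callable, Dict, Iterable, List, TextIO

SEEDS: List[int] = list(range(42, 142))  # 42..141 inclusive, i.e. 100 seeds


def initial_state(seed: int) -> int:
    """Stand-in for env.reset(seed=...): the same seed yields the same state."""
    return random.Random(seed).randrange(1000)


def evaluate(
    operators: Dict[str, Callable[[int], int]],
    seeds: Iterable[int],
    log: TextIO,
) -> None:
    """Run every operator on every seed and append one JSON record per run."""
    for seed in seeds:
        for operator_id, policy in operators.items():
            state = initial_state(seed)  # each operator gets its own fresh reset
            record = {
                "operator_id": operator_id,
                "seed": seed,
                "state": state,
                "action": policy(state),
            }
            log.write(json.dumps(record) + "\n")


log = io.StringIO()
evaluate({"rl": lambda s: s % 4, "baseline": lambda s: 0}, SEEDS, log)
records = [json.loads(line) for line in log.getvalue().splitlines()]
```

Grouping `records` by `seed` then gives a like-for-like comparison, since every operator started each seed from the same state.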

Heterogeneous Multi-Agent Ad-Hoc Teamwork in Adversarial Settings: Different decision-making paradigms (RL, LLM, Random) competing head-to-head in the same multi-agent environment.

GUI Integration¶

The heterogeneous decision-maker is configured through the OperatorsTab in the GUI, which provides two execution modes:

Manual Mode – step-by-step execution where the user clicks “Step All” or “Step Player” to advance each timestep. Useful for debugging and observing agent behavior:

        %%{init: {"flowchart": {"curve": "linear"}} }%%
graph LR
    USER["User"]
    OT["OperatorsTab<br/>Manual Mode"]
    OCW["OperatorConfigWidget<br/>(up to 8 operators)"]
    ORC["OperatorRenderContainer<br/>(per-operator viewport)"]

    USER -- "configure" --> OCW
    USER -- "Step All" --> OT
    OT -- "step signal" --> ORC

    style USER fill:#eee,stroke:#999,color:#333
    style OT fill:#4a90d9,stroke:#2e5a87,color:#fff
    style OCW fill:#4a90d9,stroke:#2e5a87,color:#fff
    style ORC fill:#4a90d9,stroke:#2e5a87,color:#fff

Script Mode – automated batch experiments that run N episodes across M seed values without user interaction. Managed by OperatorScriptExecutionManager:

        %%{init: {"flowchart": {"curve": "linear"}} }%%
graph LR
    SCRIPT["ScriptExperimentWidget<br/>define seed range + episodes"]
    MGR["OperatorScriptExecutionManager<br/>state machine"]
    OPS["MultiOperatorService<br/>N operators in parallel"]
    TEL["Unified Telemetry<br/>JSONL logs per operator"]

    SCRIPT --> MGR
    MGR --> OPS
    OPS --> TEL

    style SCRIPT fill:#4a90d9,stroke:#2e5a87,color:#fff
    style MGR fill:#50c878,stroke:#2e8b57,color:#fff
    style OPS fill:#ff7f50,stroke:#cc5500,color:#fff
    style TEL fill:#ff7f50,stroke:#cc5500,color:#fff
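A batch runner of this kind is often modelled as a small state machine. The sketch below is a guess at the general shape, not the actual OperatorScriptExecutionManager: the state names, the `ScriptRun` class, and the reading of "N episodes across M seeds" as N episodes per seed are all assumptions.

```python
from enum import Enum, auto
from typing import List, Tuple


class State(Enum):
    IDLE = auto()
    RUNNING = auto()
    DONE = auto()


class ScriptRun:
    """Hypothetical batch runner: walks a queue of (seed, episode) jobs."""

    def __init__(self, seeds: List[int], episodes_per_seed: int) -> None:
        self.state = State.IDLE
        self._queue: List[Tuple[int, int]] = [
            (seed, episode) for seed in seeds for episode in range(episodes_per_seed)
        ]
        self.completed: List[Tuple[int, int]] = []

    def start(self) -> None:
        if self.state is not State.IDLE:
            raise RuntimeError("start() is only valid from IDLE")
        self.state = State.RUNNING if self._queue else State.DONE

    def advance(self) -> None:
        """Run the next job (a placeholder here) and finish when the queue is empty."""
        if self.state is not State.RUNNING:
            raise RuntimeError("advance() is only valid while RUNNING")
        self.completed.append(self._queue.pop(0))
        if not self._queue:
            self.state = State.DONE


run = ScriptRun(seeds=[42, 43], episodes_per_seed=2)
run.start()
while run.state is State.RUNNING:
    run.advance()
```

Keeping the transitions explicit makes it straightforward to reject invalid requests, such as starting a run that is already in progress.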

Each operator gets its own OperatorRenderContainer with a color-coded type badge (LLM=blue, RL=purple, Human=orange) and live observation rendering, making it easy to visually compare how different paradigms behave on the same environment state.