Heterogeneous Decision-Maker

Heterogeneous Multi-Agent Ad-Hoc Teamwork: Different decision-making paradigms (RL, LLM, Human, Random) competing head-to-head in the same multi-agent environment. See Multi-Keyboard Support and IPC Architecture.

A heterogeneous setup is one where agents in the same experiment use different decision-making paradigms. For example, an RL-trained policy and an LLM playing side-by-side as teammates, or an RL agent competing against an LLM agent.

This is MOSAIC’s key innovation and what distinguishes it from every other RL or LLM framework.

The Research Gap

Existing frameworks are paradigm-siloed:

Framework

RL

LLM

Cross-Paradigm

RLlib, CleanRL, XuanCe

Yes

No

No

BALROG, AgentBench

No

Yes

No

TextArena

No

Yes (vs Human)

No

MOSAIC

Yes

Yes

Yes

No prior framework allowed fair, reproducible, head-to-head comparison between RL agents and LLM agents in the same multi-agent environment. The root cause is an interface mismatch. RL agents expect tensor observations and produce integer actions, while LLM agents expect text prompts and produce text responses.

The Gymnasium Analogy

Gymnasium (Towers et al., 2024) standardized the environment interface: every environment implements reset() and step(), so any algorithm can interact with any environment without modification.

No equivalent standardization existed for the agent side. MOSAIC’s Operator Protocol fills this gap:

        %%{init: {"flowchart": {"curve": "linear"}} }%%
graph LR
    subgraph "Gymnasium (Environments)"
        E1["MultiGrid Soccer"]
        E2["MiniGrid"]
        E3["Chess"]
    end

    EPROTO["reset() / step()<br/>Unified Env Interface"]

    subgraph "MOSAIC (Agents)"
        A1["RL Policy"]
        A2["LLM Agent"]
        A3["Human Player"]
    end

    APROTO["select_action(obs)<br/>Unified Agent Interface"]

    E1 --> EPROTO
    E2 --> EPROTO
    E3 --> EPROTO
    A1 --> APROTO
    A2 --> APROTO
    A3 --> APROTO

    style EPROTO fill:#4a90d9,stroke:#2e5a87,color:#fff
    style APROTO fill:#50c878,stroke:#2e8b57,color:#fff
    style E1 fill:#ddd,stroke:#999,color:#333
    style E2 fill:#ddd,stroke:#999,color:#333
    style E3 fill:#ddd,stroke:#999,color:#333
    style A1 fill:#ddd,stroke:#999,color:#333
    style A2 fill:#ddd,stroke:#999,color:#333
    style A3 fill:#ddd,stroke:#999,color:#333
    

Just as Gymnasium made environments interchangeable, the Operator Protocol makes agents interchangeable. Any decision-maker can be plugged into any compatible environment without modifying either side.

How Heterogeneous Teams Work

The WorkerAssignment system maps each agent slot in a multi-agent environment to a specific worker subprocess. A single OperatorConfig can freely mix RL, LLM, human, and baseline workers across agent slots:

# Heterogeneous team: RL + LLM in 2v2 soccer
# Note: All RL agents use one-to-many policy mapping via link groups
# (all agents share the same MAPPO checkpoint with agent-specific weights)
config = OperatorConfig.multi_agent(
    operator_id="heterogeneous_team",
    display_name="RL + LLM Heterogeneous vs RL + Random",
    env_name="mosaic_multigrid",
    task="MosaicMultiGrid-Soccer-2vs2-IndAgObs-v0",
    player_workers={
        # Green team: heterogeneous (RL + LLM)
        "agent_0": WorkerAssignment(
            worker_id="xuance_worker",
            worker_type="rl",
            settings={
                "algorithm": "mappo", # this green agent_0 was trained with this blue agent_1 in adversial setting
                "policy_path": "/path/to/mappo_agent_0_green_VS_agent_1_blue_checkpoint.pth",
            },
        ),
        "agent_1": WorkerAssignment(
            worker_id="mosaic_llm_worker",
            worker_type="llm",
            settings={
                "client_name": "openrouter",
                "model_id": "gpt-4o",
                "temperature": 0,
                "coordination_level": 2,
                "observation_mode": "visible_teammates",
            },
        ),
        # Blue team: RL + Random
        "agent_2": WorkerAssignment(
            worker_id="xuance_worker",
            worker_type="rl",
            settings={
                "algorithm": "mappo", # But then we can decide to link and deploy agent_0 with agent_3 or agent_2
                # to keep the same observation dimension
                "policy_path": "/path/to/mappo_agent_0_green_VS_agent_1_blue_checkpoint.pth",
            },
        ),
        "agent_3": WorkerAssignment(
            worker_id="random_worker",
            worker_type="random",
        ),
    },
    # Link groups for one-to-many policy mapping
    # (agents 0 and 2 share the same MAPPO checkpoint)
    link_groups={
        "operator_0_link_0": LinkGroup(
            group_id="operator_0_link_0",
            primary_agent="agent_0",
            linked_agents=["agent_2"],
            policy_path="/path/to/mappo_agent_0_green_VS_agent_1_blue_checkpoint.pth",
            algorithm="mappo",
            worker_type="rl",
        ),
    },
)

This creates four agent slots, each backed by a different subprocess:

        %%{init: {"flowchart": {"curve": "linear"}} }%%
graph TB
    ENV["MultiGrid Soccer 2v2<br/>(PettingZoo AEC)"]

    subgraph "Green Team (Heterogeneous)"
        G0["green_0: RL<br/>cleanrl_worker<br/>MAPPO checkpoint"]
        G1["green_1: LLM<br/>mosaic_llm_worker<br/>GPT-4o"]
    end

    subgraph "Blue Team (Crippled)"
        B0["blue_0: RL<br/>cleanrl_worker<br/>MAPPO checkpoint"]
        B1["blue_1: Random<br/>random_worker<br/>Random actions"]
    end

    ENV -- "obs" --> G0
    ENV -- "obs" --> G1
    ENV -- "obs" --> B0
    ENV -- "obs" --> B1
    G0 -- "action" --> ENV
    G1 -- "action" --> ENV
    B0 -- "action" --> ENV
    B1 -- "action" --> ENV

    style ENV fill:#4a90d9,stroke:#2e5a87,color:#fff
    style G0 fill:#50c878,stroke:#2e8b57,color:#fff
    style G1 fill:#50c878,stroke:#2e8b57,color:#fff
    style B0 fill:#ff7f50,stroke:#cc5500,color:#fff
    style B1 fill:#ff7f50,stroke:#cc5500,color:#fff
    

All four agents receive observations from the same environment, with the same seed, on the same timestep – yet each uses a completely different decision-making mechanism. The environment only sees select_action(obs) -> action, regardless of what runs inside.

Multi-Worker Pattern

The heterogeneous setup uses the multi-worker pattern: one Operator wraps N Worker subprocesses via the OperatorController protocol:

class OperatorController(Protocol):
    """Multi-agent extension of the Operator Protocol."""

    def select_action(
        self, agent_id: str, observation: Any, info: Any = None,
    ) -> Any:
        """AEC mode: one agent acts at a time."""
        ...

    def select_actions(
        self, observations: Dict[str, Any],
    ) -> Dict[str, Any]:
        """Parallel mode: all agents act simultaneously."""
        ...

Each worker runs as a separate OS process, communicating via JSONL-over-stdout. This process isolation means:

  • A crashed worker never takes down the GUI or other workers

  • Each worker can use different Python dependencies, GPU allocations, or even different Python versions

  • Integration effort is minimal and non-invasive to upstream libraries

Worker

LOC Added

Modifications to Original Library

CleanRL DQN (~300 LOC)

~50 LOC (harness)

Zero

BALROG Agent (~500 LOC)

~80 LOC (runtime.py)

Zero

XuanCe MAPPO (~2000 LOC)

~120 LOC (wrapper)

Zero

Experimental Configurations

Heterogeneous decision-making enables a systematic ablation matrix for cross-paradigm research. Here are examples using 2v2 soccer:

Adversarial Cross-Paradigm

Testing how paradigms perform against each other:

Configuration

Team A

Team B

Purpose

RL vs RL

MAPPO + MAPPO

MAPPO + MAPPO

Homogeneous RL baseline

LLM vs LLM

GPT-4o + GPT-4o

GPT-4o + GPT-4o

Homogeneous LLM baseline

RL vs LLM

MAPPO + MAPPO

GPT-4o + GPT-4o

Cross-paradigm matchup

RL vs Random

MAPPO + MAPPO

Random + Random

Sanity check

Cooperative Heterogeneous Teams

Testing how paradigms work together as teammates:

Configuration

Green Team

Blue Team

Heterogeneous vs Crippled

RL + LLM

RL + Random

Heterogeneous vs Solo

RL + LLM

RL + NoOp

Solo-pair vs Solo-pair

RL + RL

RL + RL

Heterogeneous vs Co-trained

RL + LLM

RL(2v2) + RL(2v2)

Important

The 1v1-to-2v2 transfer design is critical: RL agents are trained as solo experts (1v1), then deployed as teammates alongside an LLM in 2v2. This eliminates the co-training confound – the RL agent has no partner expectations because it never had a partner.

Deterministic Cross-Paradigm Evaluation

Shared seed schedules are distributed to all operators via OperatorService.seed(). Full trajectories are logged under unified telemetry. This produces directly comparable results across paradigms – the first time this has been possible.

        %%{init: {"flowchart": {"curve": "linear"}} }%%
graph LR
    SEEDS["Seed Schedule<br/>[42, 43, 44, ..., 141]"]

    SEEDS --> RL["RL Operator<br/>same 100 seeds"]
    SEEDS --> LLM["LLM Operator<br/>same 100 seeds"]
    SEEDS --> HYB["Heterogeneous Team<br/>same 100 seeds"]
    SEEDS --> BASE["Baseline<br/>same 100 seeds"]

    RL --> TEL["Unified Telemetry<br/>JSONL logs"]
    LLM --> TEL
    HYB --> TEL
    BASE --> TEL

    style SEEDS fill:#4a90d9,stroke:#2e5a87,color:#fff
    style RL fill:#9370db,stroke:#6a0dad,color:#fff
    style LLM fill:#9370db,stroke:#6a0dad,color:#fff
    style HYB fill:#50c878,stroke:#2e8b57,color:#fff
    style BASE fill:#ddd,stroke:#999,color:#333
    style TEL fill:#ff7f50,stroke:#cc5500,color:#fff
    

The MultiOperatorService manages N operators in parallel, each with its own environment instance but sharing the same seed:

class MultiOperatorService:
    def add_operator(self, config: OperatorConfig) -> None: ...
    def remove_operator(self, operator_id: str) -> None: ...
    def get_active_operators(self) -> list[OperatorConfig]: ...
    def start_all(self) -> None: ...
    def stop_all(self) -> None: ...

Heterogeneous Multi-Agent Ad-Hoc Teamwork in Adversarial Settings: Different decision-making paradigms (RL, LLM, Random) competing head-to-head in the same multi-agent environment.

GUI Integration

The heterogeneous decision-maker is configured through the OperatorsTab in the GUI, which provides two execution modes:

Manual Mode – step-by-step execution where the user clicks “Step All” or “Step Player” to advance each timestep. Useful for debugging and observing agent behavior:

        %%{init: {"flowchart": {"curve": "linear"}} }%%
graph LR
    USER["User"]
    OT["OperatorsTab<br/>Manual Mode"]
    OCW["OperatorConfigWidget<br/>(up to 8 operators)"]
    ORC["OperatorRenderContainer<br/>(per-operator viewport)"]

    USER -- "configure" --> OCW
    USER -- "Step All" --> OT
    OT -- "step signal" --> ORC

    style USER fill:#eee,stroke:#999,color:#333
    style OT fill:#4a90d9,stroke:#2e5a87,color:#fff
    style OCW fill:#4a90d9,stroke:#2e5a87,color:#fff
    style ORC fill:#4a90d9,stroke:#2e5a87,color:#fff
    

Script Mode – automated batch experiments that run N episodes across M seed values without user interaction. Managed by OperatorScriptExecutionManager:

        %%{init: {"flowchart": {"curve": "linear"}} }%%
graph LR
    SCRIPT["ScriptExperimentWidget<br/>define seed range + episodes"]
    MGR["ScriptExecutionManager<br/>state machine"]
    OPS["MultiOperatorService<br/>N operators in parallel"]
    TEL["Unified Telemetry<br/>JSONL logs per operator"]

    SCRIPT --> MGR
    MGR --> OPS
    OPS --> TEL

    style SCRIPT fill:#4a90d9,stroke:#2e5a87,color:#fff
    style MGR fill:#50c878,stroke:#2e8b57,color:#fff
    style OPS fill:#ff7f50,stroke:#cc5500,color:#fff
    style TEL fill:#ff7f50,stroke:#cc5500,color:#fff
    

Each operator gets its own OperatorRenderContainer with a color-coded type badge (LLM=blue, RL=purple, Human=orange) and live observation rendering, making it easy to visually compare how different paradigms behave on the same environment state.